Mirror of https://github.com/kevinveenbirkenbach/duplicate-file-handler.git (synced 2025-09-10 12:17:17 +02:00)

Compare commits: c2566a355d...main (24 commits)
.github/FUNDING.yml (vendored, new file, 7 lines)
@@ -0,0 +1,7 @@
+github: kevinveenbirkenbach
+
+patreon: kevinveenbirkenbach
+
+buy_me_a_coffee: kevinveenbirkenbach
+
+custom: https://s.veen.world/paypaldonate
.gitignore (vendored, 3 changed lines)
@@ -1,2 +1 @@
-test_dir1
-test_dir2
+test_dir*
README.md (96 changed lines)
@@ -1,38 +1,96 @@
-# Duplicate File Handler
-
-This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
-
-The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
-
-## Author
-
-**Kevin Veen-Birkenbach**
-- Email: kevin@veen.world
-- Website: [https://www.veen.world](https://www.veen.world)
-
-This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
-
-## Setup
-
-These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
-
-## Usage
-
-### 1. List Duplicate Files
-
-`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
-
-```bash
-./list_duplicates.sh /path/to/directory
-```
-
-### 2. Delete Duplicate Files
-
-`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
-
-```bash
-./delete_duplicates.sh /path/to/directory
-```
-
-## License
-
-This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
+# Duplicate File Handler (dufiha) 🔍
+
+[](https://github.com/sponsors/kevinveenbirkenbach) [](https://www.patreon.com/c/kevinveenbirkenbach) [](https://buymeacoffee.com/kevinveenbirkenbach) [](https://s.veen.world/paypaldonate)
+[](./LICENSE) [](https://github.com/kevinveenbirkenbach/duplicate-file-handler/stargazers)
+
+Duplicate File Handler is a Python CLI tool for identifying and handling duplicate files within one or more directories based on their MD5 hashes. With flexible file-type filtering and multiple action modes, you can efficiently delete duplicates or replace them with hard or symbolic links.
+
+---
+
+## 🛠 Features
+
+- **Duplicate Detection:** Computes MD5 hashes for files to find duplicates.
+- **File Type Filtering:** Process only files with a specified extension.
+- **Multiple Modification Options:** Choose to delete duplicates, replace them with hard links, or create symbolic links.
+- **Flexible Modes:** Operate in preview, interactive, or active mode to suit your workflow.
+- **Parallel Processing:** Utilizes process pooling for efficient scanning of large directories.
+
+---
+
+## 📥 Installation
+
+Install Duplicate File Handler via [Kevin's Package Manager](https://github.com/kevinveenbirkenbach/package-manager) under the alias `dufiha`:
+
+```bash
+package-manager install dufiha
+```
+
+This command installs the tool globally, making it available as `dufiha` in your terminal. 🚀
+
+---
+
+## 🚀 Usage
+
+Run Duplicate File Handler by specifying one or more directories to scan for duplicates:
+
+```bash
+dufiha [options] directory1 directory2 ...
+```
+
+### Options
+
+- **`--apply-to`**: Directories to which modifications should be applied.
+- **`--modification`**: Action to perform on duplicates:
+  - `delete` – Delete duplicate files.
+  - `hardlink` – Replace duplicates with hard links.
+  - `symlink` – Replace duplicates with symbolic links.
+  - `show` – Only display duplicate files (default).
+- **`--mode`**: How to apply modifications:
+  - `act` – Execute changes immediately.
+  - `preview` – Preview changes without making any modifications.
+  - `interactive` – Ask for confirmation before processing each duplicate.
+- **`-f, --file-type`**: Filter by file type (e.g., `.txt` for text files).
+
+### Example Commands
+
+- **Preview duplicate `.txt` files in two directories:**
+
+  ```bash
+  dufiha --file-type .txt --mode preview test_dir1 test_dir2
+  ```
+
+- **Interactively delete duplicates in a specific directory:**
+
+  ```bash
+  dufiha --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
+  ```
+
+- **Show duplicates without modifying any files:**
+
+  ```bash
+  dufiha --modification show test_dir1
+  ```
+
+---
+
+## 🧑‍💻 Author
+
+Developed by **Kevin Veen-Birkenbach**
+- 📧 [kevin@veen.world](mailto:kevin@veen.world)
+- 🌐 [https://www.veen.world](https://www.veen.world)
+
+This project was enhanced with assistance from [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
+
+---
+
+## 📜 License
+
+This project is licensed under the **GNU Affero General Public License, Version 3, 19 November 2007**.
+See the [LICENSE](./LICENSE) file for details.
+
+---
+
+## 🤝 Contributions
+
+Contributions are welcome! Please feel free to fork the repository, submit pull requests, or open issues to help improve Duplicate File Handler. Let’s make file management smarter and more efficient! 😊
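The old README's caveat still applies to the new tool: MD5 matching is common but not foolproof, since distinct files can in principle share a digest. A minimal sketch of a byte-level confirmation pass (not part of this commit; `confirm_duplicates` is a hypothetical helper) that could vet an MD5-matched group before anything destructive runs:

```python
import filecmp

def confirm_duplicates(paths):
    """Keep only paths whose contents are byte-identical to the first path.

    Assumes `paths` is one group of files that already share an MD5 digest,
    e.g. one value from find_duplicates(). filecmp.cmp(..., shallow=False)
    compares actual file contents, not just os.stat() metadata.
    """
    reference = paths[0]
    confirmed = [reference]
    for candidate in paths[1:]:
        if filecmp.cmp(reference, candidate, shallow=False):
            confirmed.append(candidate)
    return confirmed
```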
create_test_file_structure.py (new file, 41 lines)
@@ -0,0 +1,41 @@
+import os
+import shutil
+import hashlib
+import random
+import string
+
+def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
+    os.makedirs(base_dir, exist_ok=True)
+    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
+    for subdir in subdirs:
+        os.makedirs(subdir, exist_ok=True)
+
+    for dir in [base_dir] + subdirs:
+        file_names = [f"file_{i}.txt" for i in range(num_files)]
+        for file_name in file_names:
+            with open(os.path.join(dir, file_name), 'w') as f:
+                content = ''.join(random.choices(string.ascii_lowercase, k=20))
+                f.write(content)
+
+        for i in range(min(duplicate_files, num_files)):
+            original = os.path.join(dir, file_names[i])
+            for dup_num in range(1, duplicate_files+1):
+                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
+                shutil.copyfile(original, duplicate)
+
+def copy_directory_contents(src, dst):
+    if os.path.exists(dst):
+        shutil.rmtree(dst)
+    shutil.copytree(src, dst)
+
+def create_file_structure(depth, num_files, duplicate_files):
+    base_dirs = ['test_dir1', 'test_dir2']
+    for base_dir in base_dirs:
+        create_test_directory(base_dir, num_files, duplicate_files, depth)
+
+    copy_directory_contents('test_dir1', 'test_dir3')
+
+    print("Test file structure created.")
+
+if __name__ == "__main__":
+    create_file_structure(depth=2, num_files=5, duplicate_files=3)
(removed: previous version of the test file structure script, 46 lines)
@@ -1,46 +0,0 @@
-import os
-import shutil
-import hashlib
-import random
-import string
-
-def create_test_directory(base_dir, num_files=5, duplicate_files=2):
-    if not os.path.exists(base_dir):
-        os.makedirs(base_dir)
-
-    # Create a list of unique file names
-    file_names = [f"file_{i}.txt" for i in range(num_files)]
-
-    # Create some files with random content
-    for file_name in file_names:
-        with open(os.path.join(base_dir, file_name), 'w') as f:
-            content = ''.join(random.choices(string.ascii_lowercase, k=20))
-            f.write(content)
-
-    # Create duplicates
-    for i in range(duplicate_files):
-        original = os.path.join(base_dir, file_names[i])
-        duplicate = os.path.join(base_dir, f"dup_{file_names[i]}")
-        shutil.copyfile(original, duplicate)
-
-def create_file_structure():
-    # Create the base directories
-    base_dirs = ['test_dir1', 'test_dir2']
-    for base_dir in base_dirs:
-        create_test_directory(base_dir)
-
-    # Create a file in the first directory and duplicate it in the second
-    with open(os.path.join('test_dir1', 'unique_file.txt'), 'w') as f:
-        f.write("This is a unique file.")
-
-    shutil.copyfile(os.path.join('test_dir1', 'unique_file.txt'),
-                    os.path.join('test_dir2', 'unique_file.txt'))
-
-    # Create an additional unique file in the second directory
-    with open(os.path.join('test_dir2', 'another_unique_file.txt'), 'w') as f:
-        f.write("This is another unique file.")
-
-    print("Test file structure created.")
-
-if __name__ == "__main__":
-    create_file_structure()
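Given the fixture layout above, a quick sanity check (a sketch, not part of this commit) confirms the generated duplicates after running `python create_test_file_structure.py`; the names `file_0.txt` and `dup_1_file_0.txt` follow from the naming scheme in `create_test_directory`, and `test_dir3` is the verbatim copy made by `copy_directory_contents`:

```python
import hashlib
import os

def md5_of(path):
    # Hash the whole file at once; fine for the tiny 20-byte fixtures.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# dup_1_file_0.txt is a copy of file_0.txt made by create_test_directory(),
# and test_dir3 mirrors test_dir1 byte for byte.
original = os.path.join('test_dir1', 'file_0.txt')
duplicate = os.path.join('test_dir1', 'dup_1_file_0.txt')
mirrored = os.path.join('test_dir3', 'file_0.txt')

assert md5_of(original) == md5_of(duplicate) == md5_of(mirrored)
print("Fixture duplicates confirmed.")
```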
delete_duplicates.sh (removed, 35 lines)
@@ -1,35 +0,0 @@
-#!/bin/bash
-
-if [ -z "$1" ]
-then
-    echo "Directory path not provided"
-    exit 1
-fi
-
-dir="$1"
-duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
-
-echo "Duplicates found:"
-
-echo "$duplicates" | while read line
-do
-    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
-    for file in ${files[@]}
-    do
-        echo "File: $file"
-        echo "Duplicate(s) of this file:"
-        for duplicate in ${files[@]}
-        do
-            if [ $duplicate != $file ]
-            then
-                echo $duplicate
-            fi
-        done
-        echo "Do you want to delete this file? [y/N]"
-        read answer
-        if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
-        then
-            rm -i "$file"
-        fi
-    done
-done
list_duplicates.sh (removed, 30 lines)
@@ -1,30 +0,0 @@
-#!/bin/bash
-
-if [ -z "$1" ]
-then
-    echo "Directory path not provided"
-    exit 1
-fi
-
-dir="$1"
-duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
-
-if [ -z "$duplicates" ]
-then
-    echo "No duplicates found."
-    exit 0
-fi
-
-echo "Duplicates found:"
-
-echo "$duplicates" | while read line
-do
-    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
-    file_type=$(file -b --mime-type "${files[0]}")
-    if [[ $file_type == text/* ]]
-    then
-        diff "${files[@]}"
-    else
-        echo "$files"
-    fi
-done
main.py (103 changed lines)
@@ -1,7 +1,9 @@
 import os
 import argparse
 import hashlib
-from collections import defaultdict
+from collections import defaultdict, Counter
+from concurrent.futures import ProcessPoolExecutor
+from tqdm import tqdm
 
 def md5sum(filename):
     hash_md5 = hashlib.md5()
@@ -10,65 +12,86 @@ def md5sum(filename):
             hash_md5.update(chunk)
     return hash_md5.hexdigest()
 
-def find_duplicates(directories):
-    hashes = defaultdict(list)
-    for directory in directories:
-        for root, dirs, files in os.walk(directory):
-            for filename in files:
-                path = os.path.join(root, filename)
-                file_hash = md5sum(path)
-                hashes[file_hash].append(path)
+def file_hashing_job(path):
+    file_hash = md5sum(path)
+    if file_hash:
+        return file_hash, path
+
+def find_duplicates(directories, file_type):
+    with ProcessPoolExecutor() as executor:
+        futures = []
+        for directory in directories:
+            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
+                for filename in files:
+                    if file_type and not filename.endswith(file_type):
+                        continue
+                    path = os.path.join(root, filename)
+                    if not os.path.islink(path):
+                        futures.append(executor.submit(file_hashing_job, path))
+
+        hashes = defaultdict(list)
+        for future in tqdm(futures, desc="Processing files", unit="file"):
+            result = future.result()
+            if result:
+                file_hash, path = result
+                hashes[file_hash].append(path)
 
     return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}
 
+def handle_file_modification(original_file, duplicate_file, modification):
+    if modification == 'delete':
+        print(f"Deleting {duplicate_file}")
+        os.remove(duplicate_file)
+    elif modification == 'hardlink':
+        os.remove(duplicate_file)
+        os.link(original_file, duplicate_file)
+        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
+    elif modification == 'symlink':
+        os.remove(duplicate_file)
+        os.symlink(original_file, duplicate_file)
+        print(f"Replaced {duplicate_file} with a symlink to {original_file}")
+
 def handle_modification(files, modification, mode, apply_to):
-    if mode == 'preview':
-        if modification == 'show':
-            print("Would show the following duplicate files:")
-            for file in files:
-                if file.startswith(tuple(apply_to)):
-                    print(file)
-    elif mode == 'act':
-        if modification == 'delete':
-            for file in files:
-                if file.startswith(tuple(apply_to)):
-                    print(f"Deleting {file}")
-                    os.remove(file)
-        elif modification == 'hardlink':
-            # Implement hardlink logic here
-            pass
-        elif modification == 'symlink':
-            # Implement symlink logic here
-            pass
-    elif mode == 'interactive':
-        for file in files:
-            if file.startswith(tuple(apply_to)):
-                answer = input(f"Do you want to {modification} this file? {file} [y/N] ")
-                if answer.lower() in ['y', 'yes']:
-                    # Implement deletion, hardlink or symlink logic here
-                    pass
+    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
+    for duplicate_file in files:
+        if duplicate_file != original_file:
+            if duplicate_file.startswith(tuple(apply_to)):
+                if mode == 'preview' and modification != 'show':
+                    print(f"Would perform {modification} on {duplicate_file}")
+                elif mode == 'act':
+                    handle_file_modification(original_file, duplicate_file, modification)
+                elif mode == 'interactive':
+                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
+                    if answer.lower() in ['y', 'yes']:
+                        handle_file_modification(original_file, duplicate_file, modification)
+                else:
+                    print(f"Duplicate file (unmodified): {duplicate_file}")
+            elif modification != 'show':
+                print(f"Original file kept: {original_file}")
+    print()
 
 def main(args):
     directories = args.directories
     apply_to = args.apply_to or directories
-    duplicates = find_duplicates(directories)
+    duplicates = find_duplicates(directories, args.file_type)
 
     if not duplicates:
         print("No duplicates found.")
         return
 
     for file_hash, files in duplicates.items():
-        if args.mode == 'preview' or (args.mode == 'interactive' and args.modification == 'show'):
-            print(f"Duplicate files for hash {file_hash}:")
-            [print(file) for file in files if file.startswith(tuple(apply_to))]
-        else:
-            handle_modification(files, args.modification, args.mode, apply_to)
+        print(f"Duplicate files for hash {file_hash}:")
+        [print(file) for file in files if file.startswith(tuple(directories))]
+        handle_modification(files, args.modification, args.mode, apply_to)
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
     parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
-    parser.add_argument('--apply-to', nargs='*', help="Directories to apply modifications to.")
+    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
     parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
     parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
+    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)
 
     args = parser.parse_args()
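The new `hardlink` and `symlink` branches in `handle_file_modification` can be spot-checked after an `--mode act` run. A small standard-library sketch (not part of this commit):

```python
import os

def describe_link(original, duplicate):
    """Report how `duplicate` relates to `original` after dufiha has run.

    os.path.islink() is true only for symbolic links; os.path.samefile()
    is true for hard links (same device and inode). Checking islink first
    matters, because samefile() follows symlinks.
    """
    if os.path.islink(duplicate):
        print(f"{duplicate} -> symlink to {os.readlink(duplicate)}")
    elif os.path.samefile(original, duplicate):
        print(f"{duplicate} is a hardlink of {original} "
              f"(inode {os.stat(duplicate).st_ino})")
    else:
        print(f"{duplicate} is an independent file")
```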