Mirror of https://github.com/kevinveenbirkenbach/duplicate-file-handler.git (synced 2025-09-09 19:57:12 +02:00)

Commit 54bc794b50 on branch `main` (28 commits)
.github/FUNDING.yml (vendored, new file)

```diff
@@ -0,0 +1,7 @@
+github: kevinveenbirkenbach
+
+patreon: kevinveenbirkenbach
+
+buy_me_a_coffee: kevinveenbirkenbach
+
+custom: https://s.veen.world/paypaldonate
```
.gitignore (vendored, new file)

```diff
@@ -0,0 +1 @@
+test_dir*
```
README.md (rewritten: 38 lines removed, 96 added)

````diff
@@ -1,38 +1,96 @@
-# Duplicate File Handler
-
-This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
-
-The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
-
-## Author
-
-**Kevin Veen-Birkenbach**
-- Email: kevin@veen.world
-- Website: [https://www.veen.world](https://www.veen.world)
-
-This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
-
-## Setup
-
-These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
-
-## Usage
-
-### 1. List Duplicate Files
-
-`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
-
-```bash
-./list_duplicates.sh /path/to/directory
-```
-
-### 2. Delete Duplicate Files
-
-`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
-
-```bash
-./delete_duplicates.sh /path/to/directory
-```
-
-## License
-
-This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
+# Duplicate File Handler (dufiha) 🔍
+
+[](https://github.com/sponsors/kevinveenbirkenbach) [](https://www.patreon.com/c/kevinveenbirkenbach) [](https://buymeacoffee.com/kevinveenbirkenbach) [](https://s.veen.world/paypaldonate)
+[](./LICENSE) [](https://github.com/kevinveenbirkenbach/duplicate-file-handler/stargazers)
+
+Duplicate File Handler is a Python CLI tool for identifying and handling duplicate files within one or more directories based on their MD5 hashes. With flexible file-type filtering and multiple action modes, you can efficiently delete duplicates or replace them with hard or symbolic links.
+
+---
+
+## 🛠 Features
+
+- **Duplicate Detection:** Computes MD5 hashes for files to find duplicates.
+- **File Type Filtering:** Process only files with a specified extension.
+- **Multiple Modification Options:** Choose to delete duplicates, replace them with hard links, or create symbolic links.
+- **Flexible Modes:** Operate in preview, interactive, or active mode to suit your workflow.
+- **Parallel Processing:** Utilizes process pooling for efficient scanning of large directories.
+
+---
+
+## 📥 Installation
+
+Install Duplicate File Handler via [Kevin's Package Manager](https://github.com/kevinveenbirkenbach/package-manager) under the alias `dufiha`:
+
+```bash
+package-manager install dufiha
+```
+
+This command installs the tool globally, making it available as `dufiha` in your terminal. 🚀
+
+---
+
+## 🚀 Usage
+
+Run Duplicate File Handler by specifying one or more directories to scan for duplicates:
+
+```bash
+dufiha [options] directory1 directory2 ...
+```
+
+### Options
+
+- **`--apply-to`**: Directories to which modifications should be applied.
+- **`--modification`**: Action to perform on duplicates:
+  - `delete` – Delete duplicate files.
+  - `hardlink` – Replace duplicates with hard links.
+  - `symlink` – Replace duplicates with symbolic links.
+  - `show` – Only display duplicate files (default).
+- **`--mode`**: How to apply modifications:
+  - `act` – Execute changes immediately.
+  - `preview` – Preview changes without making any modifications.
+  - `interactive` – Ask for confirmation before processing each duplicate.
+- **`-f, --file-type`**: Filter by file type (e.g., `.txt` for text files).
+
+### Example Commands
+
+- **Preview duplicate `.txt` files in two directories:**
+
+  ```bash
+  dufiha --file-type .txt --mode preview test_dir1 test_dir2
+  ```
+
+- **Interactively delete duplicates in a specific directory:**
+
+  ```bash
+  dufiha --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
+  ```
+
+- **Show duplicates without modifying any files:**
+
+  ```bash
+  dufiha --modification show test_dir1
+  ```
+
+---
+
+## 🧑‍💻 Author
+
+Developed by **Kevin Veen-Birkenbach**
+- 📧 [kevin@veen.world](mailto:kevin@veen.world)
+- 🌐 [https://www.veen.world](https://www.veen.world)
+
+This project was enhanced with assistance from [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
+
+---
+
+## 📜 License
+
+This project is licensed under the **GNU Affero General Public License, Version 3, 19 November 2007**.
+See the [LICENSE](./LICENSE) file for details.
+
+---
+
+## 🤝 Contributions
+
+Contributions are welcome! Please feel free to fork the repository, submit pull requests, or open issues to help improve Duplicate File Handler. Let's make file management smarter and more efficient! 😊
````
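The README describes detection by computing MD5 hashes and grouping files that share a digest. As a minimal illustrative sketch of that grouping step (not the repository's actual code; the file names below are made up for the demo):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def md5_of(path):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
    return h.hexdigest()

def group_duplicates(paths):
    """Map each digest to the files sharing it; keep only groups of two or more."""
    groups = defaultdict(list)
    for p in paths:
        groups[md5_of(p)].append(p)
    return {h: ps for h, ps in groups.items() if len(ps) > 1}

# Demo with hypothetical files: two identical, one distinct.
with tempfile.TemporaryDirectory() as d:
    a, b, c = (os.path.join(d, n) for n in ("a.txt", "b.txt", "c.txt"))
    for path, data in [(a, b"same"), (b, b"same"), (c, b"different")]:
        with open(path, "wb") as f:
            f.write(data)
    dupes = group_duplicates([a, b, c])
    # Exactly one duplicate group, containing a.txt and b.txt.
    assert list(dupes.values()) == [[a, b]]
```

Note that MD5 is fine for spotting accidental duplicates but is not collision-resistant against adversarial input, which is why the old README called the method "not foolproof".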
create_test_file_structure.py (new file)

@@ -0,0 +1,41 @@

```python
import os
import shutil
import hashlib
import random
import string

def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
    os.makedirs(base_dir, exist_ok=True)
    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
    for subdir in subdirs:
        os.makedirs(subdir, exist_ok=True)

    for dir in [base_dir] + subdirs:
        file_names = [f"file_{i}.txt" for i in range(num_files)]
        for file_name in file_names:
            with open(os.path.join(dir, file_name), 'w') as f:
                content = ''.join(random.choices(string.ascii_lowercase, k=20))
                f.write(content)

        for i in range(min(duplicate_files, num_files)):
            original = os.path.join(dir, file_names[i])
            for dup_num in range(1, duplicate_files + 1):
                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
                shutil.copyfile(original, duplicate)

def copy_directory_contents(src, dst):
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)

def create_file_structure(depth, num_files, duplicate_files):
    base_dirs = ['test_dir1', 'test_dir2']
    for base_dir in base_dirs:
        create_test_directory(base_dir, num_files, duplicate_files, depth)

    copy_directory_contents('test_dir1', 'test_dir3')

    print("Test file structure created.")

if __name__ == "__main__":
    create_file_structure(depth=2, num_files=5, duplicate_files=3)
```
delete_duplicates.sh (deleted)

@@ -1,35 +0,0 @@

```bash
#!/bin/bash

if [ -z "$1" ]
then
  echo "Directory path not provided"
  exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

echo "Duplicates found:"

echo "$duplicates" | while read line
do
  files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
  for file in ${files[@]}
  do
    echo "File: $file"
    echo "Duplicate(s) of this file:"
    for duplicate in ${files[@]}
    do
      if [ $duplicate != $file ]
      then
        echo $duplicate
      fi
    done
    echo "Do you want to delete this file? [y/N]"
    read answer
    if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
    then
      rm -i "$file"
    fi
  done
done
```
list_duplicates.sh (deleted)

@@ -1,30 +0,0 @@

```bash
#!/bin/bash

if [ -z "$1" ]
then
  echo "Directory path not provided"
  exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

if [ -z "$duplicates" ]
then
  echo "No duplicates found."
  exit 0
fi

echo "Duplicates found:"

echo "$duplicates" | while read line
do
  files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
  file_type=$(file -b --mime-type "${files[0]}")
  if [[ $file_type == text/* ]]
  then
    diff "${files[@]}"
  else
    echo "$files"
  fi
done
```
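Both deleted scripts hinge on the pipeline `md5sum {} + | sort | uniq -d -w32`: `uniq -d` prints only repeated lines, and `-w32` compares just the first 32 characters of each line, i.e. the MD5 digest, so differing file paths are ignored. A rough Python rendering of that pipeline, for illustration only (the sample lines below are made-up input, not real repository data):

```python
from collections import Counter

def repeated_digest_lines(md5sum_lines):
    """Emulate `sort | uniq -d -w32`: emit one line per digest that occurs
    more than once, comparing only the first 32 characters (the MD5)."""
    lines = sorted(md5sum_lines)
    counts = Counter(line[:32] for line in lines)
    seen = set()
    out = []
    for line in lines:
        digest = line[:32]
        if counts[digest] > 1 and digest not in seen:
            seen.add(digest)       # like uniq -d, print each group once
            out.append(line)
    return out

# Hypothetical md5sum output: a.txt and b.txt are duplicates.
sample = [
    "d41d8cd98f00b204e9800998ecf8427e  ./a.txt",
    "d41d8cd98f00b204e9800998ecf8427e  ./b.txt",
    "900150983cd24fb0d6963f7d28e17f72  ./c.txt",
]
assert repeated_digest_lines(sample) == ["d41d8cd98f00b204e9800998ecf8427e  ./a.txt"]
```

One quirk this inherits from the shell version: only the first line of each duplicate group is reported, so the scripts had to `grep` the full list again to recover the other paths.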
main.py (new file)

@@ -0,0 +1,107 @@

```python
import os
import argparse
import hashlib
from collections import defaultdict, Counter
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm

def md5sum(filename):
    hash_md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def file_hashing_job(path):
    file_hash = md5sum(path)
    if file_hash:
        return file_hash, path

def find_duplicates(directories, file_type):
    with ProcessPoolExecutor() as executor:
        futures = []
        for directory in directories:
            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
                for filename in files:
                    if file_type and not filename.endswith(file_type):
                        continue
                    path = os.path.join(root, filename)
                    if not os.path.islink(path):
                        futures.append(executor.submit(file_hashing_job, path))

        hashes = defaultdict(list)
        for future in tqdm(futures, desc="Processing files", unit="file"):
            result = future.result()
            if result:
                file_hash, path = result
                hashes[file_hash].append(path)

    return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}

def handle_file_modification(original_file, duplicate_file, modification):
    if modification == 'delete':
        print(f"Deleting {duplicate_file}")
        os.remove(duplicate_file)
    elif modification == 'hardlink':
        os.remove(duplicate_file)
        os.link(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
    elif modification == 'symlink':
        os.remove(duplicate_file)
        os.symlink(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a symlink to {original_file}")

def handle_modification(files, modification, mode, apply_to):
    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
    for duplicate_file in files:
        if duplicate_file != original_file:
            if duplicate_file.startswith(tuple(apply_to)):
                if mode == 'preview' and modification != 'show':
                    print(f"Would perform {modification} on {duplicate_file}")
                elif mode == 'act':
                    handle_file_modification(original_file, duplicate_file, modification)
                elif mode == 'interactive':
                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
                    if answer.lower() in ['y', 'yes']:
                        handle_file_modification(original_file, duplicate_file, modification)
            else:
                print(f"Duplicate file (unmodified): {duplicate_file}")
        elif modification != 'show':
            print(f"Original file kept: {original_file}")
    print()

def main(args):
    directories = args.directories
    apply_to = args.apply_to or directories
    duplicates = find_duplicates(directories, args.file_type)

    if not duplicates:
        print("No duplicates found.")
        return

    for file_hash, files in duplicates.items():
        print(f"Duplicate files for hash {file_hash}:")
        [print(file) for file in files if file.startswith(tuple(directories))]
        handle_modification(files, args.modification, args.mode, apply_to)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
    parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
    parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
    parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)

    args = parser.parse_args()

    if not args.directories:
        parser.print_help()
        parser.exit()

    if args.apply_to and args.modification not in ['delete', 'hardlink', 'symlink']:
        parser.error("--apply-to requires --modification to be 'delete', 'hardlink', or 'symlink'.")
    if not args.apply_to and args.modification != 'show':
        parser.error("Without --apply-to only 'show' modification is allowed.")

    main(args)
```
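`handle_file_modification` in main.py removes the duplicate and recreates it as a link. The practical difference between the two link modes: a hard link is another directory entry for the same inode, so the data survives even if the original path is deleted, while a symlink merely stores the original's path and dangles if that path goes away. A small sketch of that distinction, assuming POSIX semantics (file names here are illustrative):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    hard = os.path.join(d, "hard.txt")
    soft = os.path.join(d, "soft.txt")
    with open(original, "w") as f:
        f.write("content")

    os.link(original, hard)      # new directory entry for the same inode
    os.symlink(original, soft)   # a pointer to the original's path

    # The hard link shares the inode with the original; the symlink does not.
    assert os.stat(hard).st_ino == os.stat(original).st_ino
    assert os.path.islink(soft) and not os.path.islink(hard)

    # After the original disappears, the hard link still reads fine,
    # but the symlink dangles.
    os.remove(original)
    with open(hard) as f:
        assert f.read() == "content"
    assert not os.path.exists(soft)  # exists() follows the now-broken symlink
```

This also explains why the tool refuses to modify files outside `--apply-to`: the surviving "original" must stay in place for the symlink variant to keep working.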