Compare commits


28 Commits

SHA1 Message Date
1e5ac1cec8 Added Funding 2025-03-12 20:52:47 +01:00
206351b067 Merge branch 'main' of github.com:kevinveenbirkenbach/duplicate-file-handler 2025-03-12 11:14:39 +01:00
8239040659 Update README.md 2025-03-12 10:49:37 +01:00
2b06bdf6c6 Update README.md 2025-03-04 20:47:26 +01:00
32669ba528 Renamed 2025-03-04 20:45:06 +01:00
89e15dd023 Reimplemented by an accident deleted function 2023-11-14 15:33:32 +01:00
e31ee231fa Improved velocity with parallel processing 2023-11-14 15:26:47 +01:00
5c4afe2655 Changed unit 2023-11-14 13:47:20 +01:00
62c6443858 updated progress 2023-11-14 13:44:20 +01:00
0c49ca0fcc implemented progress 2023-11-14 13:39:59 +01:00
8fc0d37a09 updated README.md 2023-11-14 13:23:12 +01:00
a505cfa8d3 updated test script 2023-11-14 13:18:30 +01:00
62dc1e1250 Solved filetype bug 2023-11-14 13:00:18 +01:00
9e815c6854 Optimized show logic 2023-11-14 12:56:24 +01:00
f854e46511 Deleted old scripts 2023-11-14 12:51:35 +01:00
53b1d8d0fa Deleted old scripts 2023-11-14 12:50:34 +01:00
1dc5019e4e Added filetype parameter 2023-11-14 12:50:14 +01:00
ce5db7c6da Solved symlink bug 2023-11-14 12:41:52 +01:00
f5c7a945c5 solved apply to bug 2023-11-14 12:32:48 +01:00
1384a5451f Updated arguments 2023-11-14 12:17:21 +01:00
be44b2984f Added linebreak to make it easier readable 2023-11-14 11:56:02 +01:00
3e159df46d Optimized logic to keep original file 2023-11-14 11:53:32 +01:00
a68530abe0 optimized logic 2023-11-14 11:44:51 +01:00
7c4010c59e Optimized modes 2023-11-14 11:41:11 +01:00
c2566a355d show help if no parameters are passed 2023-11-14 11:14:09 +01:00
a54274b052 optimized structure and created test script 2023-11-14 11:06:43 +01:00
73f2bbe409 implemented multi directory scan 2023-11-14 10:33:49 +01:00
db65668217 added python draft 2023-11-14 10:28:18 +01:00
7 changed files with 233 additions and 84 deletions

.github/FUNDING.yml vendored Normal file

@@ -0,0 +1,7 @@
github: kevinveenbirkenbach
patreon: kevinveenbirkenbach
buy_me_a_coffee: kevinveenbirkenbach
custom: https://s.veen.world/paypaldonate

.gitignore vendored Normal file

@@ -0,0 +1 @@
test_dir*

README.md

@@ -1,38 +1,96 @@
-# Duplicate File Handler
+# Duplicate File Handler (dufiha) 🔍
+[![GitHub Sponsors](https://img.shields.io/badge/Sponsor-GitHub%20Sponsors-blue?logo=github)](https://github.com/sponsors/kevinveenbirkenbach) [![Patreon](https://img.shields.io/badge/Support-Patreon-orange?logo=patreon)](https://www.patreon.com/c/kevinveenbirkenbach) [![Buy Me a Coffee](https://img.shields.io/badge/Buy%20me%20a%20Coffee-Funding-yellow?logo=buymeacoffee)](https://buymeacoffee.com/kevinveenbirkenbach) [![PayPal](https://img.shields.io/badge/Donate-PayPal-blue?logo=paypal)](https://s.veen.world/paypaldonate)
-This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
-The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
+[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](./LICENSE) [![GitHub stars](https://img.shields.io/github/stars/kevinveenbirkenbach/duplicate-file-handler.svg?style=social)](https://github.com/kevinveenbirkenbach/duplicate-file-handler/stargazers)
-## Author
+Duplicate File Handler is a Python CLI tool for identifying and handling duplicate files within one or more directories based on their MD5 hashes. With flexible file-type filtering and multiple action modes, you can efficiently delete duplicates or replace them with hard or symbolic links.
-**Kevin Veen-Birkenbach**
-- Email: kevin@veen.world
-- Website: [https://www.veen.world](https://www.veen.world)
+---
-This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
+## 🛠 Features
-## Setup
-These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
+- **Duplicate Detection:** Computes MD5 hashes for files to find duplicates.
+- **File Type Filtering:** Process only files with a specified extension.
+- **Multiple Modification Options:** Choose to delete duplicates, replace them with hard links, or create symbolic links.
+- **Flexible Modes:** Operate in preview, interactive, or active mode to suit your workflow.
+- **Parallel Processing:** Utilizes process pooling for efficient scanning of large directories.
-## Usage
+---
-### 1. List Duplicate Files
+## 📥 Installation
-`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
+Install Duplicate File Handler via [Kevin's Package Manager](https://github.com/kevinveenbirkenbach/package-manager) under the alias `dufiha`:
 ```bash
-./list_duplicates.sh /path/to/directory
+package-manager install dufiha
 ```
-### 2. Delete Duplicate Files
+This command installs the tool globally, making it available as `dufiha` in your terminal. 🚀
-`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
+---
+## 🚀 Usage
+Run Duplicate File Handler by specifying one or more directories to scan for duplicates:
 ```bash
-./delete_duplicates.sh /path/to/directory
+dufiha [options] directory1 directory2 ...
 ```
-## License
+### Options
-This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
+- **`--apply-to`**: Directories to which modifications should be applied.
+- **`--modification`**: Action to perform on duplicates:
+  - `delete`: Delete duplicate files.
+  - `hardlink`: Replace duplicates with hard links.
+  - `symlink`: Replace duplicates with symbolic links.
+  - `show`: Only display duplicate files (default).
+- **`--mode`**: How to apply modifications:
+  - `act`: Execute changes immediately.
+  - `preview`: Preview changes without making any modifications.
+  - `interactive`: Ask for confirmation before processing each duplicate.
+- **`-f, --file-type`**: Filter by file type (e.g., `.txt` for text files).
+### Example Commands
+- **Preview duplicate `.txt` files in two directories:**
+```bash
+dufiha --file-type .txt --mode preview test_dir1 test_dir2
+```
+- **Interactively delete duplicates in a specific directory:**
+```bash
+dufiha --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
+```
+- **Show duplicates without modifying any files:**
+```bash
+dufiha --mode show test_dir1
+```
+---
+## 🧑‍💻 Author
+Developed by **Kevin Veen-Birkenbach**
+- 📧 [kevin@veen.world](mailto:kevin@veen.world)
+- 🌐 [https://www.veen.world](https://www.veen.world)
+This project was enhanced with assistance from [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
+---
+## 📜 License
+This project is licensed under the **GNU Affero General Public License, Version 3, 19 November 2007**.
+See the [LICENSE](./LICENSE) file for details.
+---
+## 🤝 Contributions
+Contributions are welcome! Please feel free to fork the repository, submit pull requests, or open issues to help improve Duplicate File Handler. Let's make file management smarter and more efficient! 😊
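The `preview` and `act` modes documented above are meant to be used in sequence: dry-run first, then apply. A hypothetical session with the installed `dufiha` alias (the directory names `photos` and `backup` are illustrative, not from this diff):

```bash
# Dry run: report which duplicates in backup/ would become hard links, changing nothing
dufiha --apply-to backup --modification hardlink --mode preview photos backup

# Same operation, applied for real
dufiha --apply-to backup --modification hardlink --mode act photos backup
```

Since `preview` is the default mode, omitting `--mode` gives you the dry run automatically.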


@@ -0,0 +1,41 @@
import os
import shutil
import hashlib
import random
import string


def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
    os.makedirs(base_dir, exist_ok=True)
    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
    for subdir in subdirs:
        os.makedirs(subdir, exist_ok=True)
    for dir in [base_dir] + subdirs:
        file_names = [f"file_{i}.txt" for i in range(num_files)]
        for file_name in file_names:
            with open(os.path.join(dir, file_name), 'w') as f:
                content = ''.join(random.choices(string.ascii_lowercase, k=20))
                f.write(content)
        for i in range(min(duplicate_files, num_files)):
            original = os.path.join(dir, file_names[i])
            for dup_num in range(1, duplicate_files+1):
                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
                shutil.copyfile(original, duplicate)


def copy_directory_contents(src, dst):
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)


def create_file_structure(depth, num_files, duplicate_files):
    base_dirs = ['test_dir1', 'test_dir2']
    for base_dir in base_dirs:
        create_test_directory(base_dir, num_files, duplicate_files, depth)
    copy_directory_contents('test_dir1', 'test_dir3')
    print("Test file structure created.")


if __name__ == "__main__":
    create_file_structure(depth=2, num_files=5, duplicate_files=3)
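The script above seeds `test_dir1` and `test_dir2` with random files plus planted duplicates and mirrors `test_dir1` into `test_dir3`, which is why `.gitignore` now excludes `test_dir*`. A minimal smoke test, assuming the file is saved as `create_file_structure.py` (the actual filename is not visible in this extract):

```bash
# Hypothetical filename: generate the fixture directories ...
python create_file_structure.py

# ... then list the planted .txt duplicates without touching anything
dufiha --file-type .txt --mode preview test_dir1 test_dir2 test_dir3
```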

delete_duplicates.sh

@@ -1,35 +0,0 @@
#!/bin/bash

if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    for file in ${files[@]}
    do
        echo "File: $file"
        echo "Duplicate(s) of this file:"
        for duplicate in ${files[@]}
        do
            if [ $duplicate != $file ]
            then
                echo $duplicate
            fi
        done
        echo "Do you want to delete this file? [y/N]"
        read answer
        if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
        then
            rm -i "$file"
        fi
    done
done

list_duplicates.sh

@@ -1,30 +0,0 @@
#!/bin/bash

if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

if [ -z "$duplicates" ]
then
    echo "No duplicates found."
    exit 0
fi

echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    file_type=$(file -b --mime-type "${files[0]}")
    if [[ $file_type == text/* ]]
    then
        diff "${files[@]}"
    else
        echo "$files"
    fi
done
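Both deleted scripts were built around the same grouping pipeline: `md5sum` emits `<digest>  <path>`, `sort` brings identical digests together, and `uniq -d -w32` compares only the first 32 characters (the length of an MD5 hex digest), printing one representative line per group of duplicates. A standalone sketch of that idea:

```bash
# One output line per duplicated digest; only the group's first path is shown,
# so a second pass over the full md5sum listing is needed to recover all members
find . -type f -exec md5sum {} + | sort | uniq -d -w32
```

The Python rewrite below sidesteps that second pass by collecting every path into a dictionary keyed by hash.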

main.py Normal file

@@ -0,0 +1,107 @@
import os
import argparse
import hashlib
from collections import defaultdict, Counter
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm


def md5sum(filename):
    hash_md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def file_hashing_job(path):
    file_hash = md5sum(path)
    if file_hash:
        return file_hash, path


def find_duplicates(directories, file_type):
    with ProcessPoolExecutor() as executor:
        futures = []
        for directory in directories:
            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
                for filename in files:
                    if file_type and not filename.endswith(file_type):
                        continue
                    path = os.path.join(root, filename)
                    if not os.path.islink(path):
                        futures.append(executor.submit(file_hashing_job, path))
        hashes = defaultdict(list)
        for future in tqdm(futures, desc="Processing files", unit="file"):
            result = future.result()
            if result:
                file_hash, path = result
                hashes[file_hash].append(path)
    return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}


def handle_file_modification(original_file, duplicate_file, modification):
    if modification == 'delete':
        print(f"Deleting {duplicate_file}")
        os.remove(duplicate_file)
    elif modification == 'hardlink':
        os.remove(duplicate_file)
        os.link(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
    elif modification == 'symlink':
        os.remove(duplicate_file)
        os.symlink(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a symlink to {original_file}")


def handle_modification(files, modification, mode, apply_to):
    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
    for duplicate_file in files:
        if duplicate_file != original_file:
            if duplicate_file.startswith(tuple(apply_to)):
                if mode == 'preview' and modification != 'show':
                    print(f"Would perform {modification} on {duplicate_file}")
                elif mode == 'act':
                    handle_file_modification(original_file, duplicate_file, modification)
                elif mode == 'interactive':
                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
                    if answer.lower() in ['y', 'yes']:
                        handle_file_modification(original_file, duplicate_file, modification)
            else:
                print(f"Duplicate file (unmodified): {duplicate_file}")
        elif modification != 'show':
            print(f"Original file kept: {original_file}")
    print()


def main(args):
    directories = args.directories
    apply_to = args.apply_to or directories
    duplicates = find_duplicates(directories, args.file_type)
    if not duplicates:
        print("No duplicates found.")
        return
    for file_hash, files in duplicates.items():
        print(f"Duplicate files for hash {file_hash}:")
        [print(file) for file in files if file.startswith(tuple(directories))]
        handle_modification(files, args.modification, args.mode, apply_to)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
    parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
    parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
    parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)

    args = parser.parse_args()
    if not args.directories:
        parser.print_help()
        parser.exit()

    if args.apply_to and args.modification not in ['delete', 'hardlink', 'symlink']:
        parser.error("--apply-to requires --modification to be 'delete', 'hardlink', or 'symlink'.")
    if not args.apply_to and args.modification != 'show':
        parser.error("Without --apply-to only 'show' modification is allowed.")

    main(args)
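Running `main.py` directly is equivalent to the installed `dufiha` alias. A hypothetical end-to-end check against the generated test directories (the paths are examples based on the fixture script's naming scheme):

```bash
# Replace duplicates inside test_dir2 with hard links to the kept original
python main.py --apply-to test_dir2 --modification hardlink --mode act test_dir1 test_dir2

# Hard links share an inode: matching numbers in the first column confirm the replacement
ls -i test_dir2/file_0.txt test_dir2/dup_1_file_0.txt
```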