Compare commits


28 Commits

SHA1 Message Date
1e5ac1cec8 Added Funding 2025-03-12 20:52:47 +01:00
206351b067 Merge branch 'main' of github.com:kevinveenbirkenbach/duplicate-file-handler 2025-03-12 11:14:39 +01:00
8239040659 Update README.md 2025-03-12 10:49:37 +01:00
2b06bdf6c6 Update README.md 2025-03-04 20:47:26 +01:00
32669ba528 Renamed 2025-03-04 20:45:06 +01:00
89e15dd023 Reimplemented by an accident deleted function 2023-11-14 15:33:32 +01:00
e31ee231fa Improved velocity with parallel processing 2023-11-14 15:26:47 +01:00
5c4afe2655 Changed unit 2023-11-14 13:47:20 +01:00
62c6443858 updated progress 2023-11-14 13:44:20 +01:00
0c49ca0fcc implemented progress 2023-11-14 13:39:59 +01:00
8fc0d37a09 updated README.md 2023-11-14 13:23:12 +01:00
a505cfa8d3 updated test script 2023-11-14 13:18:30 +01:00
62dc1e1250 Solved filetype bug 2023-11-14 13:00:18 +01:00
9e815c6854 Optimized show logic 2023-11-14 12:56:24 +01:00
f854e46511 Deleted old scripts 2023-11-14 12:51:35 +01:00
53b1d8d0fa Deleted old scripts 2023-11-14 12:50:34 +01:00
1dc5019e4e Added filetype parameter 2023-11-14 12:50:14 +01:00
ce5db7c6da Solved symlink bug 2023-11-14 12:41:52 +01:00
f5c7a945c5 solved apply to bug 2023-11-14 12:32:48 +01:00
1384a5451f Updated arguments 2023-11-14 12:17:21 +01:00
be44b2984f Added linebreak to make it easier readable 2023-11-14 11:56:02 +01:00
3e159df46d Optimized logic to keep original file 2023-11-14 11:53:32 +01:00
a68530abe0 optimized logic 2023-11-14 11:44:51 +01:00
7c4010c59e Optimized modes 2023-11-14 11:41:11 +01:00
c2566a355d show help if no parameters are passed 2023-11-14 11:14:09 +01:00
a54274b052 optimized structure and created test script 2023-11-14 11:06:43 +01:00
73f2bbe409 implemented multi directory scan 2023-11-14 10:33:49 +01:00
db65668217 added python draft 2023-11-14 10:28:18 +01:00
7 changed files with 233 additions and 84 deletions

.github/FUNDING.yml vendored Normal file

@@ -0,0 +1,7 @@
github: kevinveenbirkenbach
patreon: kevinveenbirkenbach
buy_me_a_coffee: kevinveenbirkenbach
custom: https://s.veen.world/paypaldonate

.gitignore vendored Normal file

@@ -0,0 +1 @@
test_dir*

README.md

@@ -1,38 +1,96 @@
-# Duplicate File Handler
+# Duplicate File Handler (dufiha) 🔍
+[![GitHub Sponsors](https://img.shields.io/badge/Sponsor-GitHub%20Sponsors-blue?logo=github)](https://github.com/sponsors/kevinveenbirkenbach) [![Patreon](https://img.shields.io/badge/Support-Patreon-orange?logo=patreon)](https://www.patreon.com/c/kevinveenbirkenbach) [![Buy Me a Coffee](https://img.shields.io/badge/Buy%20me%20a%20Coffee-Funding-yellow?logo=buymeacoffee)](https://buymeacoffee.com/kevinveenbirkenbach) [![PayPal](https://img.shields.io/badge/Donate-PayPal-blue?logo=paypal)](https://s.veen.world/paypaldonate)
-This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
-The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
+[![License: AGPL v3](https://img.shields.io/badge/License-AGPL%20v3-blue.svg)](./LICENSE) [![GitHub stars](https://img.shields.io/github/stars/kevinveenbirkenbach/duplicate-file-handler.svg?style=social)](https://github.com/kevinveenbirkenbach/duplicate-file-handler/stargazers)
-## Author
+Duplicate File Handler is a Python CLI tool for identifying and handling duplicate files within one or more directories based on their MD5 hashes. With flexible file-type filtering and multiple action modes, you can efficiently delete duplicates or replace them with hard or symbolic links.
-**Kevin Veen-Birkenbach**
-- Email: kevin@veen.world
-- Website: [https://www.veen.world](https://www.veen.world)
+---
-This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
+## 🛠 Features
-## Setup
-These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
+- **Duplicate Detection:** Computes MD5 hashes for files to find duplicates.
+- **File Type Filtering:** Process only files with a specified extension.
+- **Multiple Modification Options:** Choose to delete duplicates, replace them with hard links, or create symbolic links.
+- **Flexible Modes:** Operate in preview, interactive, or active mode to suit your workflow.
+- **Parallel Processing:** Utilizes process pooling for efficient scanning of large directories.
-## Usage
+---
-### 1. List Duplicate Files
+## 📥 Installation
-`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
+Install Duplicate File Handler via [Kevin's Package Manager](https://github.com/kevinveenbirkenbach/package-manager) under the alias `dufiha`:
 ```bash
-./list_duplicates.sh /path/to/directory
+package-manager install dufiha
 ```
-### 2. Delete Duplicate Files
+This command installs the tool globally, making it available as `dufiha` in your terminal. 🚀
-`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
+---
+## 🚀 Usage
+Run Duplicate File Handler by specifying one or more directories to scan for duplicates:
 ```bash
-./delete_duplicates.sh /path/to/directory
+dufiha [options] directory1 directory2 ...
 ```
-## License
+### Options
-This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
+- **`--apply-to`**: Directories to which modifications should be applied.
+- **`--modification`**: Action to perform on duplicates:
+  - `delete`: Delete duplicate files.
+  - `hardlink`: Replace duplicates with hard links.
+  - `symlink`: Replace duplicates with symbolic links.
+  - `show`: Only display duplicate files (default).
+- **`--mode`**: How to apply modifications:
+  - `act`: Execute changes immediately.
+  - `preview`: Preview changes without making any modifications.
+  - `interactive`: Ask for confirmation before processing each duplicate.
+- **`-f, --file-type`**: Filter by file type (e.g., `.txt` for text files).
+### Example Commands
+- **Preview duplicate `.txt` files in two directories:**
+```bash
+dufiha --file-type .txt --mode preview test_dir1 test_dir2
+```
+- **Interactively delete duplicates in a specific directory:**
+```bash
+dufiha --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
+```
+- **Show duplicates without modifying any files:**
+```bash
+dufiha --mode show test_dir1
+```
+---
+## 🧑‍💻 Author
+Developed by **Kevin Veen-Birkenbach**
+- 📧 [kevin@veen.world](mailto:kevin@veen.world)
+- 🌐 [https://www.veen.world](https://www.veen.world)
+This project was enhanced with assistance from [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
+---
+## 📜 License
+This project is licensed under the **GNU Affero General Public License, Version 3, 19 November 2007**.
+See the [LICENSE](./LICENSE) file for details.
+---
+## 🤝 Contributions
+Contributions are welcome! Please feel free to fork the repository, submit pull requests, or open issues to help improve Duplicate File Handler. Let's make file management smarter and more efficient! 😊
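The `preview` and `act` modes documented above are meant to be used in sequence: dry-run first, then apply. A hypothetical session with the installed `dufiha` alias (the directory names `photos` and `backup` are illustrative, not from this diff):

```bash
# Dry run: report which duplicates in backup/ would become hard links, changing nothing
dufiha --apply-to backup --modification hardlink --mode preview photos backup

# Same operation, applied for real
dufiha --apply-to backup --modification hardlink --mode act photos backup
```

Since `preview` is the default mode, omitting `--mode` gives you the dry run automatically.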


@@ -0,0 +1,41 @@
import os
import shutil
import hashlib
import random
import string


def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
    os.makedirs(base_dir, exist_ok=True)
    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
    for subdir in subdirs:
        os.makedirs(subdir, exist_ok=True)
    for dir in [base_dir] + subdirs:
        file_names = [f"file_{i}.txt" for i in range(num_files)]
        for file_name in file_names:
            with open(os.path.join(dir, file_name), 'w') as f:
                content = ''.join(random.choices(string.ascii_lowercase, k=20))
                f.write(content)
        for i in range(min(duplicate_files, num_files)):
            original = os.path.join(dir, file_names[i])
            for dup_num in range(1, duplicate_files+1):
                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
                shutil.copyfile(original, duplicate)


def copy_directory_contents(src, dst):
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)


def create_file_structure(depth, num_files, duplicate_files):
    base_dirs = ['test_dir1', 'test_dir2']
    for base_dir in base_dirs:
        create_test_directory(base_dir, num_files, duplicate_files, depth)
    copy_directory_contents('test_dir1', 'test_dir3')
    print("Test file structure created.")


if __name__ == "__main__":
    create_file_structure(depth=2, num_files=5, duplicate_files=3)
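The script above seeds `test_dir1` and `test_dir2` with random files plus planted duplicates and mirrors `test_dir1` into `test_dir3`, which is why `.gitignore` now excludes `test_dir*`. A minimal smoke test, assuming the file is saved as `create_file_structure.py` (the actual filename is not visible in this extract):

```bash
# Hypothetical filename: generate the fixture directories ...
python create_file_structure.py

# ... then list the planted .txt duplicates without touching anything
dufiha --file-type .txt --mode preview test_dir1 test_dir2 test_dir3
```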

delete_duplicates.sh

@@ -1,35 +0,0 @@
#!/bin/bash

if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    for file in ${files[@]}
    do
        echo "File: $file"
        echo "Duplicate(s) of this file:"
        for duplicate in ${files[@]}
        do
            if [ $duplicate != $file ]
            then
                echo $duplicate
            fi
        done
        echo "Do you want to delete this file? [y/N]"
        read answer
        if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
        then
            rm -i "$file"
        fi
    done
done

list_duplicates.sh

@@ -1,30 +0,0 @@
#!/bin/bash

if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi

dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)

if [ -z "$duplicates" ]
then
    echo "No duplicates found."
    exit 0
fi

echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    file_type=$(file -b --mime-type "${files[0]}")
    if [[ $file_type == text/* ]]
    then
        diff "${files[@]}"
    else
        echo "$files"
    fi
done
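Both deleted scripts were built around the same grouping pipeline: `md5sum` emits `<digest>  <path>`, `sort` brings identical digests together, and `uniq -d -w32` compares only the first 32 characters (the length of an MD5 hex digest), printing one representative line per group of duplicates. A standalone sketch of that idea:

```bash
# One output line per duplicated digest; only the group's first path is shown,
# so a second pass over the full md5sum listing is needed to recover all members
find . -type f -exec md5sum {} + | sort | uniq -d -w32
```

The Python rewrite below sidesteps that second pass by collecting every path into a dictionary keyed by hash.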

main.py Normal file

@@ -0,0 +1,107 @@
import os
import argparse
import hashlib
from collections import defaultdict, Counter
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm


def md5sum(filename):
    hash_md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def file_hashing_job(path):
    file_hash = md5sum(path)
    if file_hash:
        return file_hash, path


def find_duplicates(directories, file_type):
    with ProcessPoolExecutor() as executor:
        futures = []
        for directory in directories:
            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
                for filename in files:
                    if file_type and not filename.endswith(file_type):
                        continue
                    path = os.path.join(root, filename)
                    if not os.path.islink(path):
                        futures.append(executor.submit(file_hashing_job, path))
        hashes = defaultdict(list)
        for future in tqdm(futures, desc="Processing files", unit="file"):
            result = future.result()
            if result:
                file_hash, path = result
                hashes[file_hash].append(path)
    return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}


def handle_file_modification(original_file, duplicate_file, modification):
    if modification == 'delete':
        print(f"Deleting {duplicate_file}")
        os.remove(duplicate_file)
    elif modification == 'hardlink':
        os.remove(duplicate_file)
        os.link(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
    elif modification == 'symlink':
        os.remove(duplicate_file)
        os.symlink(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a symlink to {original_file}")


def handle_modification(files, modification, mode, apply_to):
    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
    for duplicate_file in files:
        if duplicate_file != original_file:
            if duplicate_file.startswith(tuple(apply_to)):
                if mode == 'preview' and modification != 'show':
                    print(f"Would perform {modification} on {duplicate_file}")
                elif mode == 'act':
                    handle_file_modification(original_file, duplicate_file, modification)
                elif mode == 'interactive':
                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
                    if answer.lower() in ['y', 'yes']:
                        handle_file_modification(original_file, duplicate_file, modification)
            else:
                print(f"Duplicate file (unmodified): {duplicate_file}")
        elif modification != 'show':
            print(f"Original file kept: {original_file}")
    print()


def main(args):
    directories = args.directories
    apply_to = args.apply_to or directories
    duplicates = find_duplicates(directories, args.file_type)
    if not duplicates:
        print("No duplicates found.")
        return
    for file_hash, files in duplicates.items():
        print(f"Duplicate files for hash {file_hash}:")
        [print(file) for file in files if file.startswith(tuple(directories))]
        handle_modification(files, args.modification, args.mode, apply_to)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
    parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
    parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
    parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)

    args = parser.parse_args()
    if not args.directories:
        parser.print_help()
        parser.exit()

    if args.apply_to and args.modification not in ['delete', 'hardlink', 'symlink']:
        parser.error("--apply-to requires --modification to be 'delete', 'hardlink', or 'symlink'.")
    if not args.apply_to and args.modification != 'show':
        parser.error("Without --apply-to only 'show' modification is allowed.")

    main(args)
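Running `main.py` directly is equivalent to the installed `dufiha` alias. A hypothetical end-to-end check against the generated test directories (the paths are examples based on the fixture script's naming scheme):

```bash
# Replace duplicates inside test_dir2 with hard links to the kept original
python main.py --apply-to test_dir2 --modification hardlink --mode act test_dir1 test_dir2

# Hard links share an inode: matching numbers in the first column confirm the replacement
ls -i test_dir2/file_0.txt test_dir2/dup_1_file_0.txt
```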