Mirror of https://github.com/kevinveenbirkenbach/duplicate-file-handler.git (synced 2025-09-10 12:17:17 +02:00)

Compare commits: c2566a355d...main (24 commits)
.github/FUNDING.yml (vendored, new file, 7 lines)
@@ -0,0 +1,7 @@
+github: kevinveenbirkenbach
+
+patreon: kevinveenbirkenbach
+
+buy_me_a_coffee: kevinveenbirkenbach
+
+custom: https://s.veen.world/paypaldonate
.gitignore (vendored, 3 changed lines)
@@ -1,2 +1 @@
-test_dir1
-test_dir2
+test_dir*
README.md (96 changed lines)
@@ -1,38 +1,96 @@
-# Duplicate File Handler
-
-This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
-
-The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
-
-## Author
-
-**Kevin Veen-Birkenbach**
-- Email: kevin@veen.world
-- Website: [https://www.veen.world](https://www.veen.world)
-
-This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
-
-## Setup
-
-These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
-
-## Usage
-
-### 1. List Duplicate Files
-
-`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
-
-```bash
-./list_duplicates.sh /path/to/directory
-```
-
-### 2. Delete Duplicate Files
-
-`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
-
-```bash
-./delete_duplicates.sh /path/to/directory
-```
-
-## License
-
-This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
+# Duplicate File Handler (dufiha) 🔍
+
+[](https://github.com/sponsors/kevinveenbirkenbach) [](https://www.patreon.com/c/kevinveenbirkenbach) [](https://buymeacoffee.com/kevinveenbirkenbach) [](https://s.veen.world/paypaldonate)
+[](./LICENSE) [](https://github.com/kevinveenbirkenbach/duplicate-file-handler/stargazers)
+
+Duplicate File Handler is a Python CLI tool for identifying and handling duplicate files within one or more directories based on their MD5 hashes. With flexible file-type filtering and multiple action modes, you can efficiently delete duplicates or replace them with hard or symbolic links.
+
+---
+
+## 🛠 Features
+
+- **Duplicate Detection:** Computes MD5 hashes for files to find duplicates.
+- **File Type Filtering:** Process only files with a specified extension.
+- **Multiple Modification Options:** Choose to delete duplicates, replace them with hard links, or create symbolic links.
+- **Flexible Modes:** Operate in preview, interactive, or active mode to suit your workflow.
+- **Parallel Processing:** Utilizes process pooling for efficient scanning of large directories.
+
+---
+
+## 📥 Installation
+
+Install Duplicate File Handler via [Kevin's Package Manager](https://github.com/kevinveenbirkenbach/package-manager) under the alias `dufiha`:
+
+```bash
+package-manager install dufiha
+```
+
+This command installs the tool globally, making it available as `dufiha` in your terminal. 🚀
+
+---
+
+## 🚀 Usage
+
+Run Duplicate File Handler by specifying one or more directories to scan for duplicates:
+
+```bash
+dufiha [options] directory1 directory2 ...
+```
+
+### Options
+
+- **`--apply-to`**: Directories to which modifications should be applied.
+- **`--modification`**: Action to perform on duplicates:
+  - `delete` – Delete duplicate files.
+  - `hardlink` – Replace duplicates with hard links.
+  - `symlink` – Replace duplicates with symbolic links.
+  - `show` – Only display duplicate files (default).
+- **`--mode`**: How to apply modifications:
+  - `act` – Execute changes immediately.
+  - `preview` – Preview changes without making any modifications.
+  - `interactive` – Ask for confirmation before processing each duplicate.
+- **`-f, --file-type`**: Filter by file type (e.g., `.txt` for text files).
+
+### Example Commands
+
+- **Preview duplicate `.txt` files in two directories:**
+
+  ```bash
+  dufiha --file-type .txt --mode preview test_dir1 test_dir2
+  ```
+
+- **Interactively delete duplicates in a specific directory:**
+
+  ```bash
+  dufiha --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
+  ```
+
+- **Show duplicates without modifying any files:**
+
+  ```bash
+  dufiha --modification show test_dir1
+  ```
+
+---
+
+## 🧑‍💻 Author
+
+Developed by **Kevin Veen-Birkenbach**
+- 📧 [kevin@veen.world](mailto:kevin@veen.world)
+- 🌐 [https://www.veen.world](https://www.veen.world)
+
+This project was enhanced with assistance from [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
+
+---
+
+## 📜 License
+
+This project is licensed under the **GNU Affero General Public License, Version 3, 19 November 2007**.
+See the [LICENSE](./LICENSE) file for details.
+
+---
+
+## 🤝 Contributions
+
+Contributions are welcome! Please feel free to fork the repository, submit pull requests, or open issues to help improve Duplicate File Handler. Let’s make file management smarter and more efficient! 😊
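The old README's caveat still applies to the new tool: MD5 matching is common but not foolproof, since distinct files can in principle share a digest. A minimal sketch of a byte-level confirmation pass (not part of this commit; `confirm_duplicates` is a hypothetical helper) that could vet an MD5-matched group before anything destructive runs:

```python
import filecmp

def confirm_duplicates(paths):
    """Keep only paths whose contents are byte-identical to the first path.

    Assumes `paths` is one group of files that already share an MD5 digest,
    e.g. one value from find_duplicates(). filecmp.cmp(..., shallow=False)
    compares actual file contents, not just os.stat() metadata.
    """
    reference = paths[0]
    confirmed = [reference]
    for candidate in paths[1:]:
        if filecmp.cmp(reference, candidate, shallow=False):
            confirmed.append(candidate)
    return confirmed
```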
create_test_file_structure.py (new file, 41 lines)
@@ -0,0 +1,41 @@
+import os
+import shutil
+import hashlib
+import random
+import string
+
+def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
+    os.makedirs(base_dir, exist_ok=True)
+    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
+    for subdir in subdirs:
+        os.makedirs(subdir, exist_ok=True)
+
+    for dir in [base_dir] + subdirs:
+        file_names = [f"file_{i}.txt" for i in range(num_files)]
+        for file_name in file_names:
+            with open(os.path.join(dir, file_name), 'w') as f:
+                content = ''.join(random.choices(string.ascii_lowercase, k=20))
+                f.write(content)
+
+        for i in range(min(duplicate_files, num_files)):
+            original = os.path.join(dir, file_names[i])
+            for dup_num in range(1, duplicate_files+1):
+                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
+                shutil.copyfile(original, duplicate)
+
+def copy_directory_contents(src, dst):
+    if os.path.exists(dst):
+        shutil.rmtree(dst)
+    shutil.copytree(src, dst)
+
+def create_file_structure(depth, num_files, duplicate_files):
+    base_dirs = ['test_dir1', 'test_dir2']
+    for base_dir in base_dirs:
+        create_test_directory(base_dir, num_files, duplicate_files, depth)
+
+    copy_directory_contents('test_dir1', 'test_dir3')
+
+    print("Test file structure created.")
+
+if __name__ == "__main__":
+    create_file_structure(depth=2, num_files=5, duplicate_files=3)
(removed: previous version of the test file structure script, 46 lines)
@@ -1,46 +0,0 @@
-import os
-import shutil
-import hashlib
-import random
-import string
-
-def create_test_directory(base_dir, num_files=5, duplicate_files=2):
-    if not os.path.exists(base_dir):
-        os.makedirs(base_dir)
-
-    # Create a list of unique file names
-    file_names = [f"file_{i}.txt" for i in range(num_files)]
-
-    # Create some files with random content
-    for file_name in file_names:
-        with open(os.path.join(base_dir, file_name), 'w') as f:
-            content = ''.join(random.choices(string.ascii_lowercase, k=20))
-            f.write(content)
-
-    # Create duplicates
-    for i in range(duplicate_files):
-        original = os.path.join(base_dir, file_names[i])
-        duplicate = os.path.join(base_dir, f"dup_{file_names[i]}")
-        shutil.copyfile(original, duplicate)
-
-def create_file_structure():
-    # Create the base directories
-    base_dirs = ['test_dir1', 'test_dir2']
-    for base_dir in base_dirs:
-        create_test_directory(base_dir)
-
-    # Create a file in the first directory and duplicate it in the second
-    with open(os.path.join('test_dir1', 'unique_file.txt'), 'w') as f:
-        f.write("This is a unique file.")
-
-    shutil.copyfile(os.path.join('test_dir1', 'unique_file.txt'),
-                    os.path.join('test_dir2', 'unique_file.txt'))
-
-    # Create an additional unique file in the second directory
-    with open(os.path.join('test_dir2', 'another_unique_file.txt'), 'w') as f:
-        f.write("This is another unique file.")
-
-    print("Test file structure created.")
-
-if __name__ == "__main__":
-    create_file_structure()
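Given the fixture layout above, a quick sanity check (a sketch, not part of this commit) confirms the generated duplicates after running `python create_test_file_structure.py`; the names `file_0.txt` and `dup_1_file_0.txt` follow from the naming scheme in `create_test_directory`, and `test_dir3` is the verbatim copy made by `copy_directory_contents`:

```python
import hashlib
import os

def md5_of(path):
    # Hash the whole file at once; fine for the tiny 20-byte fixtures.
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# dup_1_file_0.txt is a copy of file_0.txt made by create_test_directory(),
# and test_dir3 mirrors test_dir1 byte for byte.
original = os.path.join('test_dir1', 'file_0.txt')
duplicate = os.path.join('test_dir1', 'dup_1_file_0.txt')
mirrored = os.path.join('test_dir3', 'file_0.txt')

assert md5_of(original) == md5_of(duplicate) == md5_of(mirrored)
print("Fixture duplicates confirmed.")
```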
delete_duplicates.sh (removed, 35 lines)
@@ -1,35 +0,0 @@
-#!/bin/bash
-
-if [ -z "$1" ]
-then
-    echo "Directory path not provided"
-    exit 1
-fi
-
-dir="$1"
-duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
-
-echo "Duplicates found:"
-
-echo "$duplicates" | while read line
-do
-    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
-    for file in ${files[@]}
-    do
-        echo "File: $file"
-        echo "Duplicate(s) of this file:"
-        for duplicate in ${files[@]}
-        do
-            if [ $duplicate != $file ]
-            then
-                echo $duplicate
-            fi
-        done
-        echo "Do you want to delete this file? [y/N]"
-        read answer
-        if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
-        then
-            rm -i "$file"
-        fi
-    done
-done
list_duplicates.sh (removed, 30 lines)
@@ -1,30 +0,0 @@
-#!/bin/bash
-
-if [ -z "$1" ]
-then
-    echo "Directory path not provided"
-    exit 1
-fi
-
-dir="$1"
-duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
-
-if [ -z "$duplicates" ]
-then
-    echo "No duplicates found."
-    exit 0
-fi
-
-echo "Duplicates found:"
-
-echo "$duplicates" | while read line
-do
-    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
-    file_type=$(file -b --mime-type "${files[0]}")
-    if [[ $file_type == text/* ]]
-    then
-        diff "${files[@]}"
-    else
-        echo "$files"
-    fi
-done
main.py (103 changed lines)
@@ -1,7 +1,9 @@
 import os
 import argparse
 import hashlib
-from collections import defaultdict
+from collections import defaultdict, Counter
+from concurrent.futures import ProcessPoolExecutor
+from tqdm import tqdm
 
 def md5sum(filename):
     hash_md5 = hashlib.md5()
@@ -10,65 +12,86 @@ def md5sum(filename):
             hash_md5.update(chunk)
     return hash_md5.hexdigest()
 
-def find_duplicates(directories):
-    hashes = defaultdict(list)
-    for directory in directories:
-        for root, dirs, files in os.walk(directory):
-            for filename in files:
-                path = os.path.join(root, filename)
-                file_hash = md5sum(path)
-                hashes[file_hash].append(path)
+def file_hashing_job(path):
+    file_hash = md5sum(path)
+    if file_hash:
+        return file_hash, path
+
+def find_duplicates(directories, file_type):
+    with ProcessPoolExecutor() as executor:
+        futures = []
+        for directory in directories:
+            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
+                for filename in files:
+                    if file_type and not filename.endswith(file_type):
+                        continue
+                    path = os.path.join(root, filename)
+                    if not os.path.islink(path):
+                        futures.append(executor.submit(file_hashing_job, path))
+
+        hashes = defaultdict(list)
+        for future in tqdm(futures, desc="Processing files", unit="file"):
+            result = future.result()
+            if result:
+                file_hash, path = result
+                hashes[file_hash].append(path)
 
     return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}
 
+def handle_file_modification(original_file, duplicate_file, modification):
+    if modification == 'delete':
+        print(f"Deleting {duplicate_file}")
+        os.remove(duplicate_file)
+    elif modification == 'hardlink':
+        os.remove(duplicate_file)
+        os.link(original_file, duplicate_file)
+        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
+    elif modification == 'symlink':
+        os.remove(duplicate_file)
+        os.symlink(original_file, duplicate_file)
+        print(f"Replaced {duplicate_file} with a symlink to {original_file}")
+
 def handle_modification(files, modification, mode, apply_to):
-    if mode == 'preview':
-        if modification == 'show':
-            print("Would show the following duplicate files:")
-            for file in files:
-                if file.startswith(tuple(apply_to)):
-                    print(file)
-    elif mode == 'act':
-        if modification == 'delete':
-            for file in files:
-                if file.startswith(tuple(apply_to)):
-                    print(f"Deleting {file}")
-                    os.remove(file)
-        elif modification == 'hardlink':
-            # Implement hardlink logic here
-            pass
-        elif modification == 'symlink':
-            # Implement symlink logic here
-            pass
-    elif mode == 'interactive':
-        for file in files:
-            if file.startswith(tuple(apply_to)):
-                answer = input(f"Do you want to {modification} this file? {file} [y/N] ")
-                if answer.lower() in ['y', 'yes']:
-                    # Implement deletion, hardlink or symlink logic here
-                    pass
+    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
+    for duplicate_file in files:
+        if duplicate_file != original_file:
+            if duplicate_file.startswith(tuple(apply_to)):
+                if mode == 'preview' and modification != 'show':
+                    print(f"Would perform {modification} on {duplicate_file}")
+                elif mode == 'act':
+                    handle_file_modification(original_file, duplicate_file, modification)
+                elif mode == 'interactive':
+                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
+                    if answer.lower() in ['y', 'yes']:
+                        handle_file_modification(original_file, duplicate_file, modification)
+                else:
+                    print(f"Duplicate file (unmodified): {duplicate_file}")
+            elif modification != 'show':
+                print(f"Original file kept: {original_file}")
+    print()
 
 def main(args):
     directories = args.directories
     apply_to = args.apply_to or directories
-    duplicates = find_duplicates(directories)
+    duplicates = find_duplicates(directories, args.file_type)
 
     if not duplicates:
         print("No duplicates found.")
         return
 
     for file_hash, files in duplicates.items():
-        if args.mode == 'preview' or (args.mode == 'interactive' and args.modification == 'show'):
-            print(f"Duplicate files for hash {file_hash}:")
-            [print(file) for file in files if file.startswith(tuple(apply_to))]
-        else:
-            handle_modification(files, args.modification, args.mode, apply_to)
+        print(f"Duplicate files for hash {file_hash}:")
+        [print(file) for file in files if file.startswith(tuple(directories))]
+        handle_modification(files, args.modification, args.mode, apply_to)
 
 if __name__ == "__main__":
     parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
     parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
-    parser.add_argument('--apply-to', nargs='*', help="Directories to apply modifications to.")
+    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
     parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
     parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
+    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)
 
     args = parser.parse_args()
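The new `hardlink` and `symlink` branches in `handle_file_modification` can be spot-checked after an `--mode act` run. A small standard-library sketch (not part of this commit):

```python
import os

def describe_link(original, duplicate):
    """Report how `duplicate` relates to `original` after dufiha has run.

    os.path.islink() is true only for symbolic links; os.path.samefile()
    is true for hard links (same device and inode). Checking islink first
    matters, because samefile() follows symlinks.
    """
    if os.path.islink(duplicate):
        print(f"{duplicate} -> symlink to {os.readlink(duplicate)}")
    elif os.path.samefile(original, duplicate):
        print(f"{duplicate} is a hardlink of {original} "
              f"(inode {os.stat(duplicate).st_ino})")
    else:
        print(f"{duplicate} is an independent file")
```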