Compare commits

...

19 Commits

Author SHA1 Message Date
Kevin Veen-Birkenbach 89e15dd023 Reimplemented by an accident deleted function 2023-11-14 15:33:32 +01:00
Kevin Veen-Birkenbach e31ee231fa Improved velocity with parallel processing 2023-11-14 15:26:47 +01:00
Kevin Veen-Birkenbach 5c4afe2655 Changed unit 2023-11-14 13:47:20 +01:00
Kevin Veen-Birkenbach 62c6443858 updated progress 2023-11-14 13:44:20 +01:00
Kevin Veen-Birkenbach 0c49ca0fcc implemented progress 2023-11-14 13:39:59 +01:00
Kevin Veen-Birkenbach 8fc0d37a09 updated README.md 2023-11-14 13:23:12 +01:00
Kevin Veen-Birkenbach a505cfa8d3 updated test script 2023-11-14 13:18:30 +01:00
Kevin Veen-Birkenbach 62dc1e1250 Solved filetype bug 2023-11-14 13:00:18 +01:00
Kevin Veen-Birkenbach 9e815c6854 Optimized show logic 2023-11-14 12:56:24 +01:00
Kevin Veen-Birkenbach f854e46511 Deleted old scripts 2023-11-14 12:51:35 +01:00
Kevin Veen-Birkenbach 53b1d8d0fa Deleted old scripts 2023-11-14 12:50:34 +01:00
Kevin Veen-Birkenbach 1dc5019e4e Added filetype parameter 2023-11-14 12:50:14 +01:00
Kevin Veen-Birkenbach ce5db7c6da Solved symlink bug 2023-11-14 12:41:52 +01:00
Kevin Veen-Birkenbach f5c7a945c5 solved apply to bug 2023-11-14 12:32:48 +01:00
Kevin Veen-Birkenbach 1384a5451f Updated arguments 2023-11-14 12:17:21 +01:00
Kevin Veen-Birkenbach be44b2984f Added linebreak to make it easier readable 2023-11-14 11:56:02 +01:00
Kevin Veen-Birkenbach 3e159df46d Optimized logic to keep original file 2023-11-14 11:53:32 +01:00
Kevin Veen-Birkenbach a68530abe0 optimized logic 2023-11-14 11:44:51 +01:00
Kevin Veen-Birkenbach 7c4010c59e Optimized modes 2023-11-14 11:41:11 +01:00
7 changed files with 136 additions and 167 deletions

3
.gitignore vendored
View File

@@ -1,2 +1 @@
test_dir1
test_dir2
test_dir*

README.md
View File

@@ -1,38 +1,55 @@
# Duplicate File Handler
This repository contains two bash scripts for handling duplicate files in a directory and its subdirectories.
The scripts may need to be modified depending on the specific requirements of your system or the specific use case. They currently operate by comparing the MD5 hash of files to find duplicates, which is a common but not foolproof method.
This repository contains a Python script for identifying and handling duplicate files in a directory and its subdirectories based on their MD5 hash. It allows filtering by file type and offers several ways of handling duplicates: deletion, hard linking, or symlinking.
## Author
**Kevin Veen-Birkenbach**
- Kevin Veen-Birkenbach
- Email: kevin@veen.world
- Website: [https://www.veen.world](https://www.veen.world)
This repository was created with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/013e4367-8eca-4066-8b18-55457202ba57).
This repository was enhanced with the help of [OpenAI's ChatGPT](https://chat.openai.com/share/825931d6-1e33-40b0-8dfc-914b3f852eeb).
## Setup
These scripts will help you manage duplicate files in your directories. Please make sure to adjust permissions on the scripts to be executable with `chmod +x list_duplicates.sh delete_duplicates.sh` before running.
To use the script, ensure you have Python installed on your system. Apart from `tqdm`, which provides the progress bars, the script only uses standard Python libraries.
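If `tqdm` is not yet available, it can be installed with pip (a minimal sketch; any other Python package manager works as well):
```bash
pip install tqdm
```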
## Usage
### 1. List Duplicate Files
### Identifying and Handling Duplicates
`list_duplicates.sh` is a script to list all duplicate files in a specified directory and its subdirectories. For text files, it will also display the diffs.
`main.py` is a Python script to identify all duplicate files in the specified directories. It can also filter by file type and handle duplicates by deleting them or replacing them with hard or symbolic links.
```bash
./list_duplicates.sh /path/to/directory
python main.py [options] directories
```
### 2. Delete Duplicate Files
#### Options
- `--apply-to`: Directories to apply modifications to.
- `--modification`: Action to perform on duplicates - `delete`, `hardlink`, `symlink`, or `show` (default).
- `--mode`: How to apply the modifications - `act`, `preview`, `interactive` (default: `preview`).
- `-f`, `--file-type`: Filter by file type (e.g., `.txt` for text files).
`delete_duplicates.sh` is a script to find and delete duplicate files in a specified directory and its subdirectories. It will ask for confirmation before deleting each file and display the paths of its duplicates.
### Creating Test File Structure
`create_file_structure.py` is a utility script to create a test file structure with duplicate files for testing purposes.
```bash
./delete_duplicates.sh /path/to/directory
python create_file_structure.py
```
## Example
To preview duplicate `.txt` files in `test_dir1` and `test_dir2`:
```bash
python main.py --file-type .txt --mode preview test_dir1 test_dir2
```
To interactively delete duplicates in `test_dir2`:
```bash
python main.py --apply-to test_dir2 --modification delete --mode interactive test_dir1 test_dir2
```
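To replace duplicates in `test_dir2` with hard links instead of deleting them (a sketch that simply combines the options documented above):
```bash
python main.py --apply-to test_dir2 --modification hardlink --mode act test_dir1 test_dir2
```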
## License
This project is licensed under the terms of the [GNU Affero General Public License v3.0](https://www.gnu.org/licenses/agpl-3.0.de.html).
This project is licensed under the terms of the [MIT License](LICENSE).

41
create_file_structure.py Normal file
View File

@@ -0,0 +1,41 @@
import os
import shutil
import hashlib
import random
import string
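# Build a directory tree of small random text files plus deliberate duplicates for exercising main.py.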
def create_test_directory(base_dir, num_files=5, duplicate_files=2, depth=1):
    os.makedirs(base_dir, exist_ok=True)
    subdirs = [os.path.join(base_dir, f"subdir_{i}") for i in range(depth)]
    for subdir in subdirs:
        os.makedirs(subdir, exist_ok=True)
    for dir in [base_dir] + subdirs:
        file_names = [f"file_{i}.txt" for i in range(num_files)]
        for file_name in file_names:
            with open(os.path.join(dir, file_name), 'w') as f:
                content = ''.join(random.choices(string.ascii_lowercase, k=20))
                f.write(content)
        for i in range(min(duplicate_files, num_files)):
            original = os.path.join(dir, file_names[i])
            for dup_num in range(1, duplicate_files+1):
                duplicate = os.path.join(dir, f"dup_{dup_num}_{file_names[i]}")
                shutil.copyfile(original, duplicate)
def copy_directory_contents(src, dst):
    if os.path.exists(dst):
        shutil.rmtree(dst)
    shutil.copytree(src, dst)
def create_file_structure(depth, num_files, duplicate_files):
    base_dirs = ['test_dir1', 'test_dir2']
    for base_dir in base_dirs:
        create_test_directory(base_dir, num_files, duplicate_files, depth)
    copy_directory_contents('test_dir1', 'test_dir3')
    print("Test file structure created.")
if __name__ == "__main__":
    create_file_structure(depth=2, num_files=5, duplicate_files=3)
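The helpers can also be called directly with parameters other than the defaults above; a minimal sketch (the values are purely illustrative):
```python
from create_file_structure import create_file_structure

# Build a deeper test tree with more files and more duplicates per directory.
create_file_structure(depth=3, num_files=10, duplicate_files=4)
```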

View File

@@ -1,46 +0,0 @@
import os
import shutil
import hashlib
import random
import string
def create_test_directory(base_dir, num_files=5, duplicate_files=2):
    if not os.path.exists(base_dir):
        os.makedirs(base_dir)
    # Create a list of unique file names
    file_names = [f"file_{i}.txt" for i in range(num_files)]
    # Create some files with random content
    for file_name in file_names:
        with open(os.path.join(base_dir, file_name), 'w') as f:
            content = ''.join(random.choices(string.ascii_lowercase, k=20))
            f.write(content)
    # Create duplicates
    for i in range(duplicate_files):
        original = os.path.join(base_dir, file_names[i])
        duplicate = os.path.join(base_dir, f"dup_{file_names[i]}")
        shutil.copyfile(original, duplicate)
def create_file_structure():
    # Create the base directories
    base_dirs = ['test_dir1', 'test_dir2']
    for base_dir in base_dirs:
        create_test_directory(base_dir)
    # Create a file in the first directory and duplicate it in the second
    with open(os.path.join('test_dir1', 'unique_file.txt'), 'w') as f:
        f.write("This is a unique file.")
    shutil.copyfile(os.path.join('test_dir1', 'unique_file.txt'),
                    os.path.join('test_dir2', 'unique_file.txt'))
    # Create an additional unique file in the second directory
    with open(os.path.join('test_dir2', 'another_unique_file.txt'), 'w') as f:
        f.write("This is another unique file.")
    print("Test file structure created.")
if __name__ == "__main__":
    create_file_structure()

delete_duplicates.sh
View File

@@ -1,35 +0,0 @@
#!/bin/bash
if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi
dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    for file in ${files[@]}
    do
        echo "File: $file"
        echo "Duplicate(s) of this file:"
        for duplicate in ${files[@]}
        do
            if [ $duplicate != $file ]
            then
                echo $duplicate
            fi
        done
        echo "Do you want to delete this file? [y/N]"
        read answer
        if [[ $answer == [yY] || $answer == [yY][eE][sS] ]]
        then
            rm -i "$file"
        fi
    done
done

list_duplicates.sh
View File

@@ -1,30 +0,0 @@
#!/bin/bash
if [ -z "$1" ]
then
    echo "Directory path not provided"
    exit 1
fi
dir="$1"
duplicates=$(find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32)
if [ -z "$duplicates" ]
then
    echo "No duplicates found."
    exit 0
fi
echo "Duplicates found:"
echo "$duplicates" | while read line
do
    files=$(grep "$line" <<< "$duplicates" | awk '{print $2}')
    file_type=$(file -b --mime-type "${files[0]}")
    if [[ $file_type == text/* ]]
    then
        diff "${files[@]}"
    else
        echo "$files"
    fi
done

103
main.py
View File

@@ -1,7 +1,9 @@
import os
import argparse
import hashlib
from collections import defaultdict
from collections import defaultdict, Counter
from concurrent.futures import ProcessPoolExecutor
from tqdm import tqdm
def md5sum(filename):
    hash_md5 = hashlib.md5()
@@ -10,65 +12,86 @@ def md5sum(filename):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
def find_duplicates(directories):
    hashes = defaultdict(list)
    for directory in directories:
        for root, dirs, files in os.walk(directory):
            for filename in files:
                path = os.path.join(root, filename)
                file_hash = md5sum(path)
def file_hashing_job(path):
    file_hash = md5sum(path)
    if file_hash:
        return file_hash, path
def find_duplicates(directories, file_type):
    with ProcessPoolExecutor() as executor:
        futures = []
        for directory in directories:
            for root, dirs, files in tqdm(os.walk(directory, followlinks=False), desc=f"Indexing files of {directory}", unit="directory"):
                for filename in files:
                    if file_type and not filename.endswith(file_type):
                        continue
                    path = os.path.join(root, filename)
                    if not os.path.islink(path):
                        futures.append(executor.submit(file_hashing_job, path))
        hashes = defaultdict(list)
        for future in tqdm(futures, desc="Processing files", unit="file"):
            result = future.result()
            if result:
                file_hash, path = result
                hashes[file_hash].append(path)
    return {file_hash: paths for file_hash, paths in hashes.items() if len(paths) > 1}
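# Apply the chosen modification to a single duplicate, leaving original_file untouched.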
def handle_file_modification(original_file, duplicate_file, modification):
    if modification == 'delete':
        print(f"Deleting {duplicate_file}")
        os.remove(duplicate_file)
    elif modification == 'hardlink':
        os.remove(duplicate_file)
        os.link(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a hardlink to {original_file}")
    elif modification == 'symlink':
        os.remove(duplicate_file)
        os.symlink(original_file, duplicate_file)
        print(f"Replaced {duplicate_file} with a symlink to {original_file}")
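# For each group of identical files, keep one original (preferring a path outside --apply-to)
# and preview, apply, or interactively confirm the modification for the remaining duplicates.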
def handle_modification(files, modification, mode, apply_to):
    if mode == 'preview':
        if modification == 'show':
            print("Would show the following duplicate files:")
            for file in files:
                if file.startswith(tuple(apply_to)):
                    print(file)
    elif mode == 'act':
        if modification == 'delete':
            for file in files:
                if file.startswith(tuple(apply_to)):
                    print(f"Deleting {file}")
                    os.remove(file)
        elif modification == 'hardlink':
            # Implement hardlink logic here
            pass
        elif modification == 'symlink':
            # Implement symlink logic here
            pass
    elif mode == 'interactive':
        for file in files:
            if file.startswith(tuple(apply_to)):
                answer = input(f"Do you want to {modification} this file? {file} [y/N] ")
                if answer.lower() in ['y', 'yes']:
                    # Implement deletion, hardlink or symlink logic here
                    pass
    original_file = next((f for f in files if not f.startswith(tuple(apply_to))), files[0])
    for duplicate_file in files:
        if duplicate_file != original_file:
            if duplicate_file.startswith(tuple(apply_to)):
                if mode == 'preview' and modification != 'show':
                    print(f"Would perform {modification} on {duplicate_file}")
                elif mode == 'act':
                    handle_file_modification(original_file, duplicate_file, modification)
                elif mode == 'interactive':
                    answer = input(f"Do you want to {modification} this file? {duplicate_file} [y/N] ")
                    if answer.lower() in ['y', 'yes']:
                        handle_file_modification(original_file, duplicate_file, modification)
            else:
                print(f"Duplicate file (unmodified): {duplicate_file}")
        elif modification != 'show':
            print(f"Original file kept: {original_file}")
    print()
def main(args):
    directories = args.directories
    apply_to = args.apply_to or directories
    duplicates = find_duplicates(directories)
    duplicates = find_duplicates(directories,args.file_type)
    if not duplicates:
        print("No duplicates found.")
        return
    for file_hash, files in duplicates.items():
        if args.mode == 'preview' or (args.mode == 'interactive' and args.modification == 'show'):
            print(f"Duplicate files for hash {file_hash}:")
            [print(file) for file in files if file.startswith(tuple(apply_to))]
        else:
            handle_modification(files, args.modification, args.mode, apply_to)
        print(f"Duplicate files for hash {file_hash}:")
        [print(file) for file in files if file.startswith(tuple(directories))]
        handle_modification(files, args.modification, args.mode, apply_to)
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Find and handle duplicate files.")
    parser.add_argument('directories', nargs='*', help="Directories to scan for duplicates.")
    parser.add_argument('--apply-to', nargs='*', help="Directories to apply modifications to.")
    parser.add_argument('--apply-to', nargs='*', help="Filter directories to apply modifications to.")
    parser.add_argument('--modification', choices=['delete', 'hardlink', 'symlink', 'show'], default='show', help="Modification to perform on duplicates.")
    parser.add_argument('--mode', choices=['act', 'preview', 'interactive'], default='preview', help="How to apply the modifications.")
    parser.add_argument('-f', '--file-type', help="Filter by file type (e.g., '.txt' for text files).", default=None)
    args = parser.parse_args()