Deduplication-toolkit

This toolkit enables flexible, safe, and domain-aware deduplication of structured file collections, and is especially suited to natural history and museum digitization projects.

🚀 Features

  • Content-based duplicate detection using SHA256 hashes
  • Regex-based UID filtering to avoid false positives
  • Optional deletion module with safety confirmation and dry-run mode
  • Priority-based retention using file metadata (size, creation time, filename length)
  • Metadata logging to both JSON and SQLite for auditability
  • Modular scripts for transparency and custom pipelines

📂 Components

find_duplicate_files.py

  • Scans directories
  • Computes SHA256 hashes
  • Generates duplicates_report.txt
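
The grouping step behind this script can be pictured as follows. This is a minimal sketch assuming chunked SHA256 hashing; the function names are illustrative, not the script's actual API.

# Minimal sketch of hash-based duplicate grouping (names are illustrative).
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    # Hash in 1 MiB chunks so large scans stay memory-friendly.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root: str) -> dict[str, list[Path]]:
    # Group every file under root by content hash; groups of 2+ are duplicates.
    groups: dict[str, list[Path]] = defaultdict(list)
    for p in Path(root).rglob("*"):
        if p.is_file():
            groups[sha256_of(p)].append(p)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}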

refine_duplicates_by_uid.py

  • Filters detected duplicates using user-provided regex
  • Example regex: r'_+(?P<uid>[a-z0-9]+)_+'
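
The refinement idea, as a minimal sketch: it assumes a hash-equal group only counts as true duplicates when every filename yields the same UID, which may differ from the script's exact behavior.

# Sketch of UID-based refinement (assumed logic, illustrative names).
import re

def refine_group(paths: list[str], pattern: str) -> list[str]:
    # Keep a hash-duplicate group only if all filenames share one UID.
    rx = re.compile(pattern)
    uids = {m.group("uid") if (m := rx.search(p)) else None for p in paths}
    return paths if len(uids) == 1 and None not in uids else []

# Example with the regex above:
# refine_group(["scan__a1b2c3__1.txt", "scan__a1b2c3__2.txt"],
#              r'_+(?P<uid>[a-z0-9]+)_+')  -> both files kept as duplicates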

delete_duplicates.py

  • Deletes secondary duplicates after manual confirmation
  • Supports dry-run for previewing deletions
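
A minimal sketch of the dry-run pattern; the --dry-run flag appears in the usage examples below, but the internals shown here are an assumption.

# Sketch of safe deletion with dry-run (internals assumed).
import os

def delete_secondaries(paths: list[str], dry_run: bool = True) -> None:
    for p in paths:
        if dry_run:
            print(f"[dry-run] would delete {p}")  # preview only, nothing removed
        else:
            os.remove(p)
            print(f"deleted {p}")

# Real deletion only after an explicit confirmation prompt, e.g.:
# if input("Delete these files? [y/N] ").lower() == "y":
#     delete_secondaries(paths, dry_run=False)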

priority_filters.py

  • Logic for choosing which duplicate to keep:
    • Oldest
    • Largest
    • Shortest filename
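
A sketch of how the keeper could be picked per duplicate group; the strategy names and API are assumptions, and note that st_ctime only approximates creation time (on Unix it is the metadata change time).

# Sketch of priority-based keeper selection (strategy names assumed).
import os

def pick_keeper(paths: list[str], strategy: str = "oldest") -> str:
    if strategy == "oldest":
        return min(paths, key=lambda p: os.stat(p).st_ctime)
    if strategy == "largest":
        return max(paths, key=lambda p: os.stat(p).st_size)
    if strategy == "shortest_name":
        return min(paths, key=lambda p: len(os.path.basename(p)))
    raise ValueError(f"unknown strategy: {strategy}")

# Everything except pick_keeper(group, "largest") would then become a
# deletion candidate for delete_duplicates.py.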

🔧 Setup

pip install -r requirements.txt

🧪 Usage Example 1

# Step 1: Find initial duplicates
python find_duplicate_files.py /data --output duplicates_report.txt

# Step 2: Refine using UID logic
python refine_duplicates_by_uid.py duplicates_report.txt -r '_+(?P<uid>[a-z0-9]+)_+' -e .txt -o filtered_duplicates_report.txt

# Step 3: Preview and confirm deletions
python delete_duplicates.py filtered_duplicates_report.txt --dry-run

🧪 Usage Example 2

# Steps 1-2: find duplicates and refine with a hexadecimal UID pattern
python find_duplicate_files.py /data --output duplicates_report.txt
python refine_duplicates_by_uid.py duplicates_report.txt -r '(?P<uid>[a-f0-9]{6,7})' -e .txt -o filtered_duplicates_report.txt

# Step 3: log the filtered report to JSON and SQLite for auditability
python storage/json_logger.py filtered_duplicates_report.txt
python storage/sqlite_logger.py filtered_duplicates_report.txt

# Step 4: preview deletions before confirming
python delete_duplicates.py filtered_duplicates_report.txt --dry-run
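
The storage loggers record the filtered report for auditability. Below is a hedged sketch of what an SQLite audit log might look like; the actual schema and table names used by storage/sqlite_logger.py are assumptions.

# Hypothetical audit-log schema (assumed, not necessarily the script's own).
import sqlite3

def log_duplicates(db_path: str, groups: dict[str, list[str]]) -> None:
    # Persist duplicate groups (hash -> file paths) for later review.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS duplicates (hash TEXT, path TEXT)")
    con.executemany(
        "INSERT INTO duplicates (hash, path) VALUES (?, ?)",
        [(h, p) for h, paths in groups.items() for p in paths],
    )
    con.commit()
    con.close()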

📜 License

MIT
