This toolkit enables flexible, safe, and domain-aware deduplication of structured file collections, and is especially suited to natural history and museum digitization projects.
- Hash-based duplicate detection
- Regex-based UID filtering to avoid false positives
- Optional deletion module with safety confirmation and dry-run mode
- Priority-based filtering using file metadata (size, creation date)
- Modular scripts for transparency and custom pipelines
- Scans directories
- Computes SHA256 hashes
- Generates `duplicates_report.txt`
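
At its core, the detection step is content hashing plus grouping. A minimal sketch of that idea (not the exact implementation in `find_duplicate_files.py`):

```python
# Sketch only: walk a directory tree, hash every file with SHA256, and
# group paths that share a digest. Groups with more than one member are
# the duplicate candidates written to the report.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large scans stay memory-friendly."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str) -> dict[str, list[Path]]:
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```
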
- Filters detected duplicates using user-provided regex
- Example regex: `r'_+(?P<uid>[a-z0-9]+)_+'`
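
One plausible refinement rule, sketched below under that assumption, is to split each hash-identical group by the UID extracted from the filename, so content collisions between different specimen UIDs are not reported as duplicates. The helper names here are illustrative, not the script's API:

```python
# Illustrative sketch: split each hash-identical group by extracted UID.
# UID_PATTERN matches the example regex above; extract_uid and refine_group
# are hypothetical helpers, not functions exposed by refine_duplicates_by_uid.py.
import re
from collections import defaultdict
from pathlib import Path

UID_PATTERN = re.compile(r"_+(?P<uid>[a-z0-9]+)_+")

def extract_uid(path: Path) -> str | None:
    match = UID_PATTERN.search(path.name)
    return match.group("uid") if match else None

def refine_group(paths: list[Path]) -> list[list[Path]]:
    """Keep together only paths whose filenames yield the same UID."""
    by_uid: dict[str | None, list[Path]] = defaultdict(list)
    for path in paths:
        by_uid[extract_uid(path)].append(path)
    # Sub-groups of size one are no longer duplicates worth acting on.
    return [group for group in by_uid.values() if len(group) > 1]
```
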
- Deletes secondary duplicates after manual confirmation
- Supports dry-run for previewing deletions
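
The safety net is essentially a preview pass plus an explicit prompt. A minimal sketch of that pattern (the real script's flags, wording, and bookkeeping may differ):

```python
# Sketch of the dry-run / confirmation pattern used before any file is removed.
import os

def delete_duplicates(paths_to_remove: list[str], dry_run: bool = True) -> None:
    # Always print the full list first so the operator can review it.
    for path in paths_to_remove:
        prefix = "[dry-run] would delete: " if dry_run else "marked for deletion: "
        print(prefix + path)
    if dry_run:
        return
    answer = input(f"Delete {len(paths_to_remove)} files? [y/N] ").strip().lower()
    if answer != "y":
        print("Aborted; nothing was deleted.")
        return
    for path in paths_to_remove:
        os.remove(path)
```
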
- Logic for choosing which duplicate to keep:
- Oldest
- Largest
- Shortest filename
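
These criteria can be expressed as a single sort key so exactly one keeper survives per group; the combination and tie-break order below are an illustration, not necessarily how the toolkit weighs them:

```python
# Sketch: rank duplicates so the oldest, then largest, then shortest-named
# file sorts first and is kept; everything after it is a deletion candidate.
from pathlib import Path

def retention_key(path: Path) -> tuple[float, int, int]:
    stat = path.stat()
    # Smaller tuples sort first: earliest ctime, largest size (negated),
    # then shortest filename.
    return (stat.st_ctime, -stat.st_size, len(path.name))

def choose_keeper(duplicates: list[Path]) -> tuple[Path, list[Path]]:
    ordered = sorted(duplicates, key=retention_key)
    return ordered[0], ordered[1:]
```
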
- 🧬 Content-based duplicate detection using SHA256 hashes
- 🔍 UID-aware filtering using regex to avoid false positives
- 🗃️ Metadata logging to both JSON and SQLite for auditability (see the sketch after this list)
- 🧼 Safe deletion tools with dry-run and user confirmation
- ⚖️ Priority retention logic by file size, ctime, and filename length
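
A rough sketch of what dual JSON/SQLite logging can look like; the record fields and table layout below are assumptions, since the actual schema lives in `storage/json_logger.py` and `storage/sqlite_logger.py`:

```python
# Illustrative dual logging: a human-readable JSON snapshot plus a queryable
# SQLite history. Field and table names are assumptions, not the real schema.
import json
import sqlite3
from datetime import datetime, timezone

def log_duplicates(records: list[dict], json_path: str = "duplicates_log.json",
                   db_path: str = "duplicates_log.sqlite") -> None:
    stamp = datetime.now(timezone.utc).isoformat()
    # JSON: snapshot of this run for quick human review.
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump({"logged_at": stamp, "duplicates": records}, fh, indent=2)
    # SQLite: queryable audit trail across runs.
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS duplicates "
            "(logged_at TEXT, sha256 TEXT, path TEXT, size_bytes INTEGER)"
        )
        conn.executemany(
            "INSERT INTO duplicates VALUES (?, ?, ?, ?)",
            [(stamp, r["sha256"], r["path"], r.get("size_bytes")) for r in records],
        )
```
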
```bash
pip install -r requirements.txt
```

```bash
# Step 1: Find initial duplicates
python find_duplicate_files.py /data --output duplicates_report.txt

# Step 2: Refine using UID logic
python refine_duplicates_by_uid.py duplicates_report.txt -r '_+(?P<uid>[a-z0-9]+)_+' -e .txt -o filtered_duplicates_report.txt

# Step 3: Preview and confirm deletions
python delete_duplicates.py filtered_duplicates_report.txt --dry-run
```

A full run that also logs duplicate metadata to JSON and SQLite:

```bash
python find_duplicate_files.py /data --output duplicates_report.txt
python refine_duplicates_by_uid.py duplicates_report.txt -r '(?P<uid>[a-f0-9]{6,7})' -e .txt -o filtered_duplicates_report.txt
python storage/json_logger.py filtered_duplicates_report.txt
python storage/sqlite_logger.py filtered_duplicates_report.txt
python delete_duplicates.py filtered_duplicates_report.txt --dry-run
```

License: MIT