Robust hashing + duplicate discovery + safe cleanup tooling for NAS environments (Synology DSM friendly).
Safety-first design: everything is a candidate at scan time until re-verified immediately before action.
All deletion flows support dry-run, confirmation, and, in most cases, quarantine-first removal.
```sh
git clone https://github.com/yourusername/hasher.git
cd hasher
chmod +x launcher.sh
chmod +x bin/*.sh
nano local/paths.txt   # add directories to scan
./launcher.sh          # menu-driven launcher
```

Notes:
- The launcher is menu-driven; no flags on the launcher itself.
- Direct hashing: `bin/hasher.sh --pathfile local/paths.txt`
- Run duplicate-folder detection before duplicate-file detection for the fastest wins.
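`local/paths.txt` takes the directories to scan. A hypothetical example (these paths are illustrative, not shipped defaults), assuming one directory per line:

```
/volume1/photos
/volume1/documents
/volume1/backups
```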
A project by James Wintermute
Contact: jameswintermute@protonmail.ch
Originally started in Dec 2022, now a fully featured NAS dedupe & hygiene suite.
👉 For full history see: version-history.md
Hasher helps protect NAS-stored data by:
- Generating cryptographic hashes (sha256 default)
- Detecting silent corruption (bitrot, ransomware, filesystem drift)
- Verifying backup rotation integrity
- Finding duplicate folders (exact tree-level matches)
- Finding duplicate files (deep review)
- Safely applying dedupe plans with quarantine
- Identifying zero-length files
- Cleaning junk / system artefacts
- Maintaining long-term NAS hygiene
- BusyBox / Synology DSM compatible
- Pure POSIX `sh`
- Uses standard tools: `awk`, `sort`, `stat`, `find`, `rm`, `mv`
- Recommended: install under the same volume you scan (e.g., `/volume1/hasher`)
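As a flavour of what "pure POSIX `sh` plus standard tools" looks like in practice, here is a minimal sketch (not Hasher's actual code) that walks a tree and emits `sha256,size,path` rows; the CSV layout is an assumption for illustration:

```sh
# Hash every regular file under a directory into simple CSV rows.
# Uses only find, sha256sum, awk, wc and tr, as available on Synology DSM.
hash_tree() {
    # $1: directory to scan; emits one "sha256,size,path" row per file
    find "$1" -type f | while IFS= read -r f; do
        h=$(sha256sum "$f" | awk '{print $1}')   # hash only, drop filename
        s=$(wc -c < "$f" | tr -d ' ')            # size in bytes
        printf '%s,%s,%s\n' "$h" "$s" "$f"
    done
}
```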
```sh
./launcher.sh   # Option 1
```

Outputs:
- `hashes/hasher-YYYY-MM-DD.csv`
- `logs/background.log`
- Zero-length candidates → `zero-length/`
```sh
bin/find-duplicate-folders.sh --input hashes/<hashfile>.csv --mode plan
```

Produces:
- `logs/duplicate-folders-plan-*.txt`
This is the highest-value and lowest-risk dedupe stage.
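To illustrate the idea behind exact tree-level matching (an assumption about the approach, not the script's actual algorithm): a folder's digest can be the hash of its files' `(hash, relative path)` pairs in sorted order, so two folders with identical contents produce identical digests:

```sh
# Digest a folder by hashing the sorted list of its files' hashes and
# relative paths; identical trees yield identical digests.
folder_digest() {
    ( cd "$1" && find . -type f | sort | while IFS= read -r f; do
          printf '%s %s\n' "$(sha256sum "$f" | awk '{print $1}')" "$f"
      done ) | sha256sum | awk '{print $1}'
}
```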
```sh
bin/apply-folder-plan.sh --plan logs/duplicate-folders-plan-*.txt --force
```

Folders are moved to quarantine unless configured otherwise.
```sh
bin/find-duplicates.sh --input hashes/<hashfile>.csv
```

Generates:
- `logs/YYYY-MM-DD-duplicate-hashes.txt`
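Duplicate-file detection reduces to grouping scan rows by hash. A sketch of that grouping (assuming a `sha256,size,path` CSV layout; Hasher's real report format may differ):

```sh
# Print each hash that occurs more than once in the CSV, followed by the
# indented list of paths sharing it.
report_duplicates() {
    awk -F',' 'NF >= 3 { seen[$1]++; paths[$1] = paths[$1] "\n  " $3 }
        END { for (h in seen) if (seen[h] > 1) print h paths[h] }' "$1"
}
```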
```sh
bin/review-duplicates.sh --from-report logs/<report>.txt
```

Features:
- Keep-one-delete-rest
- Sorting orders (size, sizesmall, name, mtime)
- Exception skip list (`local/exceptions-hashes.txt`)
- Progress bars & ETA
- Safe numeric input
- BusyBox compatible

Outputs:
- `logs/review-dedupe-plan-*.txt`
```sh
bin/delete-duplicates.sh --from-plan <plan> --force
```

Supports:
- `--quarantine <dir>`
- Multi-pass verify
- Dry-run before destructive action
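A sketch of what multi-pass verification before removal can look like (an assumption about the mechanism, not the script's actual code): confirm the file still exists and still hashes to the value recorded at scan time, then quarantine rather than delete:

```sh
# Re-verify a planned path immediately before acting on it; move it to a
# quarantine directory instead of deleting outright.
safe_remove() {
    # $1: expected sha256, $2: path, $3: quarantine dir
    [ -f "$2" ] || return 1
    actual=$(sha256sum "$2" | awk '{print $1}')
    [ "$actual" = "$1" ] || return 1    # file changed since scan: abort
    mkdir -p "$3" && mv "$2" "$3"/
}
```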
```sh
bin/delete-zero-length.sh --verify-only
bin/delete-zero-length.sh --force
```

```sh
bin/delete-junk.sh --paths-file local/paths.txt --dry-run
```

Uses:
- `local/junk-extensions.txt`

Shows a preview with sizes, totals, and top offenders.
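A sketch of how an extension list can drive junk matching (this assumes one glob pattern per line in the patterns file, e.g. `*.tmp` or `.DS_Store`; the real format of `local/junk-extensions.txt` may differ):

```sh
# List files under a root whose names match any pattern in a pattern file.
find_junk() {
    # $1: root to scan, $2: file with one glob pattern per line
    while IFS= read -r pat; do
        [ -n "$pat" ] && find "$1" -type f -name "$pat"
    done < "$2"
}
```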
```sh
bin/hash-check.sh <sha256>
```

Locates all matching files across scanned volumes.
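One cheap way to answer "where does this hash live?" is to consult a previous scan's CSV instead of re-walking the volumes. A sketch under that assumption (using the illustrative `sha256,size,path` layout; the real script may behave differently):

```sh
# Print every recorded path whose first CSV field equals the given hash.
lookup_hash() {
    # $1: sha256 to find, $2: scan CSV
    awk -F',' -v h="$1" '$1 == h { print $3 }' "$2"
}
```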
Shows:
- Hash run count
- Latest CSV
- Dedupe plan count
- Cron template examples
Launcher → Option 14 deletes everything under `var/` but leaves hashes and logs intact.
Hasher uses an override hierarchy:
- `default/hasher.conf`
- `local/hasher.conf`
- `local/paths.txt`
- `local/excludes.txt`
- `local/exceptions-hashes.txt`
- `local/junk-extensions.txt`
Typical fields:

```sh
EXCLUDES_FILE=local/excludes.txt
LOW_VALUE_THRESHOLD_BYTES=0
ZERO_APPLY_EXCLUDES=false
QUARANTINE_DIR="/volume1/hasher/quarantine-$(date +%F)"
```

Precedence:
CLI flags > `local/hasher.conf` > `default/hasher.conf` > `excludes.txt` > built-ins
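The layered-config idea can be sketched as sourcing defaults first, then letting local values override them (file names from this README; the real loader may differ, and CLI flags would be applied after both):

```sh
# Load default config, then overlay local overrides if present.
load_config() {
    [ -f default/hasher.conf ] && . ./default/hasher.conf
    [ -f local/hasher.conf ] && . ./local/hasher.conf
    return 0
}
```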
```
├── bin/
│   ├── apply-file-plan.sh
│   ├── apply-folder-plan.sh
│   ├── check-deps.sh
│   ├── clean-logs.sh
│   ├── csv-dedupe-by-path.sh
│   ├── csv-quick-stats.sh
│   ├── delete-duplicates.sh
│   ├── delete-junk.sh
│   ├── delete-zero-length.sh
│   ├── du-summary.sh
│   ├── find-duplicate-folders.sh
│   ├── find-duplicates.sh
│   ├── hash-check.sh
│   ├── hasher.sh
│   ├── launch-review.sh
│   ├── lib_paths.sh
│   ├── review-batch.sh
│   ├── review-duplicates.sh
│   ├── review-junk.sh
│   ├── review-latest.sh
│   ├── run-find-duplicates.sh
│   └── schedule-hasher.sh
│
├── default/
│   └── hasher.conf
│
├── local/
│   ├── exceptions-hashes.txt
│   ├── excluded-from-dedup.txt
│   ├── excludes.txt
│   ├── hasher.conf
│   ├── junk-extensions.txt
│   └── paths.txt
│
├── logs/
│   └── .gitignore
│
├── var/
│   └── .gitignore
│
├── hashes/          # generated at runtime
├── zero-length/     # generated at runtime
│
├── launcher.sh
├── LICENSE
├── .gitignore
├── README.md
└── version-history.md
```
- All destructive actions require explicit `--force`
- All plans re-verify paths before removal
- Quarantine-first deletion where possible
- Extensive dry-run support
- CRLF-safe path handling
- BusyBox-tested execution paths
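CRLF-safe path handling can be illustrated with a small sketch (an illustration of the principle, not Hasher's code): strip trailing carriage returns before treating each line as a path, so path lists edited on Windows still resolve:

```sh
# Read a path-list file, dropping carriage returns and blank lines.
read_paths() {
    tr -d '\r' < "$1" | while IFS= read -r p; do
        [ -n "$p" ] && printf '%s\n' "$p"
    done
}
```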
Sizes show as "??" in duplicate review
→ The system running `review-duplicates.sh` cannot stat the NAS paths. Run reviews directly on the NAS (via SSH).
CSV appears corrupted
→ Fix CRLF line endings:

```sh
sed -i 's/\r$//' file.csv
```

Duplicate plan seems incomplete
→ Always run folder-dedupe before file-dedupe.
GPLv3.
Facebook — Silent Data Corruption
https://engineering.fb.com/2021/02/23/data-infrastructure/silent-data-corruption/