chore: production-readiness fixes (v0.3.0) #9
Merged
tim-dickey merged 1 commit into `main` on Mar 20, 2026
- Fix .gitignore: add .venv/, fix dist/ and build/ paths, add common patterns
- Add .github/workflows/ci.yml (proper location, replaces root-level yaml)
- Fix pyproject.toml: real author name, add license classifier, pin ruff/mypy in dev extras, add [project.urls]
- Fix cli.py: expose --min-tokens flag, move JSON pair output schema_version to top level
- Fix core.py: add logging for skipped files instead of silent None return
- Bump version to 0.3.0
Pull request overview
This PR targets “production-readiness” for the duplicate-finding tool by aligning CLI behavior with documented schemas, improving observability in core scanning, tightening packaging metadata, and adding CI automation (lint/typecheck/tests).
Changes:
- Updated CLI: added `--min-tokens` and adjusted JSON output structure for pair mode.
- Improved core robustness: log skipped files instead of silently ignoring exceptions; added/expanded type hints and formatting.
- Added CI workflow plus ruff/mypy configuration and enriched packaging metadata (version bump to 0.3.0, license/classifiers/URLs, dev tools).
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/duplicate_finder/core.py | Adds logging for skipped files and improves typing/formatting around scanning and candidate selection. |
| src/duplicate_finder/cli.py | Exposes --min-tokens and changes JSON output shape for pair mode; adds typing to CLI entrypoints. |
| src/duplicate_finder/__init__.py | Bumps library version to 0.3.0. |
| pyproject.toml | Updates project metadata, adds dev tooling deps, and introduces ruff/mypy config. |
| .gitignore | Fixes build artifact ignore patterns and adds common OS/venv ignores. |
| .github/workflows/ci.yml | Introduces GitHub Actions CI running ruff, mypy, and pytest across Python 3.9–3.12. |
Comment on lines +64 to +78

    out = {
        "schema_version": 1,
        "mode": "pairs",
        "threshold": threshold,
        "results": [
            {
                "similarity": round(sim, 4),
                "file_a": a.path,
                "file_b": b.path,
                "tokens_a": a.size,
                "tokens_b": b.size,
            }
            for sim, a, b in results
        ],
    }
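The payload shape in this hunk can be tried out on its own. The following is a minimal sketch, not the project's code: `Sig` and `build_pairs_payload` are hypothetical names introduced here for illustration, and only the dictionary layout is taken from the diff above.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Sig:
    """Hypothetical stand-in for a file signature (path plus token count)."""

    path: str
    size: int


def build_pairs_payload(results: list[tuple[float, Sig, Sig]], threshold: float) -> dict:
    """Build the pair-mode payload with schema_version at the top level."""
    return {
        "schema_version": 1,
        "mode": "pairs",
        "threshold": threshold,
        "results": [
            {
                "similarity": round(sim, 4),
                "file_a": a.path,
                "file_b": b.path,
                "tokens_a": a.size,
                "tokens_b": b.size,
            }
            for sim, a, b in results
        ],
    }


# One made-up pair, just to show the resulting structure.
payload = build_pairs_payload(
    [(0.912345, Sig("a.py", 120), Sig("b.py", 118))], threshold=0.8
)
print(json.dumps(payload, indent=2))
```

Keeping `schema_version` once at the top level means a consumer can check the version before iterating over `results`, instead of reading it from each item.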
Comment on lines +17 to +40

    +@click.option("--min-tokens", type=int, default=0, show_default=True, help="Skip files with fewer than N tokens")
     @click.option("--workers", type=int, default=0, show_default=True, help="Parallel worker processes (0 = serial)")
     @click.option("--prefilter", is_flag=True, help="Enable MinHash+LSH candidate prefiltering")
     @click.option("--minhash-perms", type=int, default=64, show_default=True, help="MinHash permutations when prefilter enabled")
    -@click.option("--lsh-bands", type=int, default=16, show_default=True, help="Number of LSH bands (must divide perms roughly)")
    +@click.option("--lsh-bands", type=int, default=16, show_default=True, help="Number of LSH bands (must roughly divide perms)")
     @click.option("--clusters", is_flag=True, help="Output duplicate clusters instead of raw pairs")
    -@click.option("--json", "--json-output", is_flag=True, help="Emit JSON instead of table")
    -def scan(path, threshold, ext, k, workers, prefilter, minhash_perms, lsh_bands, clusters, json_output):
    +@click.option("--json", "json_output", is_flag=True, help="Emit JSON instead of table")
    +def scan(
    +    path: str,
    +    threshold: float,
    +    ext: str,
    +    k: int,
    +    min_tokens: int,
    +    workers: int,
    +    prefilter: bool,
    +    minhash_perms: int,
    +    lsh_bands: int,
    +    clusters: bool,
    +    json_output: bool,
    +) -> None:
         """Scan PATH recursively for duplicate / near-duplicate files."""
         extensions = [e.strip() for e in ext.split(",") if e.strip()]
         finder = DuplicateFinder(k=k, threshold=threshold)
    -    sigs = finder.scan(path, extensions, workers=workers)
    +    sigs = finder.scan(path, extensions, min_tokens=min_tokens, workers=workers)
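The flag-to-parameter mapping in this hunk can be illustrated without installing the project. The sketch below deliberately swaps in the standard library's `argparse` as a stand-in for `click`, so it is an analogue rather than the PR's code; `build_parser` is a hypothetical helper name. It shows the two points the diff relies on: `--min-tokens` arrives as `min_tokens` with a default of 0, and `--json` is stored under the explicit name `json_output`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """argparse analogue of the click options above (not the project's CLI)."""
    parser = argparse.ArgumentParser(prog="scan")
    parser.add_argument("path")
    # Mirrors --min-tokens: the default of 0 skips no file for being too short.
    parser.add_argument(
        "--min-tokens", type=int, default=0,
        help="Skip files with fewer than N tokens",
    )
    # Mirrors --json: a single flag stored under an explicit parameter name.
    parser.add_argument(
        "--json", dest="json_output", action="store_true",
        help="Emit JSON instead of table",
    )
    return parser


args = build_parser().parse_args(["src", "--min-tokens", "25", "--json"])
defaults = build_parser().parse_args(["src"])
print(args.min_tokens, args.json_output)
print(defaults.min_tokens, defaults.json_output)
```

In the actual CLI the parsed `min_tokens` value is then forwarded to `finder.scan(...)`, as the last added line of the diff shows.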
Comment on lines +17 to +44

    - uses: actions/checkout@v4

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .[dev]

    - name: Lint
      run: |
        ruff check src/ tests/

    - name: Type check
      run: |
        mypy src/duplicate_finder --ignore-missing-imports

    - name: Run tests (excluding slow)
      run: |
        pytest -m "not slow" -v

    - name: Run slow tests (main branch only)
      if: github.event_name == 'push' && github.ref == 'refs/heads/main'
      run: |
        pytest -m slow -v
Comment on pyproject.toml

    target-version = "py39"

    [tool.ruff.lint]
    select = ["E", "F", "W", "I"]
Production Readiness Changes
This PR addresses all blocking issues for a production-quality open-source Python project.
🐛 Bug Fixes
- `cli.py`: Fixed JSON pair output. `schema_version` was incorrectly nested inside each item in the results list; it now lives at the top level along with a `mode` and `threshold` field, matching the clusters JSON structure for consistency.
- `cli.py`: Added the missing `--min-tokens` CLI flag. `DuplicateFinder.scan()` already accepted `min_tokens`, but the CLI never exposed it, making the parameter silently unreachable.

🔧 Infrastructure
- `.github/workflows/ci.yml`: Created the actual GitHub Actions workflow file in the correct location. The previous file (`GitHub Workflows CI for Python Project.yaml`) was a shell-script stub sitting at the repo root; it was never executed by GitHub Actions.
- Added lint step (`ruff`) and type-check step (`mypy`) to CI pipeline.

📦 Packaging / Metadata
- `pyproject.toml`: Fixed placeholder author (`"Your Name"` → Tim Dickey).
- Added `license`, PyPI `classifiers`, and `[project.urls]` (Homepage + Bug Tracker).
- Added `ruff>=0.4.0` and `mypy>=1.8.0` to `[dev]` extras.
- Added `[tool.ruff]` and `[tool.mypy]` config sections.
- Bumped version to `0.3.0`.

🧹 Code Quality
- `core.py`: Silent `except Exception: return None` replaced with `logger.warning(...)` so skipped files are observable at runtime.
- `core.py`: Added proper type annotations to `_compute_file_signature` and `DuplicateFinder.__init__`.
- `.gitignore`: Fixed `.dist/` → `dist/` and `.build/` → `build/` (the dot-prefixed patterns never matched the real `dist/` and `build/` directories). Added `.venv/`, `.DS_Store`, and `Thumbs.db`.
- `__init__.py`: Bumped `__version__` to `0.3.0`.

🗑️ Cleanup Needed (follow-up)
- The `GitHub Workflows CI for Python Project.yaml` file can be deleted once this PR merges (it served no functional purpose).
- The `.venv/` directory is still tracked; run `git rm -r --cached .venv/` locally after merging.
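
The `core.py` change listed under Code Quality (logging a warning instead of silently returning `None`) might look roughly like the sketch below. This is an illustration under assumptions, not the project's implementation: `read_text_or_skip` and the logger name are hypothetical, and the sketch assumes the function still returns `None` after logging so callers can keep skipping the file.

```python
import logging
from pathlib import Path
from typing import Optional

# Hypothetical logger name; the project's actual name is not shown in the PR.
logger = logging.getLogger("duplicate_finder.core")


def read_text_or_skip(path: str) -> Optional[str]:
    """Return the file's text, or log a warning and return None if unreadable."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except Exception as exc:  # broad on purpose: any failure means "skip this file"
        # The warning makes the skip visible; the None return keeps the old contract.
        logger.warning("Skipping %s: %s", path, exc)
        return None


logging.basicConfig(level=logging.WARNING)
result = read_text_or_skip("/nonexistent/definitely-missing.txt")
print(result)
```

With this pattern a scan over a tree containing unreadable or badly encoded files still completes, and each skipped path shows up in the log output rather than vanishing.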