chore: production-readiness fixes (v0.3.0)#9

Merged
tim-dickey merged 1 commit into main from production-ready
Mar 20, 2026

Conversation

@tim-dickey
Owner

Production Readiness Changes

This PR addresses all blocking issues for a production-quality open-source Python project.

🐛 Bug Fixes

  • cli.py: Fixed JSON pair output — schema_version was incorrectly nested inside each item in the results list; it now lives at the top level along with a mode and threshold field, matching the clusters JSON structure for consistency.
  • cli.py: Added missing --min-tokens CLI flag — DuplicateFinder.scan() already accepted min_tokens but the CLI never exposed it, making the parameter silently unreachable.
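
For reviewers who consume the JSON output, the new pair-mode shape can be sketched roughly as below. This is only an illustration under assumptions: `build_pairs_payload`, `read_results`, and the sample paths are hypothetical names, not functions in this repository; only the top-level keys (`schema_version`, `mode`, `threshold`, `results`) come from this PR.

```python
import json
from typing import Any, Dict, List, Tuple

# Hypothetical stand-in for a scan result: (similarity, path_a, path_b).
Pair = Tuple[float, str, str]


def build_pairs_payload(pairs: List[Pair], threshold: float) -> Dict[str, Any]:
    """Build a pair-mode JSON document with metadata at the top level."""
    return {
        "schema_version": 1,
        "mode": "pairs",
        "threshold": threshold,
        "results": [
            # Items carry only per-pair data; no schema_version here.
            {"similarity": round(sim, 4), "file_a": a, "file_b": b}
            for sim, a, b in pairs
        ],
    }


def read_results(document: str) -> List[Dict[str, Any]]:
    """Consumer side: check the version once, then return the result items."""
    payload = json.loads(document)
    if payload.get("schema_version") != 1:
        raise ValueError("unsupported schema_version")
    return payload["results"]


if __name__ == "__main__":
    doc = json.dumps(build_pairs_payload([(0.987654, "a.py", "b.py")], 0.8))
    print(read_results(doc))
```

With the version at the top level, a consumer checks it once per document instead of once per result item.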

🔧 Infrastructure

  • .github/workflows/ci.yml: Created the actual GitHub Actions workflow file in the correct location. The previous file (GitHub Workflows CI for Python Project.yaml) was a shell-script stub sitting at the repo root — it was never executed by GitHub Actions.
  • Added lint step (ruff) and type-check step (mypy) to CI pipeline.

📦 Packaging / Metadata

  • pyproject.toml: Fixed placeholder author ("Your Name" → Tim Dickey).
  • Added license, PyPI classifiers, [project.urls] (Homepage + Bug Tracker).
  • Added ruff>=0.4.0 and mypy>=1.8.0 to [dev] extras.
  • Added [tool.ruff] and [tool.mypy] config sections.
  • Bumped version to 0.3.0.

🧹 Code Quality

  • core.py: Silent except Exception: return None replaced with logger.warning(...) so skipped files are observable at runtime.
  • core.py: Added proper type annotations to _compute_file_signature and DuplicateFinder.__init__.
  • .gitignore: Fixed .dist/ → dist/ and .build/ → build/ (the dot-prefixed patterns never matched anything). Added .venv/, .DS_Store, and Thumbs.db.
  • __init__.py: Bumped __version__ to 0.3.0.
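
The core.py logging change listed above can be pictured roughly as follows. This is a minimal sketch, not the actual body of `_compute_file_signature`: the helper name `read_text_or_skip` and the log message format are assumptions made for illustration.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def read_text_or_skip(path: Path) -> Optional[str]:
    """Return the file's text, or None (with a logged warning) if unreadable."""
    try:
        return path.read_text(encoding="utf-8")
    except Exception as exc:
        # Previously the failure was swallowed silently; now it is reported,
        # so a skipped file shows up in the logs at WARNING level.
        logger.warning("Skipping %s: %s", path, exc)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(read_text_or_skip(Path("no_such_dir/missing.py")))
```

The return value is unchanged (still `None` on failure), so callers that already filter out `None` keep working; the only difference is that the skip becomes visible.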

🗑️ Cleanup Needed (follow-up)

  • The root-level GitHub Workflows CI for Python Project.yaml file can be deleted once this PR merges (it served no functional purpose).
  • The .venv/ directory is still tracked — run git rm -r --cached .venv/ locally after merging.

- Fix .gitignore: add .venv/, fix dist/ and build/ paths, add common patterns
- Add .github/workflows/ci.yml (proper location, replaces root-level yaml)
- Fix pyproject.toml: real author name, add license classifier, pin ruff/mypy in dev extras, add [project.urls]
- Fix cli.py: expose --min-tokens flag, fix JSON pair output schema_version to top level
- Fix core.py: add logging for skipped files instead of silent None return
- Bump version to 0.3.0
Copilot AI review requested due to automatic review settings March 19, 2026 02:04

Copilot AI left a comment

Pull request overview

This PR targets “production-readiness” for the duplicate-finding tool by aligning CLI behavior with documented schemas, improving observability in core scanning, tightening packaging metadata, and adding CI automation (lint/typecheck/tests).

Changes:

  • Updated CLI: added --min-tokens and adjusted JSON output structure for pair mode.
  • Improved core robustness: log skipped files instead of silently ignoring exceptions; added/expanded type hints and formatting.
  • Added CI workflow plus ruff/mypy configuration and enriched packaging metadata (version bump to 0.3.0, license/classifiers/URLs, dev tools).

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/duplicate_finder/core.py | Adds logging for skipped files and improves typing/formatting around scanning and candidate selection. |
| src/duplicate_finder/cli.py | Exposes --min-tokens and changes JSON output shape for pair mode; adds typing to CLI entrypoints. |
| src/duplicate_finder/__init__.py | Bumps library version to 0.3.0. |
| pyproject.toml | Updates project metadata, adds dev tooling deps, and introduces ruff/mypy config. |
| .gitignore | Fixes build artifact ignore patterns and adds common OS/venv ignores. |
| .github/workflows/ci.yml | Introduces GitHub Actions CI running ruff, mypy, and pytest across Python 3.9–3.12. |

Comment on lines +64 to +78
out = {
    "schema_version": 1,
    "mode": "pairs",
    "threshold": threshold,
    "results": [
        {
            "similarity": round(sim, 4),
            "file_a": a.path,
            "file_b": b.path,
            "tokens_a": a.size,
            "tokens_b": b.size,
        }
        for sim, a, b in results
    ],
}
Comment on lines +17 to +40
+ @click.option("--min-tokens", type=int, default=0, show_default=True, help="Skip files with fewer than N tokens")
  @click.option("--workers", type=int, default=0, show_default=True, help="Parallel worker processes (0 = serial)")
  @click.option("--prefilter", is_flag=True, help="Enable MinHash+LSH candidate prefiltering")
  @click.option("--minhash-perms", type=int, default=64, show_default=True, help="MinHash permutations when prefilter enabled")
- @click.option("--lsh-bands", type=int, default=16, show_default=True, help="Number of LSH bands (must divide perms roughly)")
+ @click.option("--lsh-bands", type=int, default=16, show_default=True, help="Number of LSH bands (must roughly divide perms)")
  @click.option("--clusters", is_flag=True, help="Output duplicate clusters instead of raw pairs")
- @click.option("--json", "--json-output", is_flag=True, help="Emit JSON instead of table")
- def scan(path, threshold, ext, k, workers, prefilter, minhash_perms, lsh_bands, clusters, json_output):
+ @click.option("--json", "json_output", is_flag=True, help="Emit JSON instead of table")
+ def scan(
+     path: str,
+     threshold: float,
+     ext: str,
+     k: int,
+     min_tokens: int,
+     workers: int,
+     prefilter: bool,
+     minhash_perms: int,
+     lsh_bands: int,
+     clusters: bool,
+     json_output: bool,
+ ) -> None:
      """Scan PATH recursively for duplicate / near-duplicate files."""
      extensions = [e.strip() for e in ext.split(",") if e.strip()]
      finder = DuplicateFinder(k=k, threshold=threshold)
-     sigs = finder.scan(path, extensions, workers=workers)
+     sigs = finder.scan(path, extensions, min_tokens=min_tokens, workers=workers)
Comment on lines +17 to +44
- uses: actions/checkout@v4

- name: Set up Python ${{ matrix.python-version }}
  uses: actions/setup-python@v5
  with:
    python-version: ${{ matrix.python-version }}

- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -e .[dev]

- name: Lint
  run: |
    ruff check src/ tests/

- name: Type check
  run: |
    mypy src/duplicate_finder --ignore-missing-imports

- name: Run tests (excluding slow)
  run: |
    pytest -m "not slow" -v

- name: Run slow tests (main branch only)
  if: github.event_name == 'push' && github.ref == 'refs/heads/main'
  run: |
    pytest -m slow -v
target-version = "py39"

[tool.ruff.lint]
select = ["E", "F", "W", "I"]
@tim-dickey tim-dickey merged commit 4527e20 into main Mar 20, 2026
4 of 8 checks passed