Automation: Direction #39

@github-actions

Last generated: 2026-01-22T18:36:57.463Z
Provider: openai
Model: gpt-5.2

Summary

This repo is a small Python package with CLI entrypoints (soft404/__main__.py), model artifacts (soft404/clf.joblib), and tests (tests/). Automation is currently noisy and duplicative (many .github/workflows/auto-* workflows), and the repo tracks generated/local tooling artifacts (.bish.sqlite, .bish-index) and a large jar (bfg-1.15.0.jar) that likely do not belong in the repository. The primary reliability gains come from: (1) standardizing CI around tox, (2) tightening packaging and model-artifact handling, and (3) reducing workflow sprawl to a small set of deterministic checks.

Direction (what and why)

Direction: Make CI deterministic and lightweight by centering it on tox + a single GitHub Actions CI workflow, while cleaning up tracked/generated artifacts and ensuring the CLI + prediction path is tested consistently.

Why:

  • tox.ini exists, but CI appears to be dominated by many “auto-*” workflows that can cause duplicated runs, inconsistent environments, and maintenance overhead.
  • Tracked .bish.sqlite / .bish-index files and bfg-1.15.0.jar inflate the repo and can introduce accidental churn; they also make actions slower and PRs noisier.
  • The package includes a trained model file; we should ensure tests don’t depend on network, and clarify whether the model is shipped in the wheel/sdist and loaded reliably at runtime.

Plan (next 1-3 steps)

1) Add a single, canonical CI workflow that runs tox (and make it the required check)

Files:

  • Add: .github/workflows/ci.yml
  • Potentially update: tox.ini (only if it does not currently run on the Python versions in the matrix)

Concrete workflow content (minimal and stable):

  • Trigger: pull_request, push to master
  • Matrix: Python 3.10, 3.11, 3.12 (or align with what tox.ini supports)
  • Steps:
    • actions/checkout@v4
    • actions/setup-python@v5
    • pip install -U pip setuptools wheel tox
    • tox -q

Why this step first: it gives a single source of truth for correctness, reduces reliance on the “auto-*” ecosystem, and improves signal-to-noise for PRs.

Optional hardening (small):

  • Add pip caching via actions/setup-python cache: pip
  • Set TOX_SKIP_MISSING_INTERPRETERS=false (or skip_missing_interpreters = false in tox.ini) so missing envs fail fast, or true if you want local dev friendliness. Decide explicitly.

2) Stop tracking generated/local artifacts; prevent future churn

Files:

  • Update .gitignore to include:
    • **/.bish.sqlite
    • **/.bish-index
    • *.sqlite
  • Remove from git index (in a PR):
    • .bish-index, .bish.sqlite
    • crawler/.bish-*, notebooks/.bish-*, soft404/.bish-*, tests/.bish-*
    • .github/.bish.sqlite

Command sequence (one-time cleanup):

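  • git rm -f --cached -- ':(glob)**/.bish-index' ':(glob)**/.bish.sqlite'
    (the quotes make git, not the shell, expand the patterns, so the root-level files and those under dot-directories such as .github/ are matched in one command)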
  • Commit with message like: chore: remove bish artifacts from repo

Why: reduces PR noise and avoids accidental binary diffs; improves clone/CI performance.
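To confirm the cleanup took effect, a small script can list any tracked paths that still look like bish artifacts. The following is a minimal sketch, not part of the repo: the names find_bish_artifacts and tracked_files are hypothetical, and the artifact names are taken from the list above.

```python
import subprocess
from pathlib import PurePosixPath

# File names treated as bish artifacts (from the cleanup list above).
ARTIFACT_NAMES = {".bish-index", ".bish.sqlite"}


def find_bish_artifacts(paths):
    """Return the paths whose final component is a known bish artifact."""
    return [p for p in paths if PurePosixPath(p).name in ARTIFACT_NAMES]


def tracked_files(repo_dir="."):
    """List files tracked by git in repo_dir (requires git on PATH)."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # -z separates entries with NUL bytes, which is safe for unusual file names.
    return [p for p in out.split("\0") if p]
```

Calling find_bish_artifacts(tracked_files()) from the repo root should return an empty list once the artifacts are untracked; a non-empty result could be used to fail a CI step.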

3) Make packaging/model loading explicit and tested

Files to inspect/update:

  • setup.py, MANIFEST.in, soft404/utils.py (or wherever model is loaded), tests/test_predict.py

Concrete actions:

  • Ensure soft404/clf.joblib is included in wheel/sdist:
    • In setup.py, include package_data={"soft404": ["clf.joblib"]} (or use include_package_data=True with MANIFEST.in listing it).
    • In MANIFEST.in, add: include soft404/clf.joblib
  • Ensure model loading uses importlib.resources (Py3.9+) instead of relative filesystem assumptions:
    • e.g., importlib.resources.files("soft404").joinpath("clf.joblib")
  • Add/adjust a test that installs the package in an isolated env (tox can do this) and runs:
    • python -m soft404 --help
    • a minimal prediction call that exercises loading the packaged model

Why: prevents “works from source tree but fails when installed” issues, especially important for CLI tools.
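The importlib.resources approach above could look roughly like the sketch below. It assumes the model is loaded with joblib.load and that the file ships as soft404/clf.joblib; the function names load_packaged_resource and load_model are hypothetical and do not come from the repo.

```python
from importlib import resources


def load_packaged_resource(package, name, loader):
    """Locate `name` inside an installed package and pass its path to `loader`.

    as_file() yields a real filesystem path even when the package is
    imported from a zip, extracting to a temporary file if needed.
    """
    resource = resources.files(package).joinpath(name)
    with resources.as_file(resource) as path:
        return loader(path)


def load_model():
    """Hypothetical helper: load soft404/clf.joblib via joblib."""
    import joblib  # imported lazily so the helper above stays stdlib-only

    return load_packaged_resource("soft404", "clf.joblib", joblib.load)
```

Keeping the loader as a parameter makes the lookup testable without the model or joblib present, for example by reading a file from a standard-library package.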

Risks/unknowns

  • Workflow sprawl: Many .github/workflows/auto-*.yml may be centrally managed; deleting them could conflict with org policy. If you can’t remove them, at least ensure they don’t duplicate CI, or restrict them to the workflow_dispatch trigger so they run only when started manually.
  • Python compatibility: scrapy, scipy, and lxml can constrain Python versions. Confirm supported versions in tox.ini and setup.py classifiers.
  • Model artifact size and licensing: soft404/clf.joblib is a binary; ensure it’s acceptable to ship and that its size doesn’t bloat releases. If it’s large or frequently changing, consider moving to Git LFS or downloading in a controlled release step (but that adds complexity).
  • bfg-1.15.0.jar: If this was committed for a one-off history rewrite, it should probably be removed. But removing it changes repo content; verify it’s not referenced in docs/workflows.

Suggested tests

  1. CI via tox
    • tox -q locally and in GitHub Actions.
  2. CLI smoke tests
    • python -m soft404 --help
    • python -m soft404 predict <known_input> (or equivalent CLI if documented in README.rst)
  3. Packaging test
    • In tox, add an env that does python -m pip install . (or builds wheel) and runs a prediction that loads clf.joblib.
  4. Unit tests
    • Existing: pytest -q via tox.
  5. Repo hygiene
    • Verify git status is clean after running typical dev commands; .bish* should not appear.
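The CLI smoke test from item 2 could be driven by a helper like the one below. This is a sketch under the assumption that the soft404 CLI accepts --help and exits with status 0; the helper name run_module_help is hypothetical.

```python
import subprocess
import sys


def run_module_help(module):
    """Run `python -m <module> --help` with the current interpreter."""
    return subprocess.run(
        [sys.executable, "-m", module, "--help"],
        capture_output=True,
        text=True,
        timeout=60,  # fail instead of hanging if the CLI waits for input
    )
```

In a pytest test this would be used as result = run_module_help("soft404") followed by assert result.returncode == 0, result.stderr. Using sys.executable keeps the check inside the tox-managed environment, so it exercises the installed package rather than whatever python is first on PATH.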

Verification checklist (quick)

  • .github/workflows/ci.yml runs on PRs and passes on master
  • tox is the single canonical command developers can run
  • .bish.sqlite / .bish-index no longer tracked and are ignored
  • soft404/clf.joblib is included in built distributions and loadable when installed
  • CLI smoke test passes in CI
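The "included in built distributions" item can be checked mechanically by inspecting the built archives. The sketch below assumes a wheel and an sdist have already been built (for example into dist/); the function names wheel_contains and sdist_contains are hypothetical.

```python
import tarfile
import zipfile


def wheel_contains(wheel_path, member):
    """Return True if the wheel (a zip archive) contains `member`."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return member in wheel.namelist()


def sdist_contains(sdist_path, member):
    """Return True if the sdist tarball contains `member`.

    sdist entries are prefixed with a top-level '<name>-<version>/'
    directory, which is stripped before comparing.
    """
    with tarfile.open(sdist_path) as sdist:
        return any(n.split("/", 1)[-1] == member for n in sdist.getnames())
```

For this repo the member to look for would be soft404/clf.joblib in both archives.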

Labels: automation (Automation-generated direction and planning)