add portability invariant and determinism CI jobs #15

@cchinchilla-dev

Description

Two invariants from the planning notes must be CI-enforced from 0.2.0 onward:

  1. Portability: pip install agentanvil (no extras) must import every top-level module and run the quickstart via DirectBackend. If any AgentLoom symbol is importable, CI fails.
  2. Determinism: record a run once, replay it 100 times, and assert byte-for-byte-identical JSON reports. One of those replays runs on a second runner (different OS or different Python minor) and must still match.

Without CI enforcement, either invariant can drift silently: portability breaks the moment a developer imports AgentLoom at module top level, and determinism breaks as soon as a non-deterministic source leaks into the report (time.time(), uuid4(), unsorted JSON keys).
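
The leaks named above can be illustrated with a minimal Python sketch. It assumes the report is a plain dict; render_report, build_report, and the injected clock and new_id callables are hypothetical names chosen for illustration, not part of agentanvil's API.

```python
import json
from typing import Callable


def render_report(report: dict) -> bytes:
    """Serialize a report so that equal content always yields equal bytes."""
    # sort_keys removes any dependence on dict insertion order; fixed
    # separators and a fixed encoding remove formatting differences.
    text = json.dumps(report, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return (text + "\n").encode("utf-8")


def build_report(results: dict, clock: Callable[[], float], new_id: Callable[[], str]) -> dict:
    """Build a report with its time and ID sources injected (hypothetical helper).

    During replay the caller passes functions that return the recorded
    values instead of calling time.time() or uuid4() directly.
    """
    return {"run_id": new_id(), "started_at": clock(), "results": results}


# Replay with recorded values: key order inside `results` does not matter.
a = build_report({"b": 1, "a": 2}, clock=lambda: 1700000000.0, new_id=lambda: "run-0001")
b = build_report({"a": 2, "b": 1}, clock=lambda: 1700000000.0, new_id=lambda: "run-0001")
assert render_report(a) == render_report(b)
```

Injecting the clock and ID source keeps the non-deterministic calls at the edge of the program, where a replay can substitute recorded values.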

Proposal

1. Portability invariant job:

# .github/workflows/ci.yml
portability-invariant:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
    - name: Install without extras
      run: |
        python -m venv /tmp/base-venv
        /tmp/base-venv/bin/pip install -e .
    - name: Verify no AgentLoom
      run: |
        /tmp/base-venv/bin/python -c "
        import importlib
        try:
            importlib.import_module('agentloom')
            raise SystemExit('FAIL: agentloom is importable under no-extras install')
        except ModuleNotFoundError:
            pass
        "
    - name: Import every top-level module
      run: |
        /tmp/base-venv/bin/python -c "
        import importlib
        for mod in ['agentanvil', 'agentanvil.core.contracts', 'agentanvil.backends', 
                   'agentanvil.backends.direct', 'agentanvil.runner.subprocess',
                   'agentanvil.evaluator.objective', 'agentanvil.evaluator.llm_judge',
                   'agentanvil.reporter', 'agentanvil.cli.main']:
            importlib.import_module(mod)
        "
    - name: Run quickstart via DirectBackend
      run: |
        cd examples/quickstart-langchain
        /tmp/base-venv/bin/pip install -r requirements.txt
        /tmp/base-venv/bin/agentanvil replay --recording recordings/quickstart.json --contract contract.yaml
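
The two import checks in this job could also be expressed as a small Python helper for local use. This is a sketch only: check_portability is a hypothetical name, and the module lists are passed in as parameters rather than assumed.

```python
import importlib
import importlib.util


def check_portability(forbidden: list[str], required: list[str]) -> list[str]:
    """Return human-readable failures; an empty list means the invariant holds."""
    failures = []
    for name in forbidden:
        # find_spec locates a top-level package without importing it and
        # returns None when the package is not installed.
        if importlib.util.find_spec(name) is not None:
            failures.append(f"{name} is importable under a no-extras install")
    for name in required:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # ModuleNotFoundError is a subclass
            failures.append(f"cannot import {name}: {exc}")
    return failures
```

In this repository it would presumably be called with forbidden=["agentloom"] and the module list from the workflow step as required, failing the test when the returned list is non-empty.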

2. Determinism invariant job:

determinism-invariant:
  strategy:
    matrix:
      os: [ubuntu-latest, macos-latest]
      python: ["3.11", "3.12", "3.13"]
  runs-on: ${{ matrix.os }}
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python }}
    - run: pip install -e .
    - name: Generate baseline report from the checked-in recording
      run: |
        agentanvil replay \
          --recording examples/quickstart-langchain/recordings/quickstart.json \
          --contract examples/quickstart-langchain/contract.yaml \
          --out-dir ./baseline
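    - name: Replay 100x and compare
      run: |
        for i in $(seq 1 100); do
          agentanvil replay \
            --recording examples/quickstart-langchain/recordings/quickstart.json \
            --contract examples/quickstart-langchain/contract.yaml \
            --out-dir ./replay-$i
          # Compare report by report: diff accepts exactly two operands, so
          # globbing both directories breaks once a run emits several files.
          for f in ./baseline/*.json; do
            cmp -s "$f" "./replay-$i/$(basename "$f")" || { echo "replay $i differs: $f"; exit 1; }
          done
        done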

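The matrix runs the job on six combinations (2 OSes x 3 Python minors) per PR, so every PR is exercised on at least one "second runner" that differs in OS and Python minor version. Each matrix entry compares its replays only against its own baseline, however; asserting a match across runners additionally requires a shared reference, such as a baseline report committed to the repository.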
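
One way to compare reports across runners is by content digest, since a digest is cheap to commit or to pass between jobs. The sketch below is an assumption about how that could look, not the project's implementation; report_digests and mismatches are hypothetical names.

```python
import hashlib
from pathlib import Path


def report_digests(out_dir: Path) -> dict[str, str]:
    """Map each JSON report's file name to the SHA-256 of its raw bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(out_dir.glob("*.json"))
    }


def mismatches(reference: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Return names of reports that are missing, extra, or differ in content."""
    names = sorted(set(reference) | set(candidate))
    return [n for n in names if reference.get(n) != candidate.get(n)]
```

A runner would compute report_digests for its own output directory and fail if mismatches against the shared reference digests is non-empty.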

Scope

  • .github/workflows/ci.yml — two new jobs.
  • tests/ci/test_portability.py — mirror of the workflow for local dev.
  • tests/ci/test_determinism_smoke.py — short smoke test (≤ 5 replays) runnable without CI.
  • scripts/determinism_stress.sh — full 100-replay script for local investigation.
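
A possible shape for the short smoke test is sketched below. The CLI flags are the ones used in the workflow above; cli_replay and replays_identical are hypothetical helper names, and the sketch assumes the CLI accepts an existing, empty --out-dir.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable


def cli_replay(recording: str, contract: str) -> Callable[[Path], None]:
    """Wrap the CLI invocation from the workflow as a callable."""
    def run(out_dir: Path) -> None:
        subprocess.run(
            ["agentanvil", "replay", "--recording", recording,
             "--contract", contract, "--out-dir", str(out_dir)],
            check=True,
        )
    return run


def replays_identical(replay: Callable[[Path], None], n: int = 5) -> bool:
    """Run `replay` once for a baseline, then n more times; compare raw bytes."""
    def snapshot(out_dir: Path) -> dict[str, bytes]:
        return {p.name: p.read_bytes() for p in sorted(out_dir.glob("*.json"))}

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        baseline_dir = root / "baseline"
        baseline_dir.mkdir()
        replay(baseline_dir)
        baseline = snapshot(baseline_dir)
        if not baseline:
            return False  # no reports written: nothing was actually compared
        for i in range(n):
            out = root / f"replay-{i}"
            out.mkdir()
            replay(out)
            if snapshot(out) != baseline:
                return False
    return True
```

Taking the replay step as a callable keeps the comparison logic testable without the CLI installed; the real test would pass cli_replay(...) with the quickstart recording and contract paths.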

Regression tests

  • portability-invariant CI job green on every PR.
  • determinism-invariant CI job green on every PR across matrix.
  • test_smoke_record_then_replay_5x_identical (in tests/ci/test_determinism_smoke.py) passes locally.

Notes

  • The 100-replay determinism gate is slow (~3 min on Linux, ~6 min on macOS); consider restricting to main branch + nightly once stable.
  • If determinism fails on macOS only, investigate float precision, dict ordering, or path separators — common platform-specific leaks.
  • Depends on: #6 "add generator module with happy_path, edge_case, and policy scenario categories" (record/replay), #020 (CLI), #021 (quickstart).
  • Blocks: all subsequent phases rely on these invariants staying green.
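
For the platform-specific leaks mentioned in the notes above, one common mitigation is to normalize values before they are serialized. The following is a sketch under assumptions: normalize is a hypothetical helper, and rounding to 9 digits is an arbitrary illustrative choice, not a project setting.

```python
from pathlib import PurePath


def normalize(value, float_digits: int = 9):
    """Recursively convert a report value into a platform-neutral form."""
    if isinstance(value, PurePath):
        # Always emit forward slashes, regardless of the host OS.
        return value.as_posix()
    if isinstance(value, float):
        # Rounding hides last-digit differences between math libraries.
        return round(value, float_digits)
    if isinstance(value, dict):
        return {str(k): normalize(v, float_digits) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v, float_digits) for v in value]
    return value
```

Key order is deliberately left to the serializer, for example json.dumps(..., sort_keys=True), so this helper only has to deal with paths and floats.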
