Minimal structure around Codex to make autonomous development predictable.
The human defines the target: a feature spec (docs/features/<id>/FEATURE.md), constraints, and what “done” means at a high level. When asked to generate acceptance coverage, Codex translates behavior described in FEATURE.md (including any Gherkin scenarios) into executable black-box tests under the feature folder. Those tests become the concrete definition of success.
Codex then implements the necessary code, updates tests as needed, runs commands, fixes failures, and iterates until all checks pass. Correctness is enforced by scripts, not by the model.
AGENTS.md defines how Codex operates inside a repository: which files are authoritative, how specs are interpreted, and how optional project architecture (docs/ARCHITECTURE.md) is applied if present.
Skills provide reusable architectural patterns (for example, python-backend or frontend). They enforce cross-project consistency without polluting individual repos.
Determinism comes from two scripts, and these scripts, not the model, decide success.
$HOME/.codex/scripts/gate runs repo-wide checks such as linting, tests, and builds.
$HOME/.codex/scripts/acceptance --feature <dir> runs feature-scoped black-box tests derived from FEATURE.md.
Orchestration in this repo is script-driven (scripts/gate, scripts/acceptance, scripts/ensure_venv). No in-repo press.py wrapper is currently present.
Specs define intent. Acceptance tests define “done.” Architecture defines structure. Scripts define correctness. Codex produces code.
- AGENTS.md (repo rules): source of truth for workflow, TDD, quality bar, and required checks.
- config.toml (runtime mode): model, sandbox, approvals, and default search mode.
- skills/*/SKILL.md (task playbooks): specialized workflows for feature writing, fixing issues, backend/frontend, and architecture reviews.
- prompts/*.md (phase prompts): templates for feature implementation, gate-fix, and acceptance-fix tasks.
- scripts/* (deterministic validators): these decide pass/fail, not the model.
- Default: safe and reproducible
    sandbox_mode = "workspace-write"
    web_search = "disabled"
- Per-run overrides when needed:
  - live search:
      codex --search "..."
  - temporary sandbox override:
      codex -s danger-full-access "..."
  - temporary config override:
      codex -c web_search=\"live\" "..."
Example:
    codex
    codex --search "your prompt here"

scripts/ensure_venv
- Creates/repairs .venv, installs baseline Python tooling, validates required tools.
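
The create, install, and validate steps described for scripts/ensure_venv could be sketched roughly as follows. This is a minimal illustration and not the repository's actual script: the function names `missing_tools` and `ensure_venv` and the `BASELINE_TOOLS` list are assumptions, with the tool names borrowed from the checks the gate is described as running.

```shell
#!/bin/sh
# Hypothetical sketch of an ensure_venv-style helper; not the repository's script.

# Assumed baseline tooling (space-separated), based on the checks the gate runs.
BASELINE_TOOLS="ruff mypy pytest pip-audit"

# Print each named tool that is not an executable file inside the bin directory.
# Usage: missing_tools <bin_dir> <tool>...
missing_tools() {
  local bin_dir=$1
  shift
  local tool
  for tool in "$@"; do
    [ -x "$bin_dir/$tool" ] || printf '%s\n' "$tool"
  done
}

# Create the virtual environment if absent, install the tools, then validate them.
ensure_venv() {
  local venv_dir=${1:-.venv}
  [ -x "$venv_dir/bin/python" ] || python3 -m venv "$venv_dir" || return 1
  # Word splitting of BASELINE_TOOLS is intentional here.
  "$venv_dir/bin/python" -m pip install --quiet $BASELINE_TOOLS || return 1
  local missing
  missing=$(missing_tools "$venv_dir/bin" $BASELINE_TOOLS)
  if [ -n "$missing" ]; then
    printf 'missing tools:\n%s\n' "$missing" >&2
    return 1
  fi
}
```

Calling `ensure_venv` from the repository root would create or repair `.venv`. Keeping the validation step in its own function makes it possible to check an existing environment without reinstalling anything.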

scripts/gate
- Runs repo-wide checks.
- Python path: ruff format --check, ruff check, mypy, pytest.
- Runs pip-audit only when dependency manifest files changed.
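
One way the gate behavior described above might look in shell is sketched here. It is an illustration under assumptions, not the contents of scripts/gate: the function names, the list of manifest file names, and the choice to detect changes with `git diff --name-only HEAD` are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a gate-style script; not the repository's scripts/gate.

# Succeed if any path read from stdin looks like a dependency manifest.
# The manifest names in this pattern are assumptions for illustration.
manifest_changed() {
  grep -Eq '(^|/)(pyproject\.toml|requirements[^/]*\.txt|poetry\.lock|uv\.lock)$'
}

# Run the Python checks in order, stopping at the first failure.
run_gate() {
  ruff format --check . || return 1
  ruff check . || return 1
  mypy . || return 1
  pytest || return 1
  # Assumption: "changed" means differing from HEAD in the working tree.
  if git diff --name-only HEAD | manifest_changed; then
    pip-audit || return 1
  fi
}
```

Because every check either passes or returns non-zero, the exit status of `run_gate` is the only signal a caller needs, which matches the idea that the script, not the model, decides pass/fail.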

scripts/acceptance --feature <FEATURE_DIR>
- Runs feature-scoped acceptance checks.
- Fails if FEATURE.md is missing.
- Fails if acceptance harness is missing.
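
The two failure conditions listed for scripts/acceptance can be sketched as pre-flight checks. This is a hedged illustration only: the harness location (`acceptance/` inside the feature directory), the use of pytest to run it, and the function names are assumptions that the source does not confirm.

```shell
#!/bin/sh
# Hypothetical sketch of the pre-flight checks; not the repository's scripts/acceptance.

# Assumed harness location inside the feature directory (illustrative only).
HARNESS_DIR_NAME="acceptance"

# Return 0 if the feature directory is runnable; otherwise explain on stderr
# and return 1.
check_feature_dir() {
  local feature_dir=$1
  if [ ! -f "$feature_dir/FEATURE.md" ]; then
    echo "error: $feature_dir/FEATURE.md is missing" >&2
    return 1
  fi
  if [ ! -d "$feature_dir/$HARNESS_DIR_NAME" ]; then
    echo "error: no acceptance harness in $feature_dir/$HARNESS_DIR_NAME" >&2
    return 1
  fi
}

# Parse "--feature <dir>", validate the directory, then run the harness.
run_acceptance() {
  if [ "$#" -ne 2 ] || [ "$1" != "--feature" ]; then
    echo "usage: acceptance --feature <FEATURE_DIR>" >&2
    return 2
  fi
  check_feature_dir "$2" || return 1
  # Assumption: the harness is a pytest suite.
  pytest "$2/$HARNESS_DIR_NAME"
}
```

Failing before any test runs keeps a missing spec or harness from being mistaken for a passing feature.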
- feature: create/update docs/features/<id>/FEATURE.md
- feature-discovery: quick breadth-first discovery before coding
- fix-issue: minimal corrective change with regression checks
- python-backend: backend layering + pytest discipline
- frontend: React/Next.js UI implementation discipline
- architecture-deep-dive: deep component architecture analysis
- compare-architectures: compare current architecture vs reference
- maintainability-review: codebase maintainability and function-quality audit
- Define feature directory and spec:
    mkdir -p docs/features/<feature-id>
    $EDITOR docs/features/<feature-id>/FEATURE.md
- Prepare toolchain:
    ./scripts/ensure_venv
- Implement with Codex (default mode):
    codex "Implement docs/features/<feature-id>/FEATURE.md"
- Validate repo:
    $HOME/.codex/scripts/gate
- Validate feature:
    $HOME/.codex/scripts/acceptance --feature docs/features/<feature-id>
- If a check fails, run a focused fix pass:
    codex "Fix gate failures only. Do not add scope."
    codex "Fix acceptance for docs/features/<feature-id> only."

A feature is done when:
- Feature behavior is documented in FEATURE.md with clear Gherkin scenarios.
- Red/green evidence exists for the changed behavior.
- gate passes.
- feature acceptance passes.
- Final handoff ends with READY.
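
The validate-then-fix steps of the workflow can be wrapped in a small helper that runs both validators and prints the focused fix prompt that applies. This is a sketch under assumptions: the `validate_feature` function and the `GATE_CMD` and `ACCEPTANCE_CMD` override variables are invented for illustration and are not part of the repository.

```shell
#!/bin/sh
# Hypothetical wrapper around the two validators. GATE_CMD and ACCEPTANCE_CMD
# are names invented for this sketch so the commands can be overridden.
GATE_CMD=${GATE_CMD:-"$HOME/.codex/scripts/gate"}
ACCEPTANCE_CMD=${ACCEPTANCE_CMD:-"$HOME/.codex/scripts/acceptance"}

# Run gate, then acceptance. Validator output goes to stderr; stdout carries
# either the fix prompt to run next or a one-line success message.
validate_feature() {
  local feature_dir=$1
  if ! "$GATE_CMD" >&2; then
    echo 'codex "Fix gate failures only. Do not add scope."'
    return 1
  fi
  if ! "$ACCEPTANCE_CMD" --feature "$feature_dir" >&2; then
    echo "codex \"Fix acceptance for $feature_dir only.\""
    return 1
  fi
  echo "gate and acceptance passed for $feature_dir"
}
```

For example, `validate_feature docs/features/<feature-id>` would print the gate-fix prompt if the gate fails and would only reach the acceptance run once the gate is green, which keeps each fix pass scoped to one kind of failure.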
This structure builds on ideas from:
- The Ralph Wiggum technique (looping an agent until it converges): https://ghuntley.com/ralph/
- Spec-driven development (spec as authoritative source of intent): https://martinfowler.com/articles/exploring-gen-ai/sdd-3-tools.html
- “Shipping at Inference Speed” by Peter Steinberger (structural autonomy): https://steipete.me/posts/2025/shipping-at-inference-speed
- OpenAI’s Harness engineering writeup (environment + feedback loops as the core system): https://openai.com/index/harness-engineering/
The implementation here is minimal, but the direction is informed by these approaches.