Fail-Closed Security Capability Evaluation Framework
Protocolized intelligence · evidence-first evaluation · execution-grade hardening · complete community edition
Live Index · System Map · Protocol Explorer · Benchmark Dashboard · Doctrine Index
A complete community edition of an audit-grade evaluation framework for security roles and frontier prompt architecture. It includes:
- role protocols
- an execution-grade GPT5.4 hardening layer
- machine-readable schemas
- runnable examples
- a synthetic benchmark demo with manifests, logs, and metric scripts
- Pages-ready visual presentation
- licensing and commercial-permissions templates
This repository is a runnable demonstration framework with executable examples, schemas, tools, and licensing boundaries already in place.
Every existing security capability evaluation framework has the same flaw: it fails open.
Missing evidence → evaluator fills the gap with narrative judgment. Confident-sounding candidate → scores drift upward. Self-attested credentials → treated as evidence.
This repository is the correction.
Six protocols. Fail-closed gates. Evidence Confidence tiering that caps scores when artifacts can't be independently verified. Anti-gaming enforcement that detects artifact reuse and self-referential approval loops. Prompt injection resistance that treats artifact content as data, not instructions.
The first system where the evaluation cannot be fooled by the candidate being better at communicating than at security engineering.
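The fail-closed cap described above can be sketched in a few lines. Everything here is illustrative: the tier names, the cap values, and the field names are assumptions for demonstration, not the framework's actual schema (which lives under schemas/).

```python
from dataclasses import dataclass
from enum import Enum

class EvidenceTier(Enum):
    # Hypothetical tiers: how independently verifiable an artifact is.
    VERIFIED = "verified"            # independently reproduced / hash-matched
    ATTRIBUTABLE = "attributable"    # traceable to a source, not reproduced
    SELF_ATTESTED = "self_attested"  # the candidate's own claim only
    MISSING = "missing"              # no artifact at all

# Illustrative score ceilings: weaker evidence caps the score instead of
# letting the evaluator fill the gap with narrative judgment.
TIER_CAPS = {
    EvidenceTier.VERIFIED: 100,
    EvidenceTier.ATTRIBUTABLE: 70,
    EvidenceTier.SELF_ATTESTED: 40,
    EvidenceTier.MISSING: 0,  # fail closed: no evidence, no points
}

@dataclass
class Claim:
    raw_score: int       # evaluator's score for the claimed capability
    tier: EvidenceTier   # best evidence tier supporting the claim

def gated_score(claim: Claim) -> int:
    """Apply the fail-closed cap: evidence quality bounds the score."""
    return min(claim.raw_score, TIER_CAPS[claim.tier])

# A confident narrative with no verifiable artifact scores zero:
assert gated_score(Claim(raw_score=95, tier=EvidenceTier.MISSING)) == 0
assert gated_score(Claim(raw_score=95, tier=EvidenceTier.SELF_ATTESTED)) == 40
```

The key property is that the cap is applied after scoring, so a persuasive candidate can raise `raw_score` but can never raise the ceiling without better artifacts.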
Community tier:
- LICENSE.txt / LICENSE.md: Audit-Grade Community License 1.0 (AGCL-1.0)
Commercial tier:
- SOVEREIGN_TERMS.md
- COMMERCIAL_LICENSE_AGREEMENT_TEMPLATE.md
- NOTICE.md
- TRADEMARK_USAGE_GUIDELINES.md
The community tier is restrictive by design: non-commercial sharing of complete unmodified copies only; no commercial use; no white-labeling; no AI training / embedding / RAG ingestion.
```
python tools/reference_runner.py examples/worked-example/SE_WORKED_EXAMPLE_INPUT.json --output examples/worked-example/SE_WORKED_EXAMPLE_OUTPUT.json
python tools/validate_json.py schemas/evaluation-result.schema.json examples/worked-example/SE_WORKED_EXAMPLE_OUTPUT.json
python benchmark/benchmark_runner.py
```

Outputs are written to:

- examples/worked-example/
- benchmark/results/
- benchmark/metrics.json
- benchmark/logs/demo_benchmark_run.log
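The validation step above checks the output file against a machine-readable schema. As a rough illustration of what that kind of check involves, here is a minimal stand-in using only the standard library; the field names are hypothetical, and the real contract is schemas/evaluation-result.schema.json, enforced by tools/validate_json.py:

```python
import json

# Hypothetical minimal contract; the real schema lives in
# schemas/evaluation-result.schema.json.
REQUIRED_FIELDS = {"role", "scores", "evidence_trace"}

def validate_result(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - doc.keys())]
    if "scores" in doc and not isinstance(doc["scores"], dict):
        problems.append("scores must be an object")
    return problems

doc = json.loads('{"role": "security-engineer", "scores": {}}')
print(validate_result(doc))  # reports the missing evidence_trace field
```

A full JSON Schema validator adds type, enum, and nesting checks, but the fail-closed shape is the same: any deviation from the contract is reported rather than silently tolerated.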
```
.
├── index.html
├── system-map.html
├── protocol-explorer.html
├── benchmark-dashboard.html
├── doctrine.html
├── execution-layer.html
├── LICENSE.txt
├── LICENSE.md
├── NOTICE.md
├── COMMERCIAL_LICENSE_AGREEMENT_TEMPLATE.md
├── TRADEMARK_USAGE_GUIDELINES.md
├── CHANGELOG.md
├── ROADMAP.md
├── VERSION
├── robots.txt
├── protocols/
├── schemas/
├── execution/
├── examples/
├── benchmark/
├── docs/
├── tools/
├── self-evaluation/
└── assets/
```
The repository includes a full end-to-end worked example under examples/worked-example/:
- input bundle
- output result
- evidence trace CSV
- worked-example notes
The included benchmark is a synthetic harness-validation benchmark. It proves completeness, reproducibility, and execution path for the framework itself. It does not claim independently verified real-world superiority over frontier models.
- MANIFEST.json provides SHA-256 fingerprints for repository files
- tools/verify_manifest.py verifies repository integrity
- schemas/ defines machine-readable contracts
- execution/ documents the deterministic run path
Deterministic self-evaluation artifacts live under self-evaluation/:
- self-evaluation/kriterion-self-eval-bundle.json
- self-evaluation/kriterion-self-eval-result.json
- self-evaluation/REPRODUCIBILITY.md
Local verification:
```
python tools/validate_json.py schemas/reference-input-bundle.schema.json self-evaluation/kriterion-self-eval-bundle.json
python tools/validate_json.py schemas/evaluation-result.schema.json self-evaluation/kriterion-self-eval-result.json
python tools/verify_execution_chain.py --input self-evaluation/kriterion-self-eval-bundle.json --result self-evaluation/kriterion-self-eval-result.json
python tools/reference_runner.py self-evaluation/kriterion-self-eval-bundle.json --output /tmp/kriterion-self-eval-result.repro.json
diff -u self-evaluation/kriterion-self-eval-result.json /tmp/kriterion-self-eval-result.repro.json
```

The repository interface uses a constrained black / white / signal-red visual system, documented in:
- docs/REPO_VISUAL_SYSTEM_2026.md
- docs/DESIGN_SOURCES_2026.md
The design is intentionally severe, high-contrast, and non-template.
Core business documents:
- METHODOLOGY.md
- THREAT_MODEL_FOR_AI_EVALUATION.md
- SOVEREIGN_TERMS.md
- FIELD_REPORT.md
- COGNITIVE_FINGERPRINT.md
Commercial and go-to-market documents:
- business/LINKEDIN_POST_01_PRIMARY.md
- business/LINKEDIN_POST_02_EVIDENCE_CONFIDENCE.md
- business/LINKEDIN_POST_03_ANTI_GAMING.md
- business/LINKEDIN_POST_04_UKRAINE_CONTEXT.md
- business/GUMROAD_PRODUCT_LISTING.md
- business/COMMERCIAL_INQUIRY_EMAIL.md
- business/GITHUB_METADATA.md
- business/TIMESTAMP_PRIORITY_EVIDENCE_BRIEF.md