feat(jolt-eval): invariant and objective framework for AI-assisted development#1406
feat(jolt-eval): invariant and objective framework for AI-assisted development#1406moodlezoup wants to merge 87 commits intomainfrom
Conversation
…ement Introduces a framework for mechanically verifiable invariants and measurable objectives, enabling automated testing, fuzzing, AI red-teaming, and performance optimization of the Jolt zkVM. Invariants (binary pass/fail): - Soundness: mutated proofs must be rejected by the verifier - Verifier completeness: honest proofs must be accepted - Prover completeness: prover must not panic on valid inputs - Determinism: same inputs produce byte-identical proofs - Serialization roundtrip: serialize/deserialize preserves proof bytes - ZK consistency: prove+verify succeeds in current compilation mode Objectives (scalar measurements): - Peak RSS, prover time, proof size, verifier time - Guest cycle count, inline instruction lengths, wrapping cost - auto_optimize harness for AI-driven optimization loops Synthesis infrastructure: - Test generation from seed corpus + random Arbitrary inputs - Fuzz target code generation for libfuzzer_sys - Red-team harness with git worktree isolation - #[invariant(targets = [Test, Fuzz, RedTeam])] proc macro CLI binaries: check-invariants, measure-objectives, redteam Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs randomized fuzz inputs against all fuzzable invariants (or a specific one via --invariant). Supports --iterations, --duration time limits, --input-size, and --list to enumerate targets. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion Replaces the placeholder invoke_agent callback with a real implementation that: 1. Creates a detached git worktree for agent isolation 2. Builds a structured prompt with the invariant description, instructions, and any previous failed attempts 3. Invokes `claude -p` in the worktree with configurable model and max-turns 4. Captures the agent's analysis as the approach description 5. Cleans up the worktree after each iteration 6. Runs intensive fuzz checks (configurable via --num-fuzz) after each agent attempt to mechanically verify findings Also adds num_fuzz_per_iteration to RedTeamConfig so auto_redteam runs random fuzz inputs (not just seed corpus) after each attempt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the library-level auto_optimize function (which delegated agent invocation to the caller) with a standalone `optimize` binary that drives the full loop: 1. Measures baseline objectives 2. Creates an isolated git worktree per iteration 3. Invokes `claude -p` with a prompt describing the objectives, current best measurements, and past attempts 4. Captures the agent's diff and applies it to the real repo 5. Re-measures objectives and checks all invariants 6. Commits on improvement, reverts otherwise Supports --objectives (comma-separated filter), --hint (extra context for the agent), --iterations, --model, and --max-turns. Keeps OptimizationAttempt as a public type in objective/mod.rs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t backends Introduces `agent::AgentHarness` -- a trait abstracting over how an AI coding agent is invoked. Implementors control their own isolation strategy (worktrees, containers, API calls, etc.), so a multi-agent Codex harness, a remote API agent, or a parallel fan-out agent can all plug into the same red-team and optimize loops. Concrete changes: - New `src/agent.rs` with `AgentHarness` trait, `AgentResponse`/ `AgentError` types, `ClaudeCodeAgent` implementation, and worktree / diff-apply utilities. - `auto_redteam` now takes `&dyn AgentHarness` + `repo_dir` instead of an `FnMut` callback. Prompt construction stays inside the library. - `RedTeamConfig` trimmed to orchestration-only fields (iterations, fuzz count); model/max-turns live on the agent. - Both `bin/redteam.rs` and `bin/optimize.rs` construct a `ClaudeCodeAgent` from CLI args and pass it through the trait. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds `MockAgent` to the library with three constructors: - `always_ok(text)` -- repeats a fixed success response - `always_err(msg)` -- repeats a fixed error - `from_responses(vec)` -- drains a queue, then repeats the last entry MockAgent records every prompt it receives, accessible via `recorded_prompts()`. 21 new tests covering: - MockAgent basics: ok, err, prompt recording, repeat behavior, response queuing, diff passthrough - auto_redteam integration: no-violation path, immediate violation, fuzz-triggered violation, agent error handling, zero iterations, mixed ok/err responses - Prompt construction: invariant description inclusion, failed attempt accumulation across iterations - Trait extensibility: object safety, Arc wrapping, custom `FirstSuccessHarness` multi-agent implementation plugging directly into auto_redteam Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extracts the optimization loop from bin/optimize.rs into a testable library function `objective::optimize::auto_optimize` that takes: - `&dyn AgentHarness` for the AI agent - `&mut dyn OptimizeEnv` for side effects (measure, check invariants, apply diff, accept/reject) `OptimizeEnv` trait lets tests swap in a `MockOptimizeEnv` with controllable measurement sequences, invariant pass/fail schedules, and recorded side effects. 14 new optimization tests covering: - Accept on improvement (Minimize and Maximize directions) - Reject on regression - Reject when invariants fail despite improvement - Multi-iteration progressive improvement with mixed accept/reject - Early stop when agent produces no diff or errors - Zero iterations - Multiple objectives (any-improves-triggers-accept) - Prompt content: measurements, hint, past attempts - Diff application recording - Invariant failure mid-sequence Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… single source of truth Adds `BUILTIN_INVARIANT_NAMES` const and `names()`/`names_for_target()` methods to `SynthesisRegistry`. All four CLI binaries now derive their `--list` output and error messages from these instead of duplicating name strings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…registries Each invariant and objective module now calls `inventory::submit!` with an entry struct (`InvariantEntry` / `ObjectiveEntry`) containing the name, metadata, and a factory function. At runtime, `inventory::iter` discovers all registered entries automatically. Key changes: - `InvariantEntry` (name, targets, build fn) + `inventory::collect!` in invariant/mod.rs; each of the 6 invariant files submits one. - `ObjectiveEntry` (name, direction, build fn) + `inventory::collect!` in objective/mod.rs; each of the 7 objective files submits one. - `SynthesisRegistry::from_inventory(test_case, inputs)` replaces manual `register_invariants` functions in all binaries. - `build_objectives_from_inventory(setup, inputs)` and `measure_dyn` replace manual `build_objectives` functions. - `invariant_names()` replaces `BUILTIN_INVARIANT_NAMES` const -- names are derived from the inventory, not hardcoded. - All 5 CLI binaries simplified: no more per-binary registration boilerplate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
auto_redteam now asks the agent to produce a concrete counterexample
as a JSON code block, deserializes it into the invariant's Input type
via serde, and runs Invariant::check on that specific input.
Key changes:
- Invariant::Input gains Serialize + DeserializeOwned bounds;
Invariant::Setup gains 'static. All 6 built-in Input types derive
Serialize + Deserialize.
- DynInvariant gains three methods:
input_json_example() — serializes a seed-corpus item for the prompt
dyn_setup() — type-erased setup (Box<dyn Any>, call once)
check_json_input() — deserialize JSON → Input, run check()
- CheckJsonResult enum distinguishes Pass / Violation / BadInput
(deserialization failure) so parse errors are never mistaken for
real counterexamples.
- auto_redteam builds a prompt that includes the JSON schema example
and asks the agent to end with a ```json code block. The loop
extracts JSON via extract_json(), feeds it through check_json_input,
and returns Violation with the input_json on success.
- RedTeamResult::Violation now carries input_json alongside approach
and error.
- RedTeamConfig trimmed: num_fuzz_per_iteration removed (the loop
checks the single agent-produced input, not random fuzz).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The full text belongs in FailedAttempt.approach so callers can decide how to display it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Give ConstantObjective a configurable name and direction so multiple instances don't collide in the HashMap. test_measure_objectives now passes two objectives and asserts both values appear in the result. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…d red-team output
ClaudeCodeAgent::invoke_structured now passes --output-format json and
--json-schema to the Claude CLI, so the response is guaranteed to
conform to the schema. The CLI envelope's `structured_output` field
is extracted directly — no fragile regex/extract_json parsing needed
for agents that support it.
Key changes:
- AgentHarness gains invoke_structured(repo_dir, prompt, schema) with
a default that falls back to invoke().
- ClaudeCodeAgent overrides invoke_structured: passes the schema via
--json-schema, parses the CLI JSON envelope, extracts
structured_output (or result as fallback).
- Invariant::Input gains a JsonSchema bound (schemars 0.8); all 6
Input types derive JsonSchema.
- DynInvariant gains input_json_schema() → serde_json::Value, backed
by schemars::schema_for! in the blanket impl.
- auto_redteam builds an envelope schema
{"analysis": string, "counterexample": <Input schema>} and calls
invoke_structured. Response is parsed as the envelope first; if
that fails (agent doesn't support structured output), falls back
to extract_json on free-form text.
- Prompt updated: tells the agent to respond with {analysis,
counterexample} JSON instead of a code block.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds fuzz/Cargo.toml and one fuzz target per built-in invariant, each
a single-line call to the new `fuzz_invariant(name, data)` library
function. Adding a new fuzz target is one file:
#![no_main]
use libfuzzer_sys::fuzz_target;
fuzz_target!(|data: &[u8]| {
jolt_eval::invariant::synthesis::fuzz::fuzz_invariant("my_invariant", data);
});
Run with:
cd jolt-eval && JOLT_FUZZ_ELF=path/to/guest.elf cargo fuzz run soundness
fuzz_invariant() loads the ELF from JOLT_FUZZ_ELF, builds the
inventory-registered invariant and its setup once (LazyLock), then
interprets each fuzz input as raw JSON fed through check_json_input.
Also:
- Removes the _fuzz_check codegen from the #[invariant] proc macro
(redundant now that fuzz targets call the library function directly)
- Removes generate_fuzz_target() string-template codegen
- Adds SynthesisRegistry::into_invariants() for ownership transfer
- Tightens Invariant::Setup to Send + Sync + 'static (needed for
the LazyLock cache in fuzz_invariant)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tional Introduces a `guests` module with a fixed catalog of known guest programs (muldiv, fibonacci, sha2, sha3, collatz, alloc), each with its memory config, max trace length, and default inputs extracted from the #[jolt::provable(...)] attributes. `GuestSpec::compile()` invokes `host::Program::new(package).build()` to cross-compile the guest to RISC-V automatically. All 5 CLI binaries now accept `--guest <name>` (compiles automatically) as the primary interface, with `--elf <path>` preserved as a fallback for custom ELFs. The shared `guests::resolve_test_case()` helper handles both paths. `InvariantEntry::build` and `ObjectiveEntry::build` now take `Option<Arc<TestCase>>` / `Option<&SharedSetup>` respectively, with a `needs_guest: bool` flag. `SynthesisRegistry::from_inventory` and `build_objectives_from_inventory` skip entries that need a guest when `None` is passed. This lays the groundwork for invariants/objectives that operate on lower-level primitives (e.g. polynomial binding) and don't require a compiled guest program. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ynomial fuzz targets Remove the 6 generic proof-level invariants (soundness, completeness, determinism, serialization_roundtrip, zk_consistency) that required a guest ELF and were slow to exercise. Replace with 2 focused, guest-free invariants that fuzz GruenSplitEqPolynomial::bind against DensePolynomial reference implementations for both LowToHigh and HighToLow binding orders. Also makes JOLT_FUZZ_ELF optional so guest-free invariants work without it, uses DEFAULT_MAX_*_ADVICE_SIZE constants in guest configs, and adds the invariant/objective design spec. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace inventory::submit!/collect!/iter with explicit arrays in registered_invariants() and registered_objectives(). Each module's entry is now listed directly in its parent mod.rs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The fuzz binary was a random-input runner largely duplicating check-invariants --num-random. The libfuzzer scaffolding in jolt-eval/fuzz/ is kept for cargo-fuzz usage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Invariants enum Remove the DynInvariant trait, SynthesisRegistry, and InvariantEntry in favor of a JoltInvariants enum with match-based method dispatch. auto_redteam is now generic over I: Invariant, eliminating the need for type erasure in the red-team path and keeping tests simple. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t macro Replace the check-invariants binary with macro-generated #[test] functions. The #[invariant] macro now unconditionally generates seed_corpus and random_inputs tests, with iteration count configurable via JOLT_RANDOM_ITERS env var (default 10). Removed the unused redteam description generation from the macro. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntation Replace fuzz_invariant() function with a fuzz_invariant!() declarative macro that generates the full fuzz_target! expansion with OnceLock-based setup caching. Each fuzz target is now 3 lines. Change SplitEqBindInput to store w/rs challenge vectors directly as Vec<u128> with a manual Arbitrary impl, removing the seed-based indirection. Remove now-dead type-erasure methods (dyn_setup, check_json_input, check_arbitrary_input, CheckJsonResult) from JoltInvariants. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ro-driven Move Invariant::targets() to a separate InvariantTargets trait with a default empty-set implementation. The #[invariant] macro now accepts target lists (e.g. #[invariant(Test, Fuzz)]) and generates the InvariantTargets impl and conditionally emits #[test] blocks only when Test is listed. The fuzz_invariant! macro asserts that the invariant includes SynthesisTarget::Fuzz at init time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- agent/mod.rs: AgentHarness trait, types, apply_diff, truncate - agent/claude.rs: ClaudeCodeAgent, worktree helpers - agent/mock.rs: MockAgent for testing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Don't pass --verbose with --output-format json — it turns stdout into a JSONL event stream that can't be parsed as a single result object. Also parse the JSON envelope even on non-zero exit codes, since Claude may still produce structured_output on soft errors like max_turns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also fix envelope parsing fallback: when extract_json finds the full envelope JSON from markdown, try parsing it as an envelope first to correctly extract the counterexample field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The soundness invariant lets a red-team agent produce a unified diff against guest-sandbox/, which is applied in-place (with RAII revert), compiled via jolt build, proved honestly, then verified against the agent's dishonest (output, panic) claim. The verifier must reject. Adds TestCase::verify_with_claims to override output bytes and panic flag independently. Patch filtering drops hunks with '..' paths to prevent sandbox escapes. Includes filter_patch unit tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Change Invariant::check to return Result<(), CheckError> where CheckError has Violation and InvalidInput variants. InvalidInput is silently skipped in fuzz targets, macro-generated tests, and run_checks; recorded as a failed attempt in redteam. Also refactor guests.rs into guests/ directory (mod.rs + catalog.rs) and add GuestMemoryConfig to SoundnessInput with validation limits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add guest-sandbox to workspace members so jolt build can find it. Fix guest to return u32 (Vec<u8> lacks Serialize in no_std). Wrap Program::build in catch_unwind to handle compilation panics. Add tests for memory config validation, patch filtering, and the full compile→prove→verify path. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Warning This PR has more than 500 changed lines and does not include a spec. Large features and architectural changes benefit from a spec-driven workflow. If this PR is a bug fix, refactor, or doesn't warrant a spec, feel free to ignore this message. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
False positive — libfuzzer-sys is used via the fuzz_invariant! macro which expands to libfuzzer_sys::fuzz_target!. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each OptimizationObjective now declares diff_paths() specifying which file paths an optimizer may modify. ObjectiveFunction::diff_scope() computes the union of all input objectives' paths as DiffScope::Include. - Static analysis + binding objectives: jolt-core/ - NaiveSortTime: jolt-eval/src/sort_targets.rs - Remove diff_scope from OptimizeConfig (now derived automatically) - Fix unwrap() → unwrap_or(f64::INFINITY) in const ObjectiveFunctions Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add description() to StaticAnalysisObjective, PerformanceObjective, OptimizationObjective
- Optimize prompt: add "What you are optimizing" section with descriptions and units
- Optimize prompt: show units alongside measurement values
- Optimize prompt: include first 500 chars of each past attempt's diff
- Optimize prompt: add anti-repetition steering after previous attempts
- Optimize prompt: derive targeted reading guidance from objective's diff_paths
- Redteam prompt: always include JSON schema (not just when seed example exists)
- Redteam prompt: truncate approach text to 200 chars in previous attempts
- Redteam prompt: show "Iteration {current}/{total}" counter
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add description() with default (returns name()) to the Objective trait. Each concrete type now owns its description, and the enums delegate instead of hardcoding strings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the manual match statement in the redteam binary with a red_team() method on JoltInvariants that uses the dispatch! macro, consistent with name(), description(), targets(), and run_checks(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…normalization Objectives have different units and magnitudes (e.g. LLOC ~5500 lines vs cognitive complexity ~4.0), making weighted combinations unwieldy. - Add baseline() to StaticAnalysisObjective, PerformanceObjective, and OptimizationObjective returning a reference scale - Add normalized(obj, measurements) free function: value / baseline, yielding a dimensionless ratio where 1.0 ≈ current typical value - Add test demonstrating a balanced 50/50 weighted composite that treats a 10% improvement in any objective equally Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded baseline() values with dynamic baselines captured from the first env.measure() call at the start of auto_optimize. - ObjectiveFunction.evaluate now takes (measurements, baselines) - normalized(obj, measurements, baselines) divides by the baseline measurement instead of a hardcoded constant - Remove baseline() from all objective enums - auto_optimize passes initial measurements as baselines to evaluate - Single-objective functions ignore baselines (use |m, _|) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new jolt-eval crate that provides mechanically checkable invariants and measurable objectives for AI-assisted red-teaming and iterative optimization of the Jolt zkVM codebase, with supporting CLIs and benchmark/fuzz synthesis tooling.
Changes:
- Adds an invariant framework (incl.
#[invariant(...)]proc macro) plus red-team and fuzz harnesses. - Adds an objective framework spanning static analysis and Criterion-backed performance measurement, plus an optimization loop (
optimize). - Adds end-to-end sort targets/tests and a
sync_targets.shhelper to keep fuzz/bench targets in sync.
Reviewed changes
Copilot reviewed 50 out of 55 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| jolt-eval/sync_targets.sh | Generates/removes fuzz targets and syncs bench entries in Cargo manifests. |
| jolt-eval/src/sort_targets.rs | Adds naive and intentionally-buggy sort functions as e2e targets. |
| jolt-eval/src/sort_e2e.rs | Implements red-team + optimize e2e harnesses and tests around sorting. |
| jolt-eval/src/objective/synthesis.rs | Adds bench_objective! macro to generate Criterion harnesses. |
| jolt-eval/src/objective/performance/prover_time.rs | Adds a prover-time objective implementation for guest programs. |
| jolt-eval/src/objective/performance/mod.rs | Exposes performance objective modules. |
| jolt-eval/src/objective/performance/binding.rs | Adds polynomial binding performance objectives and keys. |
| jolt-eval/src/objective/optimize.rs | Implements iterative AI optimization loop with git accept/reject behavior. |
| jolt-eval/src/objective/objective_fn/mod.rs | Adds registered ObjectiveFunctions and diff scoping. |
| jolt-eval/src/objective/mod.rs | Defines Objective traits and type-safe objective keys/enums. |
| jolt-eval/src/objective/code_quality/mod.rs | Exposes static-analysis objective modules. |
| jolt-eval/src/objective/code_quality/lloc.rs | Implements LLOC static analysis objective via rust-code-analysis. |
| jolt-eval/src/objective/code_quality/halstead_bugs.rs | Implements Halstead “bugs” static analysis objective. |
| jolt-eval/src/objective/code_quality/cognitive.rs | Implements cognitive complexity static analysis objective. |
| jolt-eval/src/lib.rs | Adds crate root exports and module wiring. |
| jolt-eval/src/invariant/synthesis/redteam.rs | Implements structured JSON red-team loop and prompt construction. |
| jolt-eval/src/invariant/synthesis/mod.rs | Exposes invariant synthesis submodules. |
| jolt-eval/src/invariant/synthesis/fuzz.rs | Adds fuzz_invariant! macro to create libFuzzer targets. |
| jolt-eval/src/invariant/split_eq_bind.rs | Adds split-eq binding invariants with Test/Fuzz/RedTeam targets. |
| jolt-eval/src/invariant/soundness.rs | Adds a soundness red-team invariant using sandbox patching + prove/verify. |
| jolt-eval/src/invariant/mod.rs | Defines core invariant traits, error types, and invariant registry/dispatch. |
| jolt-eval/src/invariant/macro_tests.rs | Adds invariants used to test macro synthesis behavior. |
| jolt-eval/src/guests/sha2_chain.rs | Adds SHA-2 chain guest config for prover-time benchmarking. |
| jolt-eval/src/guests/secp256k1_ecdsa.rs | Adds secp256k1 ECDSA verify guest config for benchmarking. |
| jolt-eval/src/guests/mod.rs | Adds guest plumbing for proving/verifying and guest config trait. |
| jolt-eval/src/guests/fibonacci.rs | Adds Fibonacci guest config for benchmarking. |
| jolt-eval/src/agent/tests.rs | Adds extensive tests for agent harness + optimize/redteam loops. |
| jolt-eval/src/agent/mod.rs | Adds agent harness trait, diff scoping, and diff application helper. |
| jolt-eval/src/agent/mock.rs | Adds a mock agent for deterministic testing. |
| jolt-eval/src/agent/claude.rs | Adds a Claude Code CLI-backed agent using isolated git worktrees. |
| jolt-eval/README.md | Documents the jolt-eval concepts, invariants, objectives, and CLIs. |
| jolt-eval/macros/src/lib.rs | Implements #[invariant(...)] proc macro (targets + optional tests). |
| jolt-eval/macros/Cargo.toml | Declares the proc-macro crate and dependencies. |
| jolt-eval/guest-sandbox/src/main.rs | Adds guest sandbox entrypoint. |
| jolt-eval/guest-sandbox/src/lib.rs | Adds default sandbox provable function for soundness red-teaming. |
| jolt-eval/guest-sandbox/README.md | Documents the sandbox and patch filtering safety model. |
| jolt-eval/guest-sandbox/Cargo.toml | Declares the sandbox guest package. |
| jolt-eval/guest-sandbox/.gitignore | Ignores sandbox build output. |
| jolt-eval/fuzz/fuzz_targets/split_eq_bind_low_high.rs | Adds fuzz target for split-eq bind Low→High invariant. |
| jolt-eval/fuzz/fuzz_targets/split_eq_bind_high_low.rs | Adds fuzz target for split-eq bind High→Low invariant. |
| jolt-eval/fuzz/Cargo.toml | Adds fuzz workspace crate and bin targets for fuzzing invariants. |
| jolt-eval/fuzz/.gitignore | Ignores fuzz build artifacts/corpus. |
| jolt-eval/Cargo.toml | Declares jolt-eval crate deps, bins, and bench targets. |
| jolt-eval/bin/redteam.rs | Adds redteam CLI for AI red-team sessions. |
| jolt-eval/bin/optimize.rs | Adds optimize CLI for AI optimization runs against objective functions. |
| jolt-eval/bin/measure_objectives.rs | Adds measure-objectives CLI for static + perf objective measurement. |
| jolt-eval/benches/prover_time_sha2_chain.rs | Adds Criterion bench harness for prover time (SHA2 chain). |
| jolt-eval/benches/prover_time_secp256k1_ecdsa_verify.rs | Adds Criterion bench harness for prover time (ECDSA verify). |
| jolt-eval/benches/prover_time_fibonacci.rs | Adds Criterion bench harness for prover time (Fibonacci). |
| jolt-eval/benches/bind_parallel_low_to_high.rs | Adds Criterion bench harness for binding (Low→High). |
| jolt-eval/benches/bind_parallel_high_to_low.rs | Adds Criterion bench harness for binding (High→Low). |
| Cargo.toml | Adds new workspace members and excludes fuzz sub-workspace. |
| .gitignore | Ignores .omc/. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…tterns - Delete all section separator comments (// ===, // ---, // ──) - Simplify StaticAnalysisObjective::diff_paths (all variants return the same value, no match needed) - Extract read_criterion_estimate into objective::performance with baseline parameter, removing duplicate in both binaries - Replace manual diff_paths dedup in prompt builder with objective.diff_scope() call Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The SortOptimizeEnv previously used an in-process function pointer swap to simulate optimization (can't recompile at runtime). Now it shells out to `cargo run --bin optimize -- --measure`, which recompiles the modified sort_targets.rs and measures the actual modified code. - Add --measure flag to optimize binary: measures objective inputs and prints JSON to stdout - Register MINIMIZE_NAIVE_SORT_TIME as a proper ObjectiveFunction - Add measure_perf() helper for per-variant perf measurement (Criterion for bindings, direct timing for NaiveSortTime) - SortOptimizeEnv::measure() shells out to the binary - SortOptimizeEnv::apply_diff() just patches the file (no fn swap) - Replace mock tests with MockEnv (predetermined measurements) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the NaiveSortTime fieldless variant and SortOptimizeEnv subprocess hack with a standard Criterion-based performance objective: - Add NaiveSortObjective in sort_targets.rs implementing Objective with setup()/run() — same pattern as BindLowToHighObjective - Add benches/naive_sort_time.rs using bench_objective! macro - NaiveSortTime now wraps NaiveSortObjective (data-carrying variant) - Include NaiveSortTime in PerformanceObjective::all() - Delete SortOptimizeEnv, run_optimize_test, and mock optimize tests from sort_e2e.rs (redundant with agent/tests.rs) - Rewrite --test in optimize binary to use RealEnv directly - RealEnv::measure() handles naive_sort via cargo bench like all other perf objectives — no special-casing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Review the following changes in direct dependencies. Learn more about Socket for GitHub.
|
…--measure Move NaiveSortObjective from sort_targets.rs (which the optimizer agent can modify) to objective/performance/naive_sort.rs (which it cannot). This prevents the agent from cheating by modifying the objective itself. Also remove --measure flag from optimize binary — no longer needed now that the sort objective is a proper Criterion benchmark measured via RealEnv like all other perf objectives. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sage The error reported input/output/stack<=MAX_STACK_SIZE but input and output have their own limits (MAX_INPUT_SIZE, MAX_OUTPUT_SIZE). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ObjectiveFunction example: update evaluate closure signature to |m, _b| and add missing `normalized` to the use statement - Clarify that prover_time_* benchmarks are standalone Criterion targets not tracked by optimize/measure-objectives binaries - Update guest-sandbox/README to reflect actual behavior: hash guest, in-place patching with PatchGuard git-checkout revert - Add git clean -fd after git checkout . in both rejection branches of auto_optimize so agent-added untracked files don't contaminate subsequent iterations - Capture stderr in apply_diff and include it in the AgentError message Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The harness applies patches in-place with a PatchGuard RAII guard (not copying to a temp dir), and the default guest is a wrapping hash (not an identity function). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ctory
Each auto-optimize iteration now writes diff.patch, response.md,
measurements.json, and status.json to jolt-eval/optimize-history/{objective}/{attempt-N}/.
The baseline is persisted to attempt-0/baseline/. The prompt's previous-attempts
section is slimmed down to one line per attempt pointing at the directory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tory
Mirror the optimize-history pattern for the redteam loop:
- Write approach.md and failure_reason.txt to
jolt-eval/redteam-history/{invariant}/attempt-{N}/
- Add path: Option<String> to FailedAttempt
- Slim the prompt to show description + failure reason inline,
with a pointer to the attempt directory for the full approach
- Add redteam-history/ to .gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Code quality objectives now store target_dir (e.g. "jolt-core/src")
instead of a leaked root PathBuf. The repo root is computed at runtime
via env!("CARGO_MANIFEST_DIR").parent(). This eliminates all Box::leak
calls from the code quality objectives and removes the root parameter
from StaticAnalysisObjective::all() and OptimizationObjective::all().
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Allows computed descriptions that include target_dir, e.g.: "Total logical lines of code in jolt-core/src/" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Required by the simple form of bench_objective! which calls <Type>::default(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
#[invariant(Test, Fuzz, RedTeam)]macro synthesizes tests, fuzz targets, and AI red-team harnesses from a single invariant definitionObjectivetrait for both static-analysis (LLOC, cognitive complexity, Halstead bugs) and Criterion-benchmarked performance measurementsconst-constructible struct combining objectives into a scalar for the optimizer to minimize; registered viaObjectiveFunction::all()/by_name()OptimizationObjectiveenum replaces string-keyed HashMaps; three nested enums (StaticAnalysisObjective,PerformanceObjective,OptimizationObjective) with discriminant-based Hash/Eqredteam): runs Claude in an isolated worktree to find invariant violations via structured JSON counterexamplesoptimize): iterative optimization loop — agent proposes changes, diff is applied, objectives re-measured, invariants checked, accepted/reverted with git branch + per-iteration commits--list,--test(built-in e2e sort targets),--verbose(shows prompts/responses), mutual exclusivity via clapnaive_sortoptimization target,candidate_sortred-team target with subtle bug (correct for ≤16 elements, skips last element for larger arrays), both MockAgent and#[ignore]real-agent testsTest plan
cargo nextest run -p jolt-eval— 87 tests pass, 2 skipped (real-agent#[ignore]tests)cargo clippy -p jolt-eval --all-targets— zero warningscargo run --release -p jolt-eval --bin optimize -- --list— lists 5 objective functionscargo run --release -p jolt-eval --bin redteam -- --list— lists red-teamable invariants + e2e targetscargo run --release -p jolt-eval --bin optimize -- --test --verbose— e2e sort optimization with real Claude agentcargo run --release -p jolt-eval --bin redteam -- --test --verbose— e2e sort red-team with real Claude agentcargo bench -p jolt-eval— Criterion benchmarks run🤖 Generated with Claude Code