This workspace implements a production-oriented starter for a hybrid Commercial Due Diligence (CDD) RL environment using Prime Intellect patterns:

- Environment package under `environments/cdd_hybrid/`
- Data pipeline under `scripts/`
- Core validation and metrics under `src/cdd_prime/`
- Reproducible configs under `configs/`
- Tests under `tests/`
This project provides a practical foundation for training and evaluating LLM agents on CDD-style reasoning with strict anti-leakage controls and measurable decision quality. It is designed for Prime Intellect workflows (verifiers, prime eval, and prime-rl) while staying runnable locally.
Train/evaluate models to perform CDD-like reasoning with two reward families:
- Process quality: workstream coverage, evidence usage, internal consistency, output format quality.
- Outcome quality: calibrated probability prediction and decision utility against realized outcomes.
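To make the two reward families concrete, here is a minimal sketch of how they might combine. The field names, weights, and the Brier-based outcome term are illustrative assumptions, not the environment's actual reward code:

```python
def brier_outcome_reward(p_success: float, realized: int) -> float:
    """Outcome quality: 1 minus the Brier score of the predicted
    success probability against the realized 0/1 outcome."""
    return 1.0 - (p_success - realized) ** 2

def hybrid_reward(process_scores: dict, p_success: float, realized: int,
                  w_process: float = 0.5) -> float:
    """Blend process-quality rubric scores (each in [0, 1]) with
    outcome quality; the 50/50 weighting is a placeholder, not the
    repo's default."""
    process = sum(process_scores.values()) / len(process_scores)
    outcome = brier_outcome_reward(p_success, realized)
    return w_process * process + (1.0 - w_process) * outcome
```

Because the outcome term is a proper scoring rule, a policy cannot improve it by hedging all predictions toward 0.5; calibration is rewarded directly.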
- Validate toolchain and Prime OpenAPI contracts (`scripts/check_toolchain.py`, `scripts/check_prime_openapi_contract.py`).
- Build a historical deal universe (`scripts/expand_deals_from_wikipedia.py`, `scripts/merge_deal_sources.py`).
- Enrich with realized outcomes using market data (`scripts/build_outcomes.py`).
- Ingest richer text evidence snippets (`scripts/enrich_text_evidence.py`).
- Build pre-deal-only packets and prompts (`scripts/build_packets.py`).
- Split and validate (`scripts/split_dataset.py`, `scripts/validate_dataset.py`).
- Run the baseline and evaluate (`scripts/run_heuristic_baseline.py`, `scripts/evaluate_predictions.py`, `scripts/evaluate_group_policy.py`).
- Run memorization probes and judge scoring (`scripts/run_memorization_probe.py`, `scripts/run_model_judge.py`).
- Use `environments/cdd_hybrid/` with `prime env install` + `prime eval run`.
Smoke pipeline:

```
set -a; source .env.local; set +a
./scripts/smoke_pipeline.sh
```

Full pipeline:

```
set -a; source .env.local; set +a
./scripts/full_pipeline.sh
```

This runs:
- Toolchain lock checks
- Prime OpenAPI contract checks
- Wikipedia expansion + source merge
- Outcome enrichment + rich text evidence ingestion + packetization
- Split + leakage validation + baseline metrics + group metrics + regression gate + tests
Pinned versions are in `toolchain.lock.toml`. Validate:

```
python3 scripts/check_toolchain.py
```

- Environment source: `environments/cdd_hybrid/cdd_hybrid.py`
- Environment package metadata: `environments/cdd_hybrid/pyproject.toml`
- Eval config template: `configs/eval/cdd_hybrid.toml`
- Prime-RL config template: `configs/prime-rl/cdd_hybrid.toml`
On the currently installed stack (prime 0.4.x + verifiers 0.1.5), local evaluation runs through `vf-eval`:

```
PRIME_API_KEY=... vf-eval cdd_hybrid \
  -m qwen/qwen3-235b-a22b-instruct-2507 \
  -b https://api.pinference.ai/api/v1 \
  -k PRIME_API_KEY \
  -n 2 -r 1 \
  -a '{"dataset_path":"data/processed/train.jsonl","eval_dataset_path":"data/processed/test.jsonl"}' \
  -s
```

Small-model matrix benchmark:

```
set -a; source .env.local; set +a
python3 scripts/run_model_matrix.py --dataset data/processed/test.jsonl --limit 24
```

Outputs:

- `data/interim/model_matrix/benchmark_report.md`
- `data/interim/model_matrix/benchmark_summary.json`
- Tracked copies: `reports/benchmark_report.md`, `reports/benchmark_summary.json`
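The JSON summary can be post-processed directly. A small sketch that ranks models by a metric (the per-model schema assumed here, model name mapped to a metrics dict, may differ from the actual `benchmark_summary.json`):

```python
import json
from pathlib import Path

def rank_models(summary: dict, metric: str = "mean_reward") -> list[tuple[str, float]]:
    """Sort models by a summary metric, best first. Assumes the summary
    maps model name -> {metric: value}."""
    return sorted(
        ((model, scores[metric]) for model, scores in summary.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# Typical use against the benchmark output above:
# summary = json.loads(Path("data/interim/model_matrix/benchmark_summary.json").read_text())
# for model, score in rank_models(summary):
#     print(f"{model}\t{score:.3f}")
```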
Optional multi-sample pass@k run:

```
set -a; source .env.local; set +a
python3 scripts/run_model_matrix.py \
  --dataset data/processed/test.jsonl \
  --models qwen/qwen3-8b \
  --limit 12 \
  --samples-per-deal 3 \
  --temperature 0.1 \
  --out-dir data/interim/model_matrix_passk
```

Three-seed short optimization experiments:

```
set -a; source .env.local; set +a
python3 scripts/run_seed_optimization.py \
  --dataset data/processed/train.jsonl \
  --eval-dataset data/processed/test.jsonl
```

Outputs:

- `data/interim/seed_optimization/seed_results.json`
- `data/interim/seed_optimization/seed_summary.json`
- Tracked copy: `reports/seed_summary.json`
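For the multi-sample run above, pass@k-style metrics are usually computed with the standard unbiased estimator; this is a sketch, and the repo's scripts may define success per deal differently:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n completions, of which
    c are correct, is correct."""
    if n - c < k:
        # Every size-k draw must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `--samples-per-deal 3`, this gives `n = 3` per deal; averaging `pass_at_k` over deals yields the group-level metric.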
Blinded process-quality judge (dry-run heuristic or online model judge):

```
python3 scripts/run_model_judge.py \
  --dataset data/processed/test.jsonl \
  --predictions data/interim/model_matrix/qwen-qwen3-8b.jsonl \
  --dry-run \
  --output data/interim/model_judge_results.jsonl \
  --summary-output reports/model_judge_summary.json
```

Group metrics from multi-sample predictions:

```
python3 scripts/evaluate_group_policy.py \
  --dataset data/processed/test.jsonl \
  --predictions data/interim/heuristic_predictions_test.jsonl \
  --k-values 1 \
  --output data/interim/group_metrics.json
cp data/interim/group_metrics.json reports/group_metrics.json
```

- The current dataset is expanded from public acquisition-list pages plus a curated seed set.
- Time-split evaluation and leakage checks are enforced to reduce label contamination risk.
- Memorization probing is included and can be run in `--dry-run` or online mode.
- Judge rubric scoring is blinded to realized outcomes by design.
- Public-source extraction quality depends on citation availability and page structure.
- For production, enrich evidence from issuer filings/transcripts and internal data room documents.
- Reward design is modular; adjust weights and thresholds for your IC loss function.
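As one way to plug a custom IC loss function into the modular reward, a decision rule can be derived from a calibrated probability and a payoff matrix. The utilities below are placeholders, not values from this repo:

```python
def decide(p_success: float,
           u_proceed_win: float = 1.0,
           u_proceed_lose: float = -2.0,
           u_pass: float = 0.0) -> str:
    """Proceed only when the expected utility of proceeding beats
    passing; the payoffs encode the IC's asymmetric loss and are
    illustrative placeholders."""
    eu_proceed = p_success * u_proceed_win + (1 - p_success) * u_proceed_lose
    return "proceed" if eu_proceed > u_pass else "pass"
```

With these placeholder payoffs the breakeven probability is 2/3, reflecting that a failed deal is penalized twice as heavily as a successful one is rewarded; decision utility can then score the agent's choice against the realized outcome.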