A complete Python benchmarking and evaluation framework for LLM systems. Supports GSM8K, ARC, CMT, and custom datasets with grading, visualization, and PDF report generation.
- Prompt and dataset loading (JSONL / Hugging Face)
- Automated evaluation runners with concurrency
- Grading and metric calculation
- Visualization (matplotlib) and PDF reports (reportlab + pypdf)
- CMT extraction from PDFs
- Reproducible benchmark pipelines
- Clone and install

  ```bash
  cd syntra-testing-refactor
  python3 -m venv venv
  source venv/bin/activate  # or venv\Scripts\activate on Windows
  pip install -r requirements.txt
  pip install -e .
  ```
- Configure (optional)

  ```bash
  export OPENAI_API_KEY="sk-..."
  export HF_TOKEN="hf_..."  # for datasets
  mkdir -p data/ runs/
  ```
- Run a benchmark

  ```bash
  # Run full evaluation
  python -m src.syntra_testing.runners.eval_runner --dataset gsm8k --output runs/gsm8k/

  # Generate visualizations and PDF report
  python -m src.syntra_testing.tools.visualization.viz_hf_cmt --input runs/ --output runs/report.pdf
  ```
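At its core, a concurrent evaluation runner is a thread pool mapped over prompts. A sketch assuming a hypothetical `model_fn` callable that makes one provider request per prompt (not the runner's actual interface):

```python
from concurrent.futures import ThreadPoolExecutor

def run_eval(prompts: list[str], model_fn, max_workers: int = 8) -> list[str]:
    """Evaluate prompts concurrently.

    model_fn is a placeholder for the provider call (e.g. a chat
    completion request); threads suit this I/O-bound workload.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with prompts
        return list(pool.map(model_fn, prompts))
```

Because `map` keeps results in input order, graded outputs can be zipped directly back against the dataset rows.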
- Run tests

  ```bash
  pytest tests/ -q
  ```
- src/syntra_testing/: Core package
- tools/: Benchmark and visualization tools
- prompts/: Prompt templates
- benchmarks/: Configuration and results
- runs/: Output directory (gitignored)
Set environment variables for providers:
- OPENAI_API_KEY
- ANTHROPIC_API_KEY
- HF_TOKEN (for datasets)
See FIXES.md for troubleshooting.
MIT