LLM Evaluation on JEE Advanced 2025 Question Papers
jeeBench benchmarks Large Language Models (LLMs) on Joint Entrance Examination (JEE) Advanced 2025 papers. It extracts questions from PDFs, evaluates them with multiple AI providers (Anthropic, OpenAI, Google), and generates performance analytics.
```
pip install -r requirements.txt
```

Copy `.env.local` to a `.env` file and add your API keys:

```
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
```

```
data/
├── inputs/
│   ├── question_papers/   # Place JEE PDF papers here
│   ├── syllabus/          # jee_syllabus.json
│   └── scoring/           # jee_2025_scoring.json
└── outputs/               # Results are generated here
```
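Before running the pipeline, it can help to verify that the required keys are actually set. A minimal sketch (the `missing_keys` helper is hypothetical, not part of the repository):

```python
import os

# Keys the three built-in providers expect, per the .env template above
REQUIRED_KEYS = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"]

def missing_keys(env=os.environ):
    """Return the provider keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

# Example: only the OpenAI key is configured
print(missing_keys({"OPENAI_API_KEY": "sk-..."}))
```

If you only plan to evaluate one provider, a warning for the others is enough; the scripts themselves only need the key for the provider you pass on the command line.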
- Extract → PDF papers become structured question data
- Solve → AI models answer questions, get scored using JEE rules
- Analyze → Generate comparative performance reports
Output: Excel file with complete model comparison and AIR rankings based on human topper scores.
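One way to turn a model's total marks into an approximate AIR is to interpolate between published (marks, rank) points from human results. A sketch of that idea, where the anchor values below are made-up placeholders, not the repository's actual scoring data:

```python
import bisect

# Hypothetical (marks, AIR) anchor points; the real mapping lives in the
# repository's scoring inputs and these numbers are illustrative only.
ANCHORS = [(100, 10000), (150, 3000), (200, 800), (250, 100), (300, 1)]

def estimate_air(marks: int) -> int:
    """Linearly interpolate an AIR between the surrounding anchor points."""
    if marks <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if marks >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    xs = [m for m, _ in ANCHORS]
    i = bisect.bisect_right(xs, marks)
    (m0, r0), (m1, r1) = ANCHORS[i - 1], ANCHORS[i]
    return round(r0 + (r1 - r0) * (marks - m0) / (m1 - m0))

print(estimate_air(175))  # midway between (150, 3000) and (200, 800) -> 1900
```

Because rank falls steeply with marks near the top, piecewise-linear interpolation between close anchors is a rough but serviceable approximation.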
```
python 01_extract_questions_to_json.py
```
- Processes PDF papers in `data/inputs/question_papers/`
- Creates question metadata and images
- Output: `question_metadata_jee_2025.json`
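Once the metadata JSON exists, it can be sliced by subject or question type for spot checks. The field names below (`id`, `subject`, `type`, `image`) are assumptions about the schema, shown inline rather than loaded from disk:

```python
import json

# Hypothetical records mimicking entries in question_metadata_jee_2025.json;
# the actual field names may differ.
sample = json.loads("""[
  {"id": "P1_Q1", "subject": "Physics",   "type": "MCQ",       "image": "P1_Q1.png"},
  {"id": "P1_Q2", "subject": "Chemistry", "type": "Numerical", "image": "P1_Q2.png"}
]""")

# Filter to a single subject for a quick sanity check
physics = [q for q in sample if q["subject"] == "Physics"]
print(len(physics))
```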
```
# Single model
python 02_solve_questions.py --provider anthropic --model claude-sonnet-4-20250514

# All models from a provider
python 02_solve_questions.py --provider openai --model all

# All providers and models
python 02_solve_questions.py --provider all

# Faster parallel processing
python 02_solve_questions.py --provider anthropic --parallel
```
- Sends questions to AI models
- Evaluates responses using JEE scoring rules
- Output: individual result JSON files per model
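The "partial credit" part of JEE scoring matters most for multiple-correct questions. A simplified sketch of the commonly published scheme (the repository's `jee_2025_scoring.json` is authoritative; verify the exact 2025 values there):

```python
def score_multiple_correct(chosen: set, correct: set) -> int:
    """Simplified JEE Advanced multiple-correct scoring:
    +4 if all correct options are chosen, +1 per correct option chosen
    when no wrong option is chosen, -2 if any wrong option is chosen,
    0 if the question is left blank."""
    if not chosen:
        return 0
    if chosen - correct:      # any wrong option picked -> negative marking
        return -2
    if chosen == correct:     # full credit
        return 4
    return len(chosen)        # partial credit, correct subset only

print(score_multiple_correct({"A", "C"}, {"A", "C", "D"}))  # 2
```

Note how one wrong option wipes out partial credit entirely; this asymmetry is what makes honest abstention ("blank") a scoring-relevant model behavior.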
```
python 03_consolidate_jee_results.py
```
- Consolidates all model results
- Creates a comprehensive Excel report with:
  - Overall model comparison with AIR rankings
  - Subject-wise performance (Physics, Chemistry, Math)
  - Unit-wise analysis
  - Cost efficiency metrics
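The core of the comparison step is ranking models by total marks, where a higher score means a better (lower) rank, as in AIR. A sketch with pandas, using made-up scores and assumed column names:

```python
import pandas as pd

# Hypothetical consolidated results; model names and scores are placeholders
results = pd.DataFrame({
    "model": ["claude-sonnet-4", "gpt-4o", "gemini-2.5-pro"],
    "total_score": [248, 231, 255],
})

# Higher score -> better (lower) rank, mirroring how AIR is assigned
results["rank"] = results["total_score"].rank(ascending=False).astype(int)
print(results.sort_values("rank"))
```

From there, `DataFrame.to_excel` (with `openpyxl` installed) writes the comparison sheet that the consolidation script produces.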
- Anthropic: Claude Sonnet 4, Claude Opus 4
- OpenAI: GPT-4o, GPT-4.1, o3, o4-mini, GPT-5
- Google: Gemini 2.5 Flash, Gemini 2.0 Flash, Gemini 2.5 Pro
- xAI: Grok 4
- Question Types: MCQ, Multiple Correct, Numerical, Pair Matching
- JEE Scoring: Official scoring rules with partial credit
- Analytics: Performance by subject/unit/difficulty + cost tracking
- Parallel Processing: Faster evaluation with the `--parallel` flag
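Since each question is an independent, I/O-bound API call, fanning them out over a thread pool is the natural shape for `--parallel`. A sketch of that pattern, where `solve_one` is a hypothetical stand-in for the real provider call:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_one(question_id: str) -> str:
    """Placeholder for an API call that answers one question."""
    return f"answer-for-{question_id}"

questions = [f"Q{i}" for i in range(1, 6)]

# Threads suit I/O-bound API calls; map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(solve_one, questions))
print(answers)
```

In practice `max_workers` should respect the provider's rate limits, since too much concurrency just trades latency for 429 errors.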
Contributions to expand jeeBench are welcome! Here are areas where you can help:
- Older JEE papers
- NEET (National Eligibility cum Entrance Test)
- GATE (Graduate Aptitude Test in Engineering)
- CAT/GMAT (Management entrance exams)
- SAT/GRE (International standardized tests)
- Meta: Llama 3.2/3.3 models
- Mistral: Mistral Large, Mixtral 8x7B
- Sarvam-M
- Fractal Fathom-R1-14B
- Dhanishtha-2.0-preview
- Other Models: Qwen, DeepSeek, etc.
Fork the repository, make your changes, and submit a pull request!
