This repository contains scripts for evaluating large language models (LLMs) on LeetCode-style coding problems. The workflow consists of generating solutions, organizing them into directories, executing them, and calculating pass@k metrics to assess model performance.
Ensure you have the following installed:
- Python 3.10+
- Required dependencies (pinned versions; `argparse` is part of the Python standard library and needs no separate install):

  ```
  pip install torch==2.3.1 transformers==4.45.0 vllm==0.5.3.post1 \
      datasets==2.21.0 wandb==0.17.6 numpy==1.26.4 pandas==2.2.2 \
      matplotlib==3.9.2 scipy==1.13.1 tqdm==4.66.5 jinja2==3.1.4
  ```
- A compatible LLM, such as `google/gemma-2-2b-it`.
The scripts expect a structured directory setup:
```
.
├── setup.py          # Creates directories and populates them with Python files
├── justgen.py        # Generates model solutions for LeetCode problems
├── timeleet.py       # Runs solutions and evaluates pass@k
├── EvalQs/qeval.json # JSON file containing evaluation questions
└── res/              # Directory where results are stored
```
Run `setup.py` to create the required directories and populate them with initial Python files:

```
python3 setup.py -srcdir <source_directory> --dst <destination_directory> --num_samples 100
```

- `<source_directory>`: Path to the directory containing the initial Python templates.
- `<destination_directory>`: Where the files will be copied.
- `--num_samples`: Number of sample folders to create (default: 10).
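A minimal sketch of what a setup step like this might do, assuming it simply copies every `.py` template from the source directory into numbered sample folders (the function name and folder naming scheme are illustrative, not taken from the repository):

```python
import shutil
from pathlib import Path


def make_sample_dirs(srcdir: str, dst: str, num_samples: int = 10) -> list:
    """Copy every .py template in srcdir into dst/sample_0 .. dst/sample_{n-1}."""
    src = Path(srcdir)
    created = []
    for i in range(num_samples):
        sample = Path(dst) / f"sample_{i}"
        sample.mkdir(parents=True, exist_ok=True)  # idempotent re-runs
        for template in src.glob("*.py"):
            shutil.copy(template, sample / template.name)
        created.append(sample)
    return created
```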
Use `justgen.py` to generate solutions for 100 LeetCode problems:

```
python3 justgen.py --model_name google/gemma-2-2b-it --max_model_len 1500 --outputdir res/eval100 --numsamples 100 --temp 0.7
```

- `--model_name`: Specifies the LLM to use.
- `--max_model_len`: Sets the max token length.
- `--outputdir`: Defines the output directory for generated solutions.
- `--numsamples`: Number of problems to solve.
- `--temp`: Temperature setting for the LLM.
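For context, a generation script like this typically wraps each question in the model's chat template before sampling. The sketch below is an assumption about how `justgen.py` might format prompts for Gemma (the instruction text and function name are illustrative); Gemma-style chat markup wraps user turns in `<start_of_turn>` / `<end_of_turn>` tokens:

```python
def build_prompt(question: str) -> str:
    """Wrap a LeetCode problem statement in a Gemma-style chat prompt."""
    instruction = (
        "Solve the following LeetCode problem in Python. "
        "Return only the code for the Solution class.\n\n"
    )
    return (
        "<start_of_turn>user\n"
        + instruction
        + question
        + "<end_of_turn>\n<start_of_turn>model\n"
    )
```

The resulting strings would then be passed to the model (e.g. via vLLM) with the temperature set by `--temp`.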
Use `timeleet.py` to run and test the generated solutions:

```
python3 timeleet.py --dirs res/eval100 --samples 100
```

- `--dirs`: Directory containing generated solutions.
- `--samples`: Number of problems to evaluate.
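A minimal sketch of how an execution harness like this could run each generated file, assuming each solution is a standalone `.py` script that exits with status 0 when its embedded test assertions pass (the function name and 10-second timeout are assumptions, not from the repository):

```python
import subprocess
import sys


def run_solution(path: str, timeout_s: float = 10.0) -> bool:
    """Return True if the script runs to completion without an error."""
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,  # guard against infinite loops
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```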
After running the solutions, pass@k values are automatically computed and stored in a CSV file:

`res/eval100_results.csv`

This file contains:
- Pass@1, Pass@5, Pass@10, Pass@20 values
- List of files that passed and failed execution
- Higher pass@k values indicate that the model generates more correct solutions.
- Lower pass@k values suggest that improvements (e.g., prompt engineering, fine-tuning) may be necessary.
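For reference, pass@k is usually computed with the unbiased estimator from the Codex paper (Chen et al., 2021): given `n` generated samples per problem of which `c` passed, it estimates the probability that at least one of `k` randomly drawn samples is correct. Whether `timeleet.py` uses exactly this estimator is an assumption; the standard formula looks like this:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k, so at least one passes
    return 1.0 - comb(n - c, k) / comb(n, k)
```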
```
python3 setup.py -srcdir starter_code --dst samples --num_samples 100
python3 justgen.py --model_name google/gemma-2-2b-it --max_model_len 1500 --outputdir res/eval100 --numsamples 100 --temp 0.7
python3 timeleet.py --dirs res/eval100 --samples 100
```

The final results will be stored in `res/eval100_results.csv`.
This pipeline automates the evaluation of LLMs on coding problems. You can modify model parameters or prompts to optimize performance.