arithmetic-interp

Overview

This repository provides evaluation scripts and results for the Arithmetic Length Generalization Dataset (ALGD), a benchmark designed to test the arithmetic reasoning and length generalization capabilities of large language models (LLMs). ALGD evaluates models on addition, subtraction, multiplication, division, and modulus tasks over numbers ranging from 1 to 7 digits.
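To make the task format concrete, here is a minimal sketch of how problems of this kind could be generated. It is not the procedure used to build ALGD: the `OPERATIONS` table, the `make_problem` helper, the use of integer division, and the choice to give both operands the same number of digits are all assumptions made for illustration.

```python
import random

# Hypothetical operation table; the names mirror the task list above.
OPERATIONS = {
    "addition": lambda a, b: a + b,
    "subtraction": lambda a, b: a - b,
    "multiplication": lambda a, b: a * b,
    "division": lambda a, b: a // b,  # integer division is an assumption
    "modulus": lambda a, b: a % b,
}


def random_operand(num_digits, rng):
    """Return a random positive integer with exactly num_digits digits."""
    return rng.randint(10 ** (num_digits - 1), 10 ** num_digits - 1)


def make_problem(operation, num_digits, rng):
    """Build one problem whose two operands both have num_digits digits."""
    a = random_operand(num_digits, rng)
    b = random_operand(num_digits, rng)  # always >= 1, so division is safe
    return {
        "operation": operation,
        "a": a,
        "b": b,
        "answer": OPERATIONS[operation](a, b),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for digits in range(1, 8):
        print(make_problem("addition", digits, rng))
```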

The ALGD dataset is hosted on Hugging Face.

Repository Features

Evaluation Notebooks: Jupyter notebooks for evaluating LLMs (GPT-4o-mini, LLaMa-3.1, Gemma2) on the ALGD dataset.
Precomputed Results: CSV files containing evaluation results for multiple models across different digit complexities.
Digit-Wise Accuracy Analysis: Accuracy broken down by digit position in each answer, reported alongside overall accuracy.
Reproducibility: Scripts and configurations to reproduce experiments from the research report "Length Generalization of Arithmetic Performance".

Repository Structure

.
├── evaluation
│   ├── gemma-2b-Instruct-ALGD-eval.ipynb      # Evaluation notebook for Gemma 2B
│   ├── llama-3.1-70B-Instruct-ALGD-eval.ipynb # Evaluation notebook for LLaMa 70B
│   └── llama-3.1-8B-Instruct-ALGD-eval.ipynb  # Evaluation notebook for LLaMa 8B
├── mech-interp-notebooks                 # Mechanistic interpretability notebooks
├── results
│   ├── gemma-2b-it/                      # Results for Gemma 2B
│   ├── gpt-4o-mini/                      # Results for GPT-4o-mini
│   ├── llama-3.1-70b/                    # Results for LLaMa 70B
│   └── llama-3.1-8b/                     # Results for LLaMa 8B
├── scripts                               # Scripts
├── length-generalization-of-arithmetic-performance.pdf # Research paper
├── LICENSE                               # License file
├── README.md                             # Documentation
└── requirements.txt                      # Python dependencies

Usage

1. Clone the Repository

git clone git@github.com:raishish/arithmetic-interp.git
cd arithmetic-interp

2. Install Dependencies

Install Python dependencies using pip:

pip install -r requirements.txt

3. Download the Dataset

The ALGD dataset is available on Hugging Face. Download it using the datasets library:

from datasets import load_dataset
# If the dataset lives under a user or organization account, use its full "<namespace>/<name>" ID.
dataset = load_dataset("algd")

Alternatively, visit the Hugging Face page to explore or download the dataset.
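Once the dataset is loaded, it helps to check which splits and columns it contains before running an evaluation. The sketch below assumes only that `load_dataset` returned a mapping of split names to objects with a length and a `column_names` attribute, which is how a `DatasetDict` behaves. `summarize_splits` is a hypothetical helper, not part of this repository, and the ALGD split and column names are not assumed here.

```python
def summarize_splits(dataset_dict):
    """Map each split name to (row count, column names)."""
    return {
        name: (len(split), list(split.column_names))
        for name, split in dataset_dict.items()
    }


# Usage (needs network access and the `datasets` package):
# from datasets import load_dataset
# print(summarize_splits(load_dataset("algd")))
```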

4. Run Evaluation Notebooks

Use the provided Jupyter notebooks in the evaluation/ directory to evaluate models on the ALGD dataset. For example:

jupyter notebook evaluation/llama-3.1-8B-Instruct-ALGD-eval.ipynb
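Scoring a free-text model response usually means pulling a number out of it and comparing it with the reference answer. The following is a minimal sketch of one way to do that, assuming the final integer in the response is the model's answer. The notebooks may parse and score responses differently; `extract_integer`, `exact_match`, and `accuracy` are hypothetical names.

```python
import re


def extract_integer(response):
    """Return the last integer found in a model response, or None."""
    # Accept an optional leading minus sign and thousands separators.
    matches = re.findall(r"-?\d[\d,]*", response)
    if not matches:
        return None
    return int(matches[-1].replace(",", ""))


def exact_match(response, expected):
    """True when the extracted integer equals the reference answer."""
    return extract_integer(response) == expected


def accuracy(responses, answers):
    """Fraction of responses whose extracted integer matches the answer."""
    pairs = list(zip(responses, answers))
    if not pairs:
        return 0.0
    return sum(exact_match(r, a) for r, a in pairs) / len(pairs)
```

Taking the last integer is a simple heuristic: it handles responses that restate the operands before giving the result, but it misreads responses that add trailing numbers after the answer.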

5. Analyze Results

Precomputed results are available in the results/ directory as CSV files. These files include overall accuracy and digit-wise accuracy for each model and task.
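As a starting point for exploring these files, the sketch below lists every CSV under results/ and prints its header row. It assumes the script is run from the repository root and that each CSV has a header line. The file names and column layout are not documented here, so the code does not rely on either.

```python
import csv
from pathlib import Path


def list_result_files(results_dir="results"):
    """Return all CSV files below results_dir, sorted by path."""
    return sorted(Path(results_dir).rglob("*.csv"))


def read_header(csv_path):
    """Return the first row of a CSV file, or an empty list if it is empty."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    for path in list_result_files():
        print(path, read_header(path))
```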

Results Summary

The repository includes precomputed results for several state-of-the-art LLMs:

GPT-4o-mini
LLaMa-3.1 (70B and 8B variants)
Gemma-2B

Key observations:

Models perform well on simpler tasks (e.g., 4-digit addition or 3-digit multiplication).
Accuracy declines sharply with increasing numerical complexity or unique digit counts.
Digit-wise analysis reveals that models often generate accurate first and last digits but struggle with middle digits in high-complexity tasks.

Refer to the results/ directory for detailed performance metrics.
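The digit-wise observation can be illustrated with a small sketch that measures accuracy per digit position. This is one possible definition, not necessarily the one used to produce the CSV files: it assumes non-negative integer answers, pads a short prediction with leading zeros, and counts every position as wrong when the prediction is longer than the reference. `digit_matches` and `positional_accuracy` are hypothetical names.

```python
from collections import defaultdict


def digit_matches(predicted, expected):
    """Compare two non-negative integers digit by digit, left to right."""
    exp = str(expected)
    pred = str(predicted).zfill(len(exp))  # pad short predictions with zeros
    if len(pred) != len(exp):
        # Prediction has too many digits: treat every position as wrong.
        return [False] * len(exp)
    return [p == e for p, e in zip(pred, exp)]


def positional_accuracy(pairs):
    """Return {answer length: [accuracy at position 0, 1, ...]}.

    pairs is an iterable of (predicted, expected) integers. Answers are
    grouped by the length of the expected answer so that positions line up.
    """
    rows_by_width = defaultdict(list)
    for predicted, expected in pairs:
        matches = digit_matches(predicted, expected)
        rows_by_width[len(matches)].append(matches)
    return {
        width: [sum(column) / len(column) for column in zip(*rows)]
        for width, rows in rows_by_width.items()
    }
```

With this definition, a model that gets the first and last digits right but the middle digits wrong shows high accuracy at the outer positions and low accuracy in between.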

Citation

If you use this repository or the ALGD dataset in your research, please cite:

@article{rai2024arithmetic-length-generalization-performance,
  title={Characterizing arithmetic length generalization performance in large language models},
  author={Rai, Ashish and Peddaputha, Akash and Gupta, Aman},
  year={2024},
  month={Dec},
  url={https://raishish.github.io/blog/2025/characterizing-arithmetic-length-generalization}
}

License

This project is licensed under the MIT License. See LICENSE for more details.
