This repository contains the code and data for the paper:
EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
Preprint.
- Paper
- Code & Data
- Web Interface (Interactive Demo of Capability Trees)
If you find our work useful, please consider citing:
@article{zeng2025evaltree,
title={EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees},
author={Zeng, Zhiyuan and Wang, Yizhong and Hajishirzi, Hannaneh and Koh, Pang Wei},
journal={arXiv preprint arXiv:2503.08893},
year={2025}
}

If you have any questions about the code or the paper, feel free to contact Zhiyuan Zeng (zhiyuan1zeng@gmail.com or zyzeng@cs.washington.edu).
If you encounter any issues while using the code or want to report a bug, please open an issue. When reporting a problem, provide detailed information so we can assist you more effectively.
Install the required dependencies.
pip install -r requirements.txt

Set up your API keys and ensure that your OpenAI and Hugging Face credentials are correctly configured before running the code.
export OPENAI_API_KEY="your_openai_api_key"
export HF_TOKEN="your_huggingface_access_token"

A model's evaluation result on a benchmark is stored in Datasets/BENCHMARK/eval_results/real/MODEL/results.json.
This file contains the performance metric vector; all weakness profiling methods operate on performance metrics rather than on the original generation results.
For instruction-following benchmarks, MODEL takes the form [MODEL1]BEAT[MODEL2], indicating that the file contains the preference label vector determined by an LM-as-a-judge comparison between MODEL1 and MODEL2. In this case, 1 means MODEL1 is preferred, and 2 means MODEL2 is preferred.
Each instance has two preference labels to account for both the original order and the swapped response order in pairwise comparisons.
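For illustration, such a results.json file could be inspected as in the sketch below; the path (with MODEL as a placeholder) and the JSON schema (assumed here to be a flat list of per-instance metrics) should be adapted to the actual files.

```python
import json

# Assumption: results.json stores a flat per-instance performance metric vector,
# e.g. 0/1 correctness values for capability benchmarks. MODEL is a placeholder.
with open("Datasets/MATH/eval_results/real/MODEL/results.json") as f:
    metrics = json.load(f)

print(f"{len(metrics)} instances, mean metric = {sum(metrics) / len(metrics):.3f}")

# For instruction-following benchmarks, MODEL is [MODEL1]BEAT[MODEL2] and the
# vector holds preference labels (1 = MODEL1 preferred, 2 = MODEL2 preferred),
# with two labels per instance to cover both response orders.
```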
Using EvalTree, we first run the automatic four-stage tree construction pipeline on each benchmark.
We first run the Capability Annotation stage. Precomputed results are available in Datasets/BENCHMARK/EvalTree/stage1-CapabilityAnnotation/[annotation=gpt-4o-mini].json.
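For a rough sense of what this stage produces (a natural-language capability annotation for each instance), here is a minimal sketch using the OpenAI chat API with gpt-4o-mini; the actual prompt and output format used by annotate.sh may differ.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def annotate_capability(instance_text: str) -> str:
    # Assumption: one single-sentence capability annotation per instance; the
    # real pipeline's prompt and parsing live in the stage-1 code.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "In one sentence, describe the specific capability needed to solve this task:\n\n" + instance_text,
        }],
    )
    return response.choices[0].message.content.strip()
```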
bash EvalTree/stage1-CapabilityAnnotation/annotate.sh
# Precomputed results are already available.

We then run the Capability Embedding stage. Outputs are stored in Datasets/BENCHMARK/EvalTree/stage2-CapabilityEmbedding/[annotation=gpt-4o-mini]_[embedding=text-embedding-3-small].bin.
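As a sketch, capability annotations can be embedded with text-embedding-3-small roughly as follows; the batching and the .bin serialization used by embedding.sh are not shown here.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_capabilities(capabilities: list[str]) -> np.ndarray:
    # One embedding per capability annotation, stacked into an (N, D) matrix.
    response = client.embeddings.create(model="text-embedding-3-small", input=capabilities)
    return np.array([item.embedding for item in response.data], dtype=np.float32)
```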
bash EvalTree/stage2-CapabilityEmbedding/embedding.sh

Next, we run the Recursive Clustering-Based Construction stage. Outputs are stored in Datasets/BENCHMARK/EvalTree/stage3-RecursiveClustering/[split=SPLIT]_[annotation=gpt-4o-mini]_[embedding=text-embedding-3-small]_[max-children=10].bin.
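Conceptually, this stage recursively clusters the capability embeddings, capping each node at 10 children (max-children=10). The sketch below illustrates such a recursion with k-means; how the actual pipeline chooses the number of clusters and serializes the tree follows the paper and may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_tree(indices: np.ndarray, embeddings: np.ndarray, max_children: int = 10) -> dict:
    # Leaf: few enough instances that no further splitting is needed.
    if len(indices) <= max_children:
        return {"instances": indices.tolist(), "children": []}
    # Assumption: the child count is simply capped at max_children; the real
    # construction may select the number of clusters differently.
    labels = KMeans(n_clusters=max_children, n_init=10, random_state=0).fit_predict(embeddings[indices])
    if len(np.unique(labels)) <= 1:  # degenerate split; stop recursing
        return {"instances": indices.tolist(), "children": []}
    return {
        "instances": indices.tolist(),
        "children": [build_tree(indices[labels == c], embeddings, max_children)
                     for c in np.unique(labels)],
    }
```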
bash EvalTree/stage3-RecursiveClustering/build.sh

Finally, we run the Capability Description stage. Precomputed results are available in Datasets/BENCHMARK/EvalTree/stage3-RecursiveClustering/[split=SPLIT]_[annotation=gpt-4o-mini]_[embedding=text-embedding-3-small]_[max-children=10]_[stage4-CapabilityDescription-model=gpt-4o-mini].json.
bash EvalTree/stage4-CapabilityDescription/describe.sh
# Precomputed results are already available.In our experiments, we do not explicitly construct capability trees (i.e., the tree structure with a modelโs performance computed at each node).
Instead, we directly compute the confidence interval of the binomial test for each model's evaluation result.
This allows us to efficiently generate weakness profiles at varying thresholds. The results are stored in Datasets/BENCHMARK/eval_results/real/MODEL/EvalTree/TREE=[stage3-RecursiveClustering]_[split=SPLIT]_[annotation=gpt-4o-mini]_[embedding=text-embedding-3-small]_[max-children=10]/confidence_interval.json.
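For intuition, the interval for a single node can be computed from the 0/1 metrics of the instances under it with a binomial test; a minimal sketch using scipy is shown below (the extraction criterion in the final comment is one plausible reading; the paper defines the exact rule).

```python
from scipy.stats import binomtest

def node_confidence_interval(instance_metrics: list[int], confidence_level: float = 0.95):
    # instance_metrics: 0/1 outcomes of the test-set instances under a tree node.
    successes = sum(instance_metrics)
    ci = binomtest(successes, n=len(instance_metrics)).proportion_ci(confidence_level=confidence_level)
    return ci.low, ci.high

# A node whose interval lies entirely below a threshold can then be flagged as a
# weakness (and analogously above the threshold for strengths).
```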
bash EvalTree/WeaknessProfile/confidence_interval.sh
# Precomputed results are already available.

We run the following commands to obtain the one-level capability categorization structure of QualEval for each benchmark.
bash Baselines/QualEval/stage1-CapabilityDiscovery/discover.sh
# Precomputed results are already available.
bash Baselines/QualEval/stage2-CapabilityAssignment/assign.sh
# Precomputed results are already available.We first run all weakness profiling methods on all evaluation results.
As described in the paper, we tune each method's hyperparameter(s).
bash Assessments/LowPerformance/run.sh
# Precomputed results are already available.

We then assess all methods using Low-Performance Identification Assessment.
The assessment results are stored in Assessments/LowPerformance/results/BENCHMARK/real/MODEL.
The files size2val1 and num2val2 correspond to the results of the two assessment settings described in the paper.
bash Assessments/LowPerformance/assess.sh
# Precomputed results are already available.Finally, we generate the result figure.
python -m Assessments.LowPerformance.results.figure
# The figure is already available.

As preparation for Ground-Truth Weakness Assessment, we manually curated 10 ground-truth weaknesses at various granularities for each of MATH and WildChat10K.
The ground-truth weakness profile is stored in Datasets/{MATH, WildChat10K}/eval_results/synthetic/ground-truth.json.
For each benchmark, we first generate three synthetic evaluation results using the hyperparameters base=0.7, drate in {0.2, 0.4, 0.5}, and seed=0 (one result per drate value).
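The sketch below shows one way such a synthetic result could be constructed; treating drate as a drop applied to the base success rate on instances covered by a ground-truth weakness is an assumption here, and the exact construction is defined by generate_synthetic-result.sh.

```python
import random

def synthetic_result(num_instances: int, weakness_instances: set[int],
                     base: float = 0.7, drate: float = 0.2, seed: int = 0) -> list[int]:
    # Assumption: instances under a ground-truth weakness succeed with probability
    # base - drate, all other instances with probability base.
    rng = random.Random(seed)
    return [1 if rng.random() < (base - drate if i in weakness_instances else base) else 0
            for i in range(num_instances)]
```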
bash Assessments/Synthetic/generate_synthetic-result.sh
# Precomputed results are already available.We then run all weakness profiling methods on all synthetic evaluation results.
bash Assessments/Synthetic/run.sh
# Precomputed results are already available.Finally, we assess all methods using Ground-Truth Weakness Assessment.
The assessment results are stored in Assessments/Synthetic/results/{MATH, WildChat10K}/[base=0.7]_[drate={0.2, 0.4, 0.5}]_[seed=0].
bash Assessments/Synthetic/assess.sh
# Precomputed results are already available.Finally, we generate the result figure.
python -m Assessments.Synthetic.results.figure --metrics F1
python -m Assessments.Synthetic.results.figure --metrics Precision
python -m Assessments.Synthetic.results.figure --metrics Recall
# The figures are already available.

We first generate weakness profiles using all weakness profiling methods.
bash Assessments/Extrinsic/data/profile-generation.sh
# Precomputed results are already available.

We then generate synthetic data inputs. For each synthetic data collection strategy, we construct a pool of synthetic data inputs. When experimenting with different seeds, we sample from this pool to simulate re-running synthetic data generation with a new seed.
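Sampling from such a pool with a given seed can be as simple as the following sketch (the pool format, sample size, and any deduplication are handled by the actual script):

```python
import random

def sample_inputs(pool: list[str], num_inputs: int, seed: int) -> list[str]:
    # Re-running generation with a new seed is simulated by drawing a fresh
    # sample (without replacement) from the fixed pool of synthetic inputs.
    return random.Random(seed).sample(pool, num_inputs)
```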
bash Assessments/Extrinsic/data/generate_input.sh
# Precomputed results are already available.

We generate outputs for each input across all data collection strategies.
bash Assessments/Extrinsic/data/generate_output.sh
# Precomputed results are already available.

Next, we construct training sets for each data collection strategy using five different seeds.
bash Assessments/Extrinsic/data/generate_data/generate_data.sh
# Precomputed results are already available.

We then finetune the initial LM (Llama 3.1 8B Instruct for MATH and DeepSeek-Coder-Base 6.7B for DS-1000) on each training set.
Training outputs are stored in ../{MATH, DS-1000}_checkpoints/STRATEGY_[seed=SEED]_[epoch=EPOCH].
bash Assessments/Extrinsic/training/train.sh

Precomputed generation results and evaluation results are available in Assessments/Extrinsic/results/{MATH, DS-1000}.
Using the evaluation results, we generate the result figure.
python -m Assessments.Extrinsic.results.figure
# The figure is already available.

As described in the paper, we locate each test-set instance on the capability tree by first computing its capability embedding and then traversing from the root guided by that embedding. We precompute each test-set instance's traversal trajectory on the capability tree.
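A simplified sketch of this traversal is shown below: a greedy descent that, at each node, follows the child whose capability embedding is most similar to the instance's embedding. The node schema (an "embedding" field and a "children" list) is assumed here, and the actual implementation may use a different similarity or stopping rule.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def locate(instance_embedding: np.ndarray, tree: dict) -> list[dict]:
    # Greedy descent from the root, recording the trajectory of visited nodes.
    trajectory, node = [tree], tree
    while node.get("children"):
        node = max(node["children"],
                   key=lambda child: cosine(instance_embedding, child["embedding"]))
        trajectory.append(node)
    return trajectory
```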
bash EvalTree/WeaknessProfile/ExtractedNode_Analysis/locate.sh
# Precomputed results are already available.

We then compute the performance on weakness/strength instances as the threshold varies.
Precomputed results are available in EvalTree/WeaknessProfile/ExtractedNode_Analysis/results/BENCHMARK1->BENCHMARK2.
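In outline, for each threshold the weakness nodes are extracted from the precomputed confidence intervals and performance is averaged over the test-set instances located under them; a rough sketch (with an assumed per-node schema of "ci_high" and "instances", and one plausible extraction criterion) is:

```python
import numpy as np

def weakness_performance(nodes: list[dict], metrics: list[int], threshold: float) -> float:
    # Assumption: each node carries its confidence interval upper bound ("ci_high")
    # and the indices of the test-set instances located under it ("instances").
    # The extraction criterion here (upper bound below the threshold) is one
    # plausible reading; the paper defines the exact rule.
    selected = {i for node in nodes if node["ci_high"] < threshold for i in node["instances"]}
    if not selected:
        return float("nan")
    return float(np.mean([metrics[i] for i in selected]))
```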
bash EvalTree/WeaknessProfile/ExtractedNode_Analysis/analysis_varying-threshold.sh
# Precomputed results are already available.

Finally, we generate the result figures.
bash EvalTree/WeaknessProfile/ExtractedNode_Analysis/results/figure.sh
python -m EvalTree.WeaknessProfile.ExtractedNode_Analysis.results.figure_instruction-following
# The figures are already available.

The demo branch contains code to help you build an interface for exploring capability trees interactively.
A demo of the interface is available via the Web Interface link above.
Once you have constructed the tree for a benchmark (following the steps above) and added your own model evaluation results, proceed with the following steps:
- Generate Capability Distinctions: Run EvalTree/EvalTree/stage5-CapabilityDistinguishing to generate a natural language distinction for each (non-root) node. The distinction differentiates a node from its siblings, giving a more concise and user-friendly description of its capability.
- Prepare Data for the Interface: Execute EvalTree/build_data.py to generate the necessary data files for the interface.
- Customize Metadata: Modify meta.json to include your benchmark and model information.