This toolkit processes MMLU questions, merges them with DOVE robustness scores, and generates model weakness profiles for deeper performance analysis.
It is designed to work alongside the EvalTree framework, replacing accuracy-based rankings with robustness-based rankings.
- `extract_mmlu_questions.py`: Extracts questions from MMLU and produces a JSON file mapping each MMLU question to its index.
- `extract_dove_scores_Llama.py`, `extract_dove_scores_OLMoE.py`: Extract per-question DOVE robustness scores for the LLaMA and OLMoE models.
- `merge_dove_score_with_mmlu_accuracy_score.py`: Combines a model's DOVE scores with the corresponding MMLU question indices.
- `replace_accuracy_ranking_in_DOVE.py`: Updates EvalTree rankings by replacing accuracy scores with DOVE robustness scores.
- `weakness_question_generator.py`: Produces a weakness profile summarizing model performance across different question types.

Illustrative sketches of these scripts follow.
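A minimal sketch of what `extract_mmlu_questions.py` does, assuming MMLU is loaded through the Hugging Face `datasets` library; the dataset config, split, and output file name are assumptions, not the script's actual settings:

```python
import json

from datasets import load_dataset

# Load the MMLU test split; the "all" config merges every subject.
mmlu = load_dataset("cais/mmlu", "all", split="test")

# Map each question's text to its index so downstream scripts can join on it.
question_to_index = {row["question"]: i for i, row in enumerate(mmlu)}

with open("mmlu_questions.json", "w") as f:
    json.dump(question_to_index, f, indent=2)
```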
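A sketch of the per-model extraction in `extract_dove_scores_Llama.py` / `extract_dove_scores_OLMoE.py`. The file path and the `question` / `score` column names are assumptions about the DOVE data layout; the idea is that DOVE records one result per prompt perturbation, so a question's robustness score is its mean success rate over perturbations:

```python
import json

import pandas as pd

# Hypothetical path: one row per prompt perturbation of an MMLU question.
DOVE_FILE = "dove/llama_mmlu.parquet"  # assumed file and layout

df = pd.read_parquet(DOVE_FILE)

# Robustness per question: mean correctness over all prompt perturbations.
# The "question" and "score" column names are assumptions.
scores = df.groupby("question")["score"].mean()

with open("dove_scores_Llama.json", "w") as f:
    json.dump(scores.to_dict(), f, indent=2)
```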
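A sketch of the join performed by `merge_dove_score_with_mmlu_accuracy_score.py`, assuming the two JSON outputs sketched above (the file names are illustrative): DOVE scores keyed by question text are re-keyed by MMLU question index.

```python
import json

# File names are illustrative; they match the outputs sketched above.
with open("mmlu_questions.json") as f:
    question_to_index = json.load(f)  # question text -> MMLU index
with open("dove_scores_Llama.json") as f:
    dove_scores = json.load(f)        # question text -> robustness score

# Re-key the DOVE scores by MMLU index, skipping questions without a score.
merged = {
    question_to_index[q]: s for q, s in dove_scores.items() if q in question_to_index
}

with open("dove_scores_by_index.json", "w") as f:
    json.dump(merged, f, indent=2)
```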
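A sketch of `replace_accuracy_ranking_in_DOVE.py`. The node layout used here (nested `subtrees` with leaf-level `index` and `score` fields) is an assumption about the EvalTree JSON; the idea is a recursive walk that swaps each leaf's accuracy for its DOVE robustness score:

```python
import json

with open("dove_scores_by_index.json") as f:
    dove = json.load(f)  # MMLU index (string keys in JSON) -> robustness score

def replace_scores(node):
    """Recursively swap each leaf's accuracy for its DOVE robustness score."""
    if "subtrees" in node:                 # internal node (assumed key)
        for child in node["subtrees"]:
            replace_scores(child)
    elif str(node.get("index")) in dove:   # leaf holding one MMLU question
        node["score"] = dove[str(node["index"])]

with open("eval_tree.json") as f:          # illustrative input path
    tree = json.load(f)
replace_scores(tree)

with open("eval_tree_dove.json", "w") as f:
    json.dump(tree, f, indent=2)
```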
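A sketch of the aggregation behind `weakness_question_generator.py`: mean robustness is computed per top-level category of the DOVE-scored tree, and categories falling below a cutoff are flagged as weaknesses. The `capability` key and the 0.5 threshold are illustrative assumptions:

```python
import json
from statistics import mean

WEAKNESS_THRESHOLD = 0.5  # assumed cutoff for flagging a weakness

def leaf_scores(node, out):
    """Collect every leaf robustness score under a node."""
    if "subtrees" in node:
        for child in node["subtrees"]:
            leaf_scores(child, out)
    elif "score" in node:
        out.append(node["score"])

with open("eval_tree_dove.json") as f:
    tree = json.load(f)

# Average robustness per top-level category; low averages mark weaknesses.
profile = {}
for category in tree.get("subtrees", []):
    scores = []
    leaf_scores(category, scores)
    if scores and mean(scores) < WEAKNESS_THRESHOLD:
        profile[category.get("capability", "unnamed")] = round(mean(scores), 3)

with open("weakness_profile.json", "w") as f:
    json.dump(profile, f, indent=2)
```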
The repository also contains the following directories:

- `EvalTree/`: The original EvalTree repository (unmodified baseline).
- `Question Generation/`: Scripts for generating and evaluating new questions based on the generated weakness profile.
- `Replace Accuracy For DOVE Ranking/`: Modified EvalTree trees in which accuracy scores are replaced with DOVE robustness scores.
- `plots/`: Generated visualizations illustrating the results.
- `data/`: The trees and tables used in the analysis, including input data for processing and evaluation.