Environment • Data • Evaluation
We investigate the tool learning ability of 41 prevalent LLMs by reproducing 33 benchmarks and enabling one-click evaluation for seven of them, forming a Tool Learning Platform named ToLeaP. Our goal is to deepen understanding of current work and thereby inform future directions in the tool learning domain. ToLeaP integrates 7 of the 33 benchmarks: it takes an LLM as input and outputs the values of all 64 evaluation metrics defined by these benchmarks.
conda create -n toleap python=3.10 -y && conda activate toleap
git clone https://github.com/Hytn/ToLeaP.git && cd ToLeaP
pip install -e .
cd scripts
pip install vllm==0.6.5
pip install rouge_score # taskbench
pip install mmengine # teval
pip install nltk accelerate # injecagent
bash ../src/benchmark/bfcl/bfcl_setup.sh
Note: Please use vllm==0.6.5. If you want to test newer models such as the Qwen3 series, we recommend using the Transformers library instead of vllm: the latest vllm releases conflict with parts of our code implementation, which may affect the results of benchmarks such as RoTBench.
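If you go the Transformers route, a minimal generation sketch looks like the following (the model id, prompt, and generation settings are placeholders, not ToLeaP defaults):

# Minimal sketch: run chat generation with Transformers instead of vllm.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # placeholder: any HF model id or local path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Which tool would you call to check the weather in Paris?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))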
First, run:
cd data
mkdir rotbench sealtools taskbench injecagent glaive stabletoolbench apibank
cd ..
cd src/benchmark/rotbench
bash rotbench.sh
cd src/benchmark/sealtools
bash sealtools.sh
cd src/benchmark/taskbench
python taskbench.py
cd src/benchmark/glaive
python glaive2sharegpt.py
cd data
unzip teval.zip
rm teval.zip
cd src/benchmark/injecagent
bash injecagent.sh
Download the data from this link, and place the files in the data/stabletoolbench folder.
After downloading the data, the directory structure should look like this:
├── /data/
│   ├── /glaive/
│   │   ├──
│   │   ├── ...
│   ├── /injecagent/
│   │   ├── attacker_cases_dh.jsonl
│   │   ├── ...
├── /scripts/
│   ├── /gorilla/
│   ├── bfcl_standard.py
│   ├── ...
├── /src/
│   ├── /benchmark/
│   │   ├── ...
│   ├── /cfg/
│   │   ├── ...
│   ├── /utils/
│   │   ├── ...
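Optionally, you can sanity-check that the expected data folders exist before running any evaluation. This is a hypothetical helper, not a script shipped with the repository:

import os

# Data folders that the preparation steps above should have created and filled.
expected = ["rotbench", "sealtools", "taskbench", "teval", "injecagent",
            "glaive", "stabletoolbench", "apibank"]
missing = [name for name in expected if not os.path.isdir(os.path.join("data", name))]
print("missing data folders:", missing if missing else "none")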
First, run:
mkdir results
cd results
mkdir rotbench sealtools taskbench teval injecagent glaive stabletoolbench
cd ..
If you want to perform one-click evaluation, run:
cd scripts
# Usage: bash one-click-evaluation.sh model_path is_api gpu_num batch_size input_len output_len display_name
bash one-click-evaluation.sh meta-llama/Llama-3.1-8B-Instruct false 1 256 4096 512 llama3.1
If you prefer to evaluate each benchmark separately, follow the instructions below. Once {model_name}_results.json is generated, you can convert the .json results into .csv format with json2csv.py, making it easy to fill in the results table.
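For reference, the conversion json2csv.py performs is essentially a flattening of metric names and values into one CSV row. A minimal sketch, assuming the results file is a flat dictionary of metric names to values (field names are illustrative; the actual script may differ):

import csv
import json
import sys

# Read a {model_name}_results.json file and write the metrics as a single CSV row.
results_path = sys.argv[1]  # e.g. ../results/rotbench/llama3.1_results.json
with open(results_path) as f:
    results = json.load(f)  # assumed: flat dict mapping metric name -> value

out_path = results_path.replace(".json", ".csv")
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(results.keys()))
    writer.writeheader()
    writer.writerow(results)
print(f"wrote {out_path}")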
cd scripts
python rotbench_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
cd scripts
python sealtools_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
cd scripts
python taskbench_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
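As a side note, the rouge_score dependency installed earlier is used for TaskBench's text-similarity metrics. Purely as an illustration of that kind of metric, not the benchmark's actual pipeline:

from rouge_score import rouge_scorer

# Compare a predicted task/tool plan against a reference plan with ROUGE.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "search flights then book the cheapest one",  # reference
    "search for flights and book the cheapest",   # prediction
)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)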
WARNING: As the official BFCL codebase changes frequently, if the following instructions do not work, please refer to the latest official repository.
Before using BFCL for evaluation, some preparation steps are required:
- Ensure that the model you want to evaluate is included in the handler mapping file scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/model_config.py.
- If you want to evaluate API models, set the API key:
  export OPENAI_API_KEY="your-api-key"
  To use an unofficial base URL, modify the following code in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/api_inference/openai.py:
  self.client = OpenAI(
      api_key=os.getenv("OPENAI_API_KEY"),
      base_url=os.getenv("OPENAI_API_BASE"),
  )
  Then:
  export OPENAI_API_KEY="your-api-key"
  export OPENAI_API_BASE="your-base-url"
- To add the --max-model-len or --tensor-parallel-size parameters, modify the code around line 130 in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/model_handler/local_inference/base_oss_handler.py.
- To run the evaluation in parallel, change the VLLM_PORT in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/eval_config.py.
- If you want to use a locally trained model, ensure the model path does not contain underscores. Otherwise, to avoid conflicts, manually add the following code after model_name_escaped = model_name.replace("_", "/"):
  - in the generate_leaderboard_csv function in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner_helper.py, and
  - in the runner function in scripts/gorilla/berkeley-function-call-leaderboard/bfcl/eval_checker/eval_runner.py:
  if model_name == "sft_model_merged_lora_checkpoint-20000":
      model_name_escaped = "/sft_model/merged_lora/checkpoint-20000"
- To ensure the evaluation results are properly recorded, add the model path to scripts/gorilla/berkeley-function-call-leaderboard/bfcl/constants/model_metadata.py. Example:
  MODEL_METADATA_MAPPING = {
      "/path/to/sft_model/merged_lora/checkpoint-60000": [
          "",
          "",
          "",
          "",
      ],
      ...
  }
Finally, run:
bfcl generate \
--model meta-llama/Llama-3.1-8B-Instruct \
--test-category parallel,multiple,simple,parallel_multiple,java,javascript,irrelevance,live,multi_turn \
--num-threads 1
bfcl evaluate --model meta-llama/Llama-3.1-8B-Instruct
cd scripts
python glaive_eval.py --model meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
cd scripts
bash teval_eval.sh meta-llama/Llama-3.1-8B-Instruct Llama-3.1-8B-Instruct False 4
Then run
python standard_teval.py ../results/teval/Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct_-1_.json
to obtain the clean results.
To evaluate API models, run:
bash teval_eval.sh gpt-3.5-turbo gpt-3.5-turbo True
cd scripts
export OPENAI_API_KEY="your-openai-api-key"
python injecagent_eval.py \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--use_cache
cd scripts
python apibank_eval.py --model_name meta-llama/Llama-3.1-8B-Instruct
To evaluate API models, add --is_api True.
First, set up the environment:
pip install skythought
According to the original authors' recommendation, you must use datasets==2.21.0; otherwise, some benchmarks will not run correctly.
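A quick, optional check that the pinned release is the one actually installed (this snippet is not part of the repository):

import datasets

# The skythought benchmarks expect this exact datasets release (see the note above).
assert datasets.__version__ == "2.21.0", f"found datasets {datasets.__version__}"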
Then run:
bash one-click-sky.sh
to evaluate all tasks. You can specify models and tasks within the script.