We introduce ***ReCall***, a novel framework that trains LLMs to ***Re***ason with Tool ***Call*** via reinforcement learning—without requiring any supervised data on tool use trajectories or reasoning steps. *ReCall* empowers LLMs to agentically use and combine arbitrary tools like [OpenAI o3](https://openai.com/index/introducing-o3-and-o4-mini/), offering an accessible approach toward general-purpose agents. Additionally, we provide a novel perspective to generate synthetic data with diverse environments and complex multi-step tasks, enabling LLMs to develop sophisticated tool-based reasoning capabilities. This is a work in progress and we are actively working on it.
+
+> [!IMPORTANT]
+> *ReCall* is the successor to [*ReSearch*](https://arxiv.org/abs/2503.19470) and represents a more comprehensive framework that extends beyond the search tool to support reasoning with any user-defined tools. It can be a drop-in replacement of *ReSearch*. We've archived the original implementation of *ReSearch* in the branch `re-search`.
We propose ***ReSearch***, a novel framework that trains LLMs to ***Re***ason with ***Search*** via reinforcement learning without using any supervised data on reasoning steps. Our approach treats search operations as integral components of the reasoning chain, where when and how to perform searches is guided by text-based thinking, and search results subsequently influence further reasoning.
-
 ## 📰 News
-- **[2025-03-27]** 🤗 We release our trained models on [Hugging Face](https://huggingface.co/collections/agentrl/research-67e506a0311bea06dc54878b), please check it out!
-- **[2025-03-26]** 🎉 We release the paper, update the code and open-source the models.
+- **[2025-04-24]** 🎉 We release the first version of *ReCall*, and archive the original implementation of *ReSearch*.
+  - ➡️ The name of the repository is changed from *ReSearch* to *ReCall*.
+  - 📝 We release a [blog](https://attractive-almandine-935.notion.site/ReCall-Learning-to-Reason-with-Tool-Call-for-LLMs-via-Reinforcement-Learning-1d7aec91e9bb8006ad40f9edbfe2191a) to introduce the idea of *ReCall*.
+  - 📦 Current implementation of *ReCall* is based on verl 0.3.0 + vllm 0.8.4.
+- **[2025-03-27]** 🤗 We release our trained *ReSearch* models on [Hugging Face](https://huggingface.co/collections/agentrl/research-67e506a0311bea06dc54878b), please check it out!
+- **[2025-03-26]** 🎉 We release the paper and update the code of *ReSearch*.
   - 📝 The **paper is released** on arXiv, more details and evaluation results can be found in our [paper](https://arxiv.org/abs/2503.19470).
   - 🛠️ The **repository is updated** with the new implementation, especially the rollout with search during RL training. This version of implementation is based on the latest release of verl.
-- **[2025-03-03]** ✅ We have released the preview version of ReSearch implementation.
+- **[2025-03-03]** ✅ We have released the preview version of *ReSearch* implementation.

 ## 📦 Installation

 We recommend using conda to manage the environment. First create a conda environment and activate it.
 ```bash
-conda create -n re-search python==3.10
-conda activate re-search
+conda create -n re-call python==3.10
+conda activate re-call
 ```
-Then install dependencies, and our modified verl and flashrag packages under ```src/``` will be installed in the editable mode. Check out ```setup.py``` for details.
+Then install dependencies, and the packages under ```src/``` will be installed in the editable mode. Check out ```setup.py``` for details.
As described in the [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#wrench-installation), due to the incompatibility when installing faiss using pip, we need to use the following conda command to install faiss-gpu.
+If you want to host a Wikipedia RAG system based on FlashRAG, you need to install faiss-gpu as follows. As described in the [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#wrench-installation), due to the incompatibility when installing faiss using pip, we need to use the following conda command to install faiss-gpu.
> If you want to learn the details of current version of *ReCall*, please refer to the [blog](https://attractive-almandine-935.notion.site/ReCall-Learning-to-Reason-with-Tool-Call-for-LLMs-via-Reinforcement-Learning-1d7aec91e9bb8006ad40f9edbfe2191a) first.
+
+### Data Preparation

-As described in our paper, during model training and evaluation, search operation will be conducted in the rollout and inference process. In practice, we host a retriever service via FlashRAG and FastAPI. Hence, the search operation is standardized to be an API call. This serving can be used to decouple the search operation from the reinforcement learning process, making the training and evaluation more clear and flexible.
+*ReCall* is trained on a mixture of our synthetic dataset `SynTool` and the training set of `MuSiQue`. You can download the preprocessed training data from [here](https://huggingface.co/datasets/agentrl/ReCall-data), and use such data directly for training.

-Before starting the retriever serving, you need to download the [pre-indexed wikipedia](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#index), [wikipedia corpus and corresponding retriever models](https://github.com/RUC-NLPIR/FlashRAG/blob/main/docs/original_docs/reproduce_experiment.md#preliminary). More details can be found in the documentation of FlashRAG.
+### Sandbox Serving

-For starting the retriever serving, you need to first fill the `scripts/serving/retriever_config.yaml` with the correct path to the retrieval model, index, and corpus, and available GPU ids. Then, you can run the following command to start the retriever serving:
+Since tools are implemented in executable Python code, the tool executor is responsible for running the Python code. To ensure safety and security, we implement a sandbox for running Python code on a remote server. To launch the sandbox service, run the following command:
 ```bash
 cd scripts/serving
-python retriever_serving.py \
-    --config retriever_config.yaml \
-    --num_retriever {num_retriever} \
-    --port {port}
+python sandbox.py --port {port}
 ```
+Note: The current implementation is a basic sandbox environment. We plan to use a more robust and secure sandbox in future updates. We recommend hosting the sandbox on a remote server, as local hosting may expose your machine to potential security risks.
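
For intuition only, the idea of keeping tool code away from the caller can be sketched as running it in a separate interpreter process with a timeout. This is not the implementation in `scripts/serving/sandbox.py`; the function name `run_python_snippet` and its return shape are assumptions made for illustration, and a plain subprocess is not a real security boundary.

```python
import subprocess
import sys


def run_python_snippet(code: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a fresh Python process and capture its output.

    Hypothetical helper for illustration; a subprocess alone does NOT
    isolate the host from malicious code.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode bytes to str
            timeout=timeout_s,    # kill runaway code
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

Calling `run_python_snippet("print(1 + 1)")` returns a dictionary whose `stdout` holds the printed text; a service wrapper would expose the same call over HTTP on the chosen port.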

-The started retriever serving will be used in the training and evaluation process in the following part.
-
-### Data Preparation
+### Retriever Serving

-*ReSearch* is trained on the training set of MuSiQue, and evaluated on the dev set of HotpotQA, 2WikiMultiHopQA, MuSiQue and Bamboogle. For downloading the datasets, please refer to the `data/download_dataset.sh` script.
-```bash
-cd data
-bash download_dataset.sh
-```
+For training on MuSiQue data with a Wikipedia search tool, we provide a Wikipedia retriever service implemented using FlashRAG and FastAPI. Before starting the retriever serving, you need to download the [pre-indexed wikipedia](https://github.com/RUC-NLPIR/FlashRAG?tab=readme-ov-file#index), [wikipedia corpus and corresponding retriever models](https://github.com/RUC-NLPIR/FlashRAG/blob/main/docs/original_docs/reproduce_experiment.md#preliminary). More details can be found in the documentation of FlashRAG.

-For preparing the training and validation data for following reinforcement learning, please run this script to parse the MuSiQue dataset to the parquet format.
+For starting the retriever serving, you need to first fill the `scripts/serving/retriever_config.yaml` with the correct path to the retrieval model, index, and corpus, and available GPU ids. Then, you can run the following command to start the retriever serving:
 ```bash
-cd data
-python prepare_musique.py
+cd scripts/serving
+python retriever_serving.py \
+    --config retriever_config.yaml \
+    --num_retriever {num_retriever} \
+    --port {port}
 ```

 ### Training
@@ -83,11 +86,12 @@ Here is an example of training Qwen2.5-7B-Instruct with 4 GPUs locally. Note tha
- For training base (pre-trained) models, please use `--apply_chat False` and `--prompt_template_name re_search_template`
-- For training instruction-tuned models, please use `--apply_chat True` and `--prompt_template_name re_search_template_sys`

 #### Multi-node training

-If you want to **fully reproduce** the results in our paper, please refer to the multi-node training script in `scripts/train/train_multi_node.sh`, as well as the implementation details in our paper.
+If you want to **fully reproduce** *ReCall*, please refer to the multi-node training script in `scripts/train/train_multi_node.sh`.

-### Evaluation
-
-We recommend using [SGLang](https://docs.sglang.ai/) to serve the trained model. You can download our open-sourced models or train your own models to conduct the evaluation. Here is an example of launching the model serving:
+### Inference
+This section demonstrates how to perform inference using the trained *ReCall* model. We provide a standard wrapper class in `src/re_call/inference/re_call.py` that simplifies the inference process. To get started, you only need to provide the model URL and sandbox URL, then use the `run` function to execute inference. The `ReCall` class handles all the orchestration between model generation and tool execution internally. For a practical example of using the `ReCall` class, please refer to our sample implementation at `scripts/inference/re_call_use_case.py`.
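
The orchestration between generation and tool execution mentioned above can be pictured with a small, generic loop. This is a sketch under stated assumptions, not the `ReCall` class from `src/re_call/inference/re_call.py`: the `<tool_call>` / `<tool_response>` tag format, the JSON payload shape, and the names `run_with_tools` and `generate` are all hypothetical.

```python
import json
import re
from typing import Callable, Dict

# Assumed tag format for illustration; the real prompt template may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def run_with_tools(
    generate: Callable[[str], str],
    tools: Dict[str, Callable[..., object]],
    prompt: str,
    max_turns: int = 4,
) -> str:
    """Alternate between model generation and tool execution.

    `generate` maps the transcript so far to the next model segment.
    """
    transcript = prompt
    for _ in range(max_turns):
        segment = generate(transcript)
        transcript += segment
        match = TOOL_CALL_RE.search(segment)
        if match is None:
            break  # no tool requested: treat the segment as the final answer
        call = json.loads(match.group(1))
        result = tools[call["name"]](**call["arguments"])
        # Feed the tool output back so the next generation can use it.
        transcript += f"<tool_response>{json.dumps(result)}</tool_response>"
    return transcript
```

In a real deployment `generate` would call the served model and each tool would run inside the sandbox service; here both are plain Python callables so the control flow is easy to follow.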
+
+For model serving, we recommend using [SGLang](https://docs.sglang.ai/). You can either download our open-source models or train your own models to conduct the inference. Here is an example of how to launch the model service:
We use [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG) as the standard evaluation environment. Here is an example of evaluating the performance of ReSearch-Qwen-7B-Instruct on Bamboogle test set.
+### Evaluation
+
+#### Multi-hop QA
+
+For the evaluation on multi-hop QA, we use [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG) as the standard evaluation environment. For downloading the evaluation data, please run the following command:
+```bash
+cd data
+bash download_dataset.sh
+```
+Here is an example of evaluating the performance of ReCall-Qwen-7B-Instruct on the Bamboogle test set.
For more details about the configuration, please refer to the `scripts/evaluation/eval_config.yaml` file.

-For base model, please use `--apply_chat False` and for instruction-tuned model, please use `--apply_chat True`, for loading correct prompt template when conducting evaluation for *ReSearch* model. For more details about the configuration, please refer to the `scripts/evaluation/eval_config.yaml` file.
+#### BFCL
+We will release the evaluation code on BFCL soon.

 ## 🤝 Acknowledge

-This training implementation is based on [verl](https://github.com/volcengine/verl) and the evaluation is based on [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The serving of retriever is based on [FastAPI](https://github.com/fastapi/fastapi). The model serving is based on [SGLang](https://docs.sglang.ai/). *ReSearch* models are trained based on [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We sincerely appreciate their contributions to the open-source community.
+This training implementation is based on [verl](https://github.com/volcengine/verl) and the evaluation is based on [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG) and BFCL. The serving of sandbox and retriever is based on [FastAPI](https://github.com/fastapi/fastapi). The model serving is based on [SGLang](https://docs.sglang.ai/). *ReCall* models are trained based on [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We sincerely appreciate their contributions to the open-source community.