This repository contains the official code, model, and database for the paper "DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval". DARE is a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval.
- RPKB (R Package Knowledge Base): A curated database derived from 8,191 high-quality CRAN packages, complete with statistical metadata (Data Profiles).
- DARE Model: A specialized bi-encoder model fine-tuned to fuse distributional features with function metadata, improving retrieval relevance by up to 17% (NDCG@10) over state-of-the-art open-source embeddings.
- RCodingAgent: An R-oriented LLM agent designed for reliable R code generation, validated on a comprehensive suite of downstream statistical analysis tasks.
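
For readers unfamiliar with the NDCG@10 metric mentioned above, the following is a minimal sketch of how it can be computed for a single query. The function names `dcg_at_k` and `ndcg_at_k` and the example relevance list are illustrative assumptions, not part of this repository. The ideal ranking is derived from the retrieved list only, which is a simplification; the paper's exact evaluation protocol may differ.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical example: the single relevant function is retrieved at rank 3.
ranked_relevance = [0, 0, 1, 0, 0]
score = ndcg_at_k(ranked_relevance, k=10)
print(f"NDCG@10 = {score:.3f}")  # prints "NDCG@10 = 0.500"
```

A score of 1.0 means the relevant functions appear at the top of the ranking; lower ranks are penalized logarithmically.
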
We have open-sourced both our embedding model and the pre-computed ChromaDB database on Hugging Face. You can run distribution-aware function retrieval out of the box.
Clone this repository and install the required dependencies:
git clone https://github.com/AMA-CMFAI/DARE.git
cd DARE
pip install -r requirements.txt

The following script automatically downloads the DARE model and the RPKB database from Hugging Face and performs a distribution-aware search.
# retrieval.py
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
import chromadb
import torch
import os
# 1. Load DARE Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retriever", trust_remote_code=False)
model.to(device)
# 2. Download and Connect to RPKB Database
db_dir = "./rpkb_db"
if not os.path.exists(os.path.join(db_dir, "DARE_db")):
    print("Downloading RPKB Database from Hugging Face...")
    snapshot_download(repo_id="Stephen-SMJ/RPKB", repo_type="dataset", local_dir=db_dir, allow_patterns="DARE_db/*")
client = chromadb.PersistentClient(path=os.path.join(db_dir, "DARE_db"))
collection = client.get_collection(name="inference")
# 3. Perform Search
query = "I have a sparse matrix with high dimensionality. I need to perform PCA."
query_embedding = model.encode(query, convert_to_tensor=False).tolist()
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["documents", "metadatas"]
)
# Display Results
for rank, (doc_id, meta) in enumerate(zip(results['ids'][0], results['metadatas'][0])):
    print(f"[{rank + 1}] Package: {meta.get('package_name')} :: Function: {meta.get('function_name')}")

- Upload the code for RCodingAgent.
- Upload the data for training and testing.
- Upload the benchmark data.
- Upload the evaluation code.
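
As a follow-up to the retrieval script above, the nested dictionary returned by `collection.query(...)` can be flattened into one record per hit, which is easier to pass on to an agent or to log. This is a minimal sketch: the helper `flatten_results` and the mock data are hypothetical, and only the metadata keys `package_name` and `function_name` are taken from the script. It assumes a single query, so only the first inner list of each field is read.

```python
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the nested result of a single-query search into ranked rows."""
    ids = results["ids"][0]
    # Fields not requested via `include` may be missing or None.
    metadatas = results.get("metadatas") or [[{}] * len(ids)]
    documents = results.get("documents") or [[None] * len(ids)]
    rows = []
    for rank, (doc_id, meta, doc) in enumerate(
        zip(ids, metadatas[0], documents[0]), start=1
    ):
        meta = meta or {}
        rows.append({
            "rank": rank,
            "id": doc_id,
            "package": meta.get("package_name"),
            "function": meta.get("function_name"),
            "document": doc,
        })
    return rows


# Hand-written stand-in for the query output; all values are made up.
mock_results = {
    "ids": [["fn-1", "fn-2"]],
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "sparse_pca"},
        {"package_name": "pkgB", "function_name": "fast_svd"},
    ]],
    "documents": [["doc text A", "doc text B"]],
}
for row in flatten_results(mock_results):
    print(f"[{row['rank']}] {row['package']}::{row['function']}")
```

In the real script, `flatten_results(results)` would be called on the value returned by `collection.query(...)` in place of `mock_results`.
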
If you find DARE, RPKB, or RCodingAgent useful in your research, please cite our work:
@article{sun2026dare,
title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
year={2026},
eprint={2603.04743},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2603.04743},
}