AMA-CMFAI/DARE: Code for "DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval"

This repository contains the official code, model, and database for the paper "DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval". DARE is a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval.

🌟 Key Contributions

RPKB (R Package Knowledge Base): A curated database derived from 8,191 high-quality CRAN packages, complete with statistical metadata (Data Profiles).
DARE Model: A specialized bi-encoder model fine-tuned to fuse distributional features with function metadata, improving retrieval relevance by up to 17% (NDCG@10) over state-of-the-art open-source embeddings.
RCodingAgent: An R-oriented LLM agent designed for reliable R code generation, validated on a comprehensive suite of downstream statistical analysis tasks.
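
For readers unfamiliar with the NDCG@10 metric cited above, here is a minimal sketch of how it is commonly computed. It is not the paper's evaluation code (which has not been released here). It assumes the linear-gain form of DCG and an ideal ranking built only from the retrieved list; the function names `dcg_at_k` and `ndcg_at_k` are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels, in rank order.

    Assumption: linear gain (rel / log2(rank + 1)), not the 2^rel - 1 variant.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering.

    Assumption: the ideal ordering is built from the same list, i.e. every
    relevant item is assumed to appear somewhere in `relevances`.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg_at_k(relevances, k) / ideal_dcg


# A relevant function ranked first scores higher than the same function ranked second.
print(ndcg_at_k([1, 0, 0]))  # → 1.0
print(ndcg_at_k([0, 1, 0]))
```

Because the discount is logarithmic in the rank, moving the correct function from position 2 to position 1 raises the score far more than moving it from position 9 to position 8.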

🚀 Quick Start (Zero-Configuration Inference)

We have open-sourced both our embedding model and the pre-computed ChromaDB database on Hugging Face. You can run distribution-aware function retrieval out of the box.

1. Installation

Clone this repository and install the required dependencies:

git clone https://github.com/AMA-CMFAI/DARE.git
cd DARE
pip install -r requirements.txt

2. Run the DARE Retriever

The following script automatically downloads the DARE model and the RPKB database from Hugging Face and performs a distribution-aware search.

# retrieval.py
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
import chromadb
import torch
import os

# 1. Load DARE Model
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retriever", trust_remote_code=False)
model.to(device)

# 2. Download and Connect to RPKB Database
db_dir = "./rpkb_db"
if not os.path.exists(os.path.join(db_dir, "DARE_db")):
    print("Downloading RPKB Database from Hugging Face...")
    snapshot_download(repo_id="Stephen-SMJ/RPKB", repo_type="dataset", local_dir=db_dir, allow_patterns="DARE_db/*")

client = chromadb.PersistentClient(path=os.path.join(db_dir, "DARE_db"))
collection = client.get_collection(name="inference")

# 3. Perform Search
query = "I have a sparse matrix with high dimensionality. I need to perform PCA."
query_embedding = model.encode(query, convert_to_tensor=False).tolist()

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["documents", "metadatas"]
)

# Display Results
for rank, (doc_id, meta) in enumerate(zip(results['ids'][0], results['metadatas'][0])):
    print(f"[{rank + 1}] Package: {meta.get('package_name')} :: Function: {meta.get('function_name')}")
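
If you want to pass the hits to downstream code rather than print them, the nested dictionary returned by `collection.query` can be flattened into one record per hit. The following is a minimal sketch and not part of `retrieval.py`: the helper names `flatten_query_results` and `_column` are hypothetical, and the sample data is a hand-written stand-in with placeholder package and function names, not real RPKB output. It relies only on the `package_name` and `function_name` metadata keys used in the script above.

```python
def _column(results, key, query_index, n):
    """Return one query's column from a query result, or n Nones if it was not included."""
    column = results.get(key)
    if column is None:
        return [None] * n
    return column[query_index]


def flatten_query_results(results, query_index=0):
    """Convert the nested query result (one list per query) into flat, ranked records."""
    ids = results["ids"][query_index]
    n = len(ids)
    metadatas = _column(results, "metadatas", query_index, n)
    distances = _column(results, "distances", query_index, n)

    records = []
    for rank, (doc_id, meta, dist) in enumerate(zip(ids, metadatas, distances), start=1):
        meta = meta or {}  # tolerate hits without metadata
        records.append({
            "rank": rank,
            "id": doc_id,
            "package": meta.get("package_name"),
            "function": meta.get("function_name"),
            "distance": dist,  # None unless "distances" was requested via include=[...]
        })
    return records


# Hand-written stand-in for a query result; all values are placeholders.
mock_results = {
    "ids": [["doc-a", "doc-b"]],
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fun_a"},
        {"package_name": "pkgB", "function_name": "fun_b"},
    ]],
    "distances": None,
}

for record in flatten_query_results(mock_results):
    print(record)
```

With the real script, `flatten_query_results(results)` would take the place of the final print loop. Adding `"distances"` to the `include` list would populate the `distance` field.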

To do:

Upload the RCodingAgent code.
Upload the training and test data.
Upload the benchmark data.
Upload the evaluation code.

If you find DARE, RPKB, or RCodingAgent useful in your research, please cite our work:

@article{sun2026dare,
      title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval}, 
      author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
      year={2026},
      eprint={2603.04743},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2603.04743}, 
}
