---
base_model: sentence-transformers/all-MiniLM-L6-v2
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- retrieval
- tool-use
- llm-agent
- r-language
---

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.

It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) agents in automated data science workflows.

- **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
- **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
- **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)

## Model Details
- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense Retrieval for Tool-Augmented LLMs
- **Performance:** State-of-the-art (SoTA) on R package retrieval tasks (93.47% NDCG@10).
- **Domain:** R programming language, Data Science, Statistical Analysis functions

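For readers unfamiliar with the metric reported above, NDCG@10 rewards rankings that place the relevant function near the top of the first ten results. The following is a minimal sketch of the standard formula, not the evaluation script used for DARE; the function names and the toy relevance lists are illustrative assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    Assumes `relevances` covers every relevant item for the query.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Toy example (hypothetical): 1 marks the correct R function, 0 any other result.
hit_at_rank_1 = [1, 0, 0, 0, 0]
hit_at_rank_3 = [0, 0, 1, 0, 0]
print(ndcg_at_k(hit_at_rank_1))  # 1.0
print(ndcg_at_k(hit_at_rank_3))  # 0.5
```

A correct function at rank 3 scores 0.5 because its gain is discounted by log2(3 + 1) = 2.
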
### Usage (Sentence-Transformers)

First, install the `sentence-transformers` library:
```bash
pip install -U sentence-transformers
```

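Because DARE is a bi-encoder, queries and tool descriptions are embedded independently, so retrieval reduces to ranking candidates by vector similarity. A minimal sketch of that ranking step follows. It assumes cosine similarity as the scoring function and uses hand-written toy vectors; in practice the vectors would come from `model.encode(...)`. The helper names and the three candidate labels are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0, 0.0]
candidates = {
    "pkgA::fit_model": [0.9, 0.1, 0.0],
    "pkgB::plot_data": [0.0, 1.0, 0.0],
    "pkgC::summarise": [0.5, 0.5, 0.0],
}
ranking = rank_candidates(query_vec, candidates)
print(ranking[0][0])  # pkgA::fit_model
```
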
### Usage with RPKB (Recommended)
Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval. The snippet below requires the `huggingface_hub` and `chromadb` packages.

```python
import os

import chromadb
from huggingface_hub import snapshot_download

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*",
)

# 2. Connect to the local ChromaDB instance
client = chromadb.PersistentClient(path=os.path.join(db_path, "RPKB"))

# 3. Access the specific collection
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```

### Retrieval with DARE
```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the exact input format: Query + Data Profile
query = (
    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
    "in the data. Please set the random seed to 123 at the start. "
    "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
    "For my evaluation, please print the first value of the estimated scores (est_a) "
    "for the very first region identified."
)

# 3. Generate embedding
query_embedding = model.encode(query).tolist()

# 4. Search in the database (`collection` is the ChromaDB collection loaded from RPKB)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["metadatas", "distances", "documents"],
)

# Display Top-1 Result
top_hit = results["metadatas"][0][0]
print("Top-1 Function:", top_hit["package_name"], "::", top_hit["function_name"])
```

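The query above requests three results but prints only the first. As a sketch of how all of them could be listed, the helper below walks the nested lists that `collection.query` returns (one inner list per query). The function name `format_hits` is hypothetical, and the mock dictionary only mimics the result layout with made-up entries so that the snippet runs without the database.

```python
def format_hits(results, query_index=0):
    """Return one 'rank. package::function (distance=...)' line per retrieved hit."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    lines = []
    for rank, (meta, dist) in enumerate(zip(metadatas, distances), start=1):
        lines.append(
            f"{rank}. {meta['package_name']}::{meta['function_name']} (distance={dist:.4f})"
        )
    return lines


# Mock result with the same nesting as a ChromaDB query result (values are made up).
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "plot_data"},
    ]],
    "distances": [[0.1234, 0.5678]],
}
for line in format_hits(mock_results):
    print(line)
```

With the real database, the same helper would be called as `format_hits(results)` on the output of the retrieval step; smaller distances indicate closer matches.
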
## Citation

If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:

```bibtex
@article{sun2026dare,
  title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
  author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
  year={2026},
  eprint={2603.04743},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2603.04743},
}
```