Stephen-SMJ/DARE-R-Retriever

@@ -1,5 +1,9 @@
 ---
 language: en
 tags:
 - sentence-transformers
 - feature-extraction
@@ -8,26 +12,25 @@ tags:
 - tool-use
 - llm-agent
 - r-language
-license: apache-2.0
-base_model: sentence-transformers/all-MiniLM-L6-v2
 ---
-![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
 DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) based on **both user queries and conditional on data profile**.
 It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
 ## Model Details
 - **Architecture:** Bi-encoder (Sentence Transformer)
 - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
 - **Task:** Dense Retrieval for Tool-Augmented LLMs
-- **Performance**: SoTA on R package retrieval tasks.
 - **Domain:** R programming language, Data Science, Statistical Analysis functions
-<!-- ## 💡 Why DARE? (The Input Formatting)
-Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
 ### Usage (Sentence-Transformers)
 First, install the `sentence-transformers` library:
@@ -35,18 +38,19 @@ First, install the `sentence-transformers` library:
 pip install -U sentence-transformers
 ```
-### Usage by our RPKB (Optional and Recommended)
-Download the [R Package Knowledge Base(RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB)
 ```python
 from huggingface_hub import snapshot_download
 import chromadb
 # 1. Download the database folder from Hugging Face
 db_path = snapshot_download(
     repo_id="Stephen-SMJ/RPKB",
     repo_type="dataset",
-    allow_patterns="RPKB/*"  # Adjust this if your folder name is different
 )
 # 2. Connect to the local ChromaDB instance
@@ -58,10 +62,9 @@ collection = client.get_collection(name="inference")
 print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
 ```
-### Then, you can load the DARE model do retrieval:
 ```python
 from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
 # 1. Load the DARE model
 model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
@@ -72,9 +75,9 @@ in the data. Please set the random seed to 123 at the start. I need to filter fo
 first value of the estimated scores (est_a) for the very first region identified."
 # 3. Generate embedding
-query_embedding = model.encode(user_query).tolist()
-# 4. Search in the database with Hard Filters
 results = collection.query(
     query_embeddings=[query_embedding],
     n_results=3,
@@ -83,4 +86,20 @@ results = collection.query(
 # Display Top-1 Result
 print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
 ```

 ---
+base_model: sentence-transformers/all-MiniLM-L6-v2
 language: en
+license: apache-2.0
+library_name: sentence-transformers
+pipeline_tag: feature-extraction
 tags:
 - sentence-transformers
 - feature-extraction
 - tool-use
 - llm-agent
 - r-language
 ---
+![DARE Banner](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
 DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) based on **both user queries and conditional on data profile**.
 It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
+- **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
+- **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
+- **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)
 ## Model Details
 - **Architecture:** Bi-encoder (Sentence Transformer)
 - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
 - **Task:** Dense Retrieval for Tool-Augmented LLMs
+- **Performance**: SoTA on R package retrieval tasks (93.47% NDCG@10).
 - **Domain:** R programming language, Data Science, Statistical Analysis functions
 ### Usage (Sentence-Transformers)
 First, install the `sentence-transformers` library:
 pip install -U sentence-transformers
 ```
+### Usage with RPKB (Recommended)
+Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.
 ```python
 from huggingface_hub import snapshot_download
 import chromadb
+import os
 # 1. Download the database folder from Hugging Face
 db_path = snapshot_download(
     repo_id="Stephen-SMJ/RPKB",
     repo_type="dataset",
+    allow_patterns="RPKB/*"
 )
 # 2. Connect to the local ChromaDB instance
 print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
 ```
+### Retrieval with DARE
 ```python
 from sentence_transformers import SentenceTransformer
 # 1. Load the DARE model
 model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
 first value of the estimated scores (est_a) for the very first region identified."
 # 3. Generate embedding
+query_embedding = model.encode(query).tolist()
+# 4. Search in the database
 results = collection.query(
     query_embeddings=[query_embedding],
     n_results=3,
 # Display Top-1 Result
 print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
+```
+## Citation
+If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:
+```bibtex
+@article{sun2026dare,
+      title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
+      author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
+      year={2026},
+      eprint={2603.04743},
+      archivePrefix={arXiv},
+      primaryClass={cs.IR},
+      url={https://arxiv.org/abs/2603.04743},
+}
 ```
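
For readers of this diff, the retrieval step in the README (`collection.query(query_embeddings=[query_embedding], n_results=3, ...)`) amounts to ranking stored function embeddings against the query embedding and keeping the best few. The sketch below illustrates that idea on toy 3-dimensional vectors in plain Python. It assumes cosine similarity as the ranking score, which this diff does not confirm for the RPKB collection, and the names `cosine`, `top_k`, `docs`, and `query` are hypothetical and not part of DARE, RPKB, or ChromaDB.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k best (index, score) pairs, highest similarity first."""
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy stand-ins for stored function embeddings (assumption: real ones come
# from the model and have many more dimensions).
docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]

# Toy stand-in for the encoded user query.
query = [1.0, 0.1, 0.0]

hits = top_k(query, docs, k=3)
for idx, score in hits:
    print(f"doc {idx}: similarity {score:.4f}")
```

In the README's pipeline the vector database performs this ranking internally, so the sketch is only meant to make the `n_results=3` step concrete, not to replace it.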