Improve model card: add metadata, paper link, and project resources
#1
by nielsr HF Staff - opened

README.md CHANGED
---
base_model: sentence-transformers/all-MiniLM-L6-v2
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- tool-use
- llm-agent
- r-language
---

DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data-analysis tools (R functions) conditioned on **both the user query and the data profile**.

It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool-retrieval module for Large Language Model (LLM) agents in automated data science workflows.

- **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
- **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
- **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)
## Model Details

- **Architecture:** Bi-encoder (Sentence Transformer)
- **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
- **Task:** Dense retrieval for tool-augmented LLMs
- **Performance:** State of the art on R package retrieval tasks (93.47% NDCG@10)
- **Domain:** R programming language, data science, statistical analysis functions
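A bi-encoder like this scores each candidate independently: the query and every function description are embedded separately, and candidates are ranked by cosine similarity. The ranking step can be sketched with toy vectors (NumPy only; the embeddings and function names below are illustrative, not DARE's actual outputs):

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Rank document vectors by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                   # cosine similarity per document
    order = np.argsort(-scores)      # indices, highest similarity first
    return order, scores

# Toy 4-dimensional "embeddings" for three hypothetical R functions
docs = np.array([
    [0.9, 0.1, 0.0, 0.1],   # e.g. "stats::lm"
    [0.1, 0.8, 0.2, 0.0],   # e.g. "dplyr::filter"
    [0.0, 0.2, 0.9, 0.1],   # e.g. "DESeq2::results"
])
query = np.array([0.1, 0.9, 0.1, 0.0])  # toy query embedding

order, scores = rank_by_cosine(query, docs)
print("best match index:", order[0])  # → best match index: 1
```

In the real pipeline the document vectors are precomputed and stored in the RPKB vector database, so only the query is embedded at request time.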
### Usage (Sentence-Transformers)

First, install the `sentence-transformers` library:

```bash
pip install -U sentence-transformers
```
### Usage with RPKB (Recommended)

Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.

```python
import os

import chromadb
from huggingface_hub import snapshot_download

# 1. Download the database folder from Hugging Face
db_path = snapshot_download(
    repo_id="Stephen-SMJ/RPKB",
    repo_type="dataset",
    allow_patterns="RPKB/*"
)

# 2. Connect to the local ChromaDB instance
# (assumes the persistent DB lives in the downloaded "RPKB" subfolder)
client = chromadb.PersistentClient(path=os.path.join(db_path, "RPKB"))
collection = client.get_collection(name="inference")

print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
```
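Conditional retrieval means the query text carries not just the task but also a summary of the dataset. One way such a query might be assembled (an illustrative helper with a hypothetical serialization, not the exact format DARE was trained on):

```python
def build_conditional_query(task: str, profile: dict) -> str:
    """Concatenate a task description with a compact data-profile summary.

    Illustrative only: the exact profile serialization DARE expects may differ.
    """
    profile_text = "; ".join(f"{col}: {dtype}" for col, dtype in profile.items())
    return f"Data profile: {profile_text}. Task: {task}"

query = build_conditional_query(
    task="Identify regions in the data and report est_a for the first region.",
    profile={"chr": "character", "pos": "integer", "beta": "numeric"},
)
print(query)
```

The resulting string is what gets embedded in the retrieval step below, so the function ranking reflects both what the user asked and what the data looks like.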
|
### Retrieval with DARE

```python
from sentence_transformers import SentenceTransformer

# 1. Load the DARE model
model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

# 2. Define the query (data profile + task description)
query = "...first value of the estimated scores (est_a) for the very first region identified."

# 3. Generate embedding
query_embedding = model.encode(query).tolist()

# 4. Search in the database
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
)

# Display Top-1 Result
print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
```
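Beyond the top hit, the full top-k list can be walked the same way. The sketch below uses ChromaDB's standard result layout (parallel lists, one inner list per query embedding), shown here with a mocked `results` dict and made-up package names so it runs without the database:

```python
# Mocked result in the shape chromadb's collection.query() returns
# (metadata and distance values here are invented for illustration)
results = {
    "metadatas": [[
        {"package_name": "dmrseq", "function_name": "dmrseq"},
        {"package_name": "bsseq", "function_name": "BSmooth"},
        {"package_name": "minfi", "function_name": "bumphunter"},
    ]],
    "distances": [[0.12, 0.25, 0.31]],
}

# Walk the top-k candidates for the first (and only) query
for rank, (meta, dist) in enumerate(
        zip(results["metadatas"][0], results["distances"][0]), start=1):
    print(f"{rank}. {meta['package_name']}::{meta['function_name']} (distance={dist:.2f})")
```

Lower distance means a closer match, so the list is already ordered from best to worst candidate.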
## Citation

If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:

```bibtex
@article{sun2026dare,
  title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
  author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
  year={2026},
  eprint={2603.04743},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2603.04743},
}
```
|