Improve model card: add metadata, paper link, and project resources

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +33 -14
README.md CHANGED
@@ -1,5 +1,9 @@
1
  ---
 
2
  language: en
 
 
 
3
  tags:
4
  - sentence-transformers
5
  - feature-extraction
@@ -8,26 +12,25 @@ tags:
8
  - tool-use
9
  - llm-agent
10
  - r-language
11
- license: apache-2.0
12
- base_model: sentence-transformers/all-MiniLM-L6-v2
13
  ---
14
 
15
- ![Gemini_Generated_Image_h25dizh25dizh25d (3)](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
16
 
17
  DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
18
 
19
  It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
20
 
 
 
 
 
21
  ## Model Details
22
  - **Architecture:** Bi-encoder (Sentence Transformer)
23
  - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
24
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
25
- - **Performance**: SoTA on R package retrieval tasks.
26
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
27
 
28
- <!-- ## 💡 Why DARE? (The Input Formatting)
29
- Unlike traditional semantic search models that only take a natural language query, DARE is trained to be **distribution-conditional**. It expects a concatenated input of the user's intent AND the data profile (e.g., high-dimensional, sparse, categorical). -->
30
-
31
  ### Usage (Sentence-Transformers)
32
 
33
  First, install the `sentence-transformers` library:
@@ -35,18 +38,19 @@ First, install the `sentence-transformers` library:
35
  pip install -U sentence-transformers
36
  ```
37
 
38
- ### Usage by our RPKB (Optional and Recommended)
39
- Download the [R Package Knowledge Base(RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB)
40
 
41
  ```python
42
  from huggingface_hub import snapshot_download
43
  import chromadb
 
44
 
45
  # 1. Download the database folder from Hugging Face
46
  db_path = snapshot_download(
47
  repo_id="Stephen-SMJ/RPKB",
48
  repo_type="dataset",
49
- allow_patterns="RPKB/*" # Adjust this if your folder name is different
50
  )
51
 
52
  # 2. Connect to the local ChromaDB instance
@@ -58,10 +62,9 @@ collection = client.get_collection(name="inference")
58
  print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
59
  ```
60
 
61
- ### Then, you can load the DARE model do retrieval:
62
  ```python
63
  from sentence_transformers import SentenceTransformer
64
- from sentence_transformers.util import cos_sim
65
 
66
  # 1. Load the DARE model
67
  model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
@@ -72,9 +75,9 @@ in the data. Please set the random seed to 123 at the start. I need to filter fo
72
  first value of the estimated scores (est_a) for the very first region identified."
73
 
74
  # 3. Generate embedding
75
- query_embedding = model.encode(user_query).tolist()
76
 
77
- # 4. Search in the database with Hard Filters
78
  results = collection.query(
79
  query_embeddings=[query_embedding],
80
  n_results=3,
@@ -83,4 +86,20 @@ results = collection.query(
83
 
84
  # Display Top-1 Result
85
  print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
86
  ```
 
1
  ---
2
+ base_model: sentence-transformers/all-MiniLM-L6-v2
3
  language: en
4
+ license: apache-2.0
5
+ library_name: sentence-transformers
6
+ pipeline_tag: feature-extraction
7
  tags:
8
  - sentence-transformers
9
  - feature-extraction
 
12
  - tool-use
13
  - llm-agent
14
  - r-language
 
 
15
  ---
16
 
17
+ ![DARE Banner](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)
18
 
19
  DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model designed to retrieve statistical and data analysis tools (R functions) conditioned on **both the user query and the data profile**.
20
 
21
  It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.
22
 
23
+ - **Paper:** [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
24
+ - **Repository:** [GitHub](https://github.com/AMA-CMFAI/DARE)
25
+ - **Project Page:** [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)
26
+
27
  ## Model Details
28
  - **Architecture:** Bi-encoder (Sentence Transformer)
29
  - **Base Model:** `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
30
  - **Task:** Dense Retrieval for Tool-Augmented LLMs
31
+ - **Performance:** SoTA on R package retrieval tasks (93.47% NDCG@10).
32
  - **Domain:** R programming language, Data Science, Statistical Analysis functions
33
 
 
 
 
34
  ### Usage (Sentence-Transformers)
35
 
36
  First, install the `sentence-transformers` library:
 
38
  pip install -U sentence-transformers
39
  ```
40
 
41
+ ### Usage with RPKB (Recommended)
42
+ Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval.
43
 
44
  ```python
45
  from huggingface_hub import snapshot_download
46
  import chromadb
47
+ import os
48
 
49
  # 1. Download the database folder from Hugging Face
50
  db_path = snapshot_download(
51
  repo_id="Stephen-SMJ/RPKB",
52
  repo_type="dataset",
53
+ allow_patterns="RPKB/*"
54
  )
55
 
56
  # 2. Connect to the local ChromaDB instance
 
62
  print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
63
  ```
64
 
65
+ ### Retrieval with DARE
66
  ```python
67
  from sentence_transformers import SentenceTransformer
 
68
 
69
  # 1. Load the DARE model
70
  model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")
 
75
  first value of the estimated scores (est_a) for the very first region identified."
76
 
77
  # 3. Generate embedding
78
+ query_embedding = model.encode(user_query).tolist()
79
 
80
+ # 4. Search in the database
81
  results = collection.query(
82
  query_embeddings=[query_embedding],
83
  n_results=3,
 
86
 
87
  # Display Top-1 Result
88
  print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
89
+ ```
90
+
91
+ ## Citation
92
+
93
+ If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:
94
+
95
+ ```bibtex
96
+ @article{sun2026dare,
97
+ title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
98
+ author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
99
+ year={2026},
100
+ eprint={2603.04743},
101
+ archivePrefix={arXiv},
102
+ primaryClass={cs.IR},
103
+ url={https://arxiv.org/abs/2603.04743},
104
+ }
105
  ```
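
For readers following the retrieval snippet in the README, the ranking step can be pictured without ChromaDB or the model. The sketch below is an illustration only, not the RPKB implementation: it applies an exact-match metadata filter, ranks the remaining toy vectors by cosine similarity, and formats the best hit as `package::function`. The `retrieve` and `format_top1` helpers, the `task` metadata field, and the 3-dimensional vectors are hypothetical; only the `package_name` and `function_name` keys come from the README (the base model `all-MiniLM-L6-v2` produces 384-dimensional embeddings).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, records, where=None, n_results=3):
    """Filter records by exact metadata match, then rank by cosine similarity."""
    where = where or {}
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:n_results]


def format_top1(hits):
    """Return 'package::function' for the best hit, or None if nothing matched."""
    if not hits:
        return None
    meta = hits[0]["metadata"]
    return f'{meta["package_name"]}::{meta["function_name"]}'


# Toy records: the vectors and the `task` field are made up for illustration.
records = [
    {"embedding": [0.9, 0.1, 0.0],
     "metadata": {"package_name": "stats", "function_name": "lm", "task": "regression"}},
    {"embedding": [0.2, 0.9, 0.1],
     "metadata": {"package_name": "stats", "function_name": "t.test", "task": "testing"}},
    {"embedding": [0.7, 0.3, 0.2],
     "metadata": {"package_name": "stats", "function_name": "glm", "task": "regression"}},
]

query_vec = [1.0, 0.0, 0.0]  # stands in for model.encode(...) output
hits = retrieve(query_vec, records, where={"task": "regression"}, n_results=2)
print("Top-1 Function:", format_top1(hits))
```

Filtering before ranking means a record that fails the metadata match can never be returned, however close its embedding is to the query. `format_top1` also returns `None` on an empty result instead of raising, which the direct `results["metadatas"][0][0]` indexing in the README would not do.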