	---
	base_model: sentence-transformers/all-MiniLM-L6-v2
	language: en
	license: apache-2.0
	library_name: sentence-transformers
	pipeline_tag: feature-extraction
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- retrieval
	- tool-use
	- llm-agent
	- r-language
	---

	![DARE Banner](https://cdn-uploads.huggingface.co/production/uploads/64c0e071e9263c783d548178/xXKYApaqL9hZyfSeSN3zP.png)

	DARE (Distribution-Aware Retrieval Embedding) is a specialized bi-encoder model that retrieves statistical and data analysis tools (R functions) conditioned on both the user query and the data profile.

	It is fine-tuned from `sentence-transformers/all-MiniLM-L6-v2` to serve as a high-precision tool retrieval module for Large Language Model (LLM) Agents in automated data science workflows.

	- Paper: [DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval](https://huggingface.co/papers/2603.04743)
	- Repository: [GitHub](https://github.com/AMA-CMFAI/DARE)
	- Project Page: [DARE Webpage](https://ama-cmfai.github.io/DARE_webpage/)

	## Model Details
	- Architecture: Bi-encoder (Sentence Transformer)
	- Base Model: `sentence-transformers/all-MiniLM-L6-v2` (22.7M parameters)
	- Task: Dense Retrieval for Tool-Augmented LLMs
	- Performance: SoTA on R package retrieval tasks (93.47% NDCG@10).
	- Domain: R programming language, Data Science, Statistical Analysis functions
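For readers unfamiliar with the NDCG@10 metric quoted above, here is a minimal sketch of how it is commonly computed. It assumes linear gain and that the ideal ranking is built from the same retrieved list; the exact evaluation protocol behind the reported number is not described in this card, so treat this as an illustration rather than the authors' evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # The result at 0-based position `rank` is discounted by log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0.0:
        return 0.0  # no relevant item was retrieved at all
    return dcg_at_k(relevances, k) / ideal


# Binary relevance, one correct R function per query (an assumption for this example).
hit_at_rank_1 = ndcg_at_k([1, 0, 0])  # correct function ranked first
hit_at_rank_2 = ndcg_at_k([0, 1, 0])  # correct function ranked second
print(hit_at_rank_1, hit_at_rank_2)
```

With a single relevant item, the score depends only on the rank at which that item appears, so a corpus-level NDCG@10 is the average of these per-query values.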

	### Usage (Sentence-Transformers)

	First, install the `sentence-transformers` library:
	```bash
	pip install -U sentence-transformers
	```
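Once the library is installed, retrieval with a bi-encoder comes down to embedding the query and each candidate separately, then ranking candidates by vector similarity. The sketch below illustrates that ranking step with cosine similarity, which is an assumption here since this card does not state the similarity function. `toy_encode` and its keyword list are hypothetical stand-ins so the example runs offline; they are not part of DARE.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(
    encode: Callable[[str], Vector], query: str, candidates: Sequence[str]
) -> List[Tuple[str, float]]:
    """Embed query and candidates independently, then sort by similarity (best first)."""
    query_vec = encode(query)
    scored = [(text, cosine_similarity(query_vec, encode(text))) for text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in encoder: counts occurrences of a few keywords.
KEYWORDS = ("regression", "cluster", "survival")


def toy_encode(text: str) -> List[float]:
    lowered = text.lower()
    return [float(lowered.count(word)) for word in KEYWORDS]


ranked = rank_candidates(
    toy_encode,
    "fit a regression model",
    ["cluster the samples", "linear regression fit", "survival curve"],
)
print(ranked[0][0])
```

To rank with the real model instead, load it with `SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")` and pass `lambda text: model.encode(text).tolist()` as the `encode` argument.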

	### Usage with RPKB (Recommended)
	Download the [R Package Knowledge Base (RPKB)](https://huggingface.co/datasets/Stephen-SMJ/RPKB) to perform conditional retrieval. The snippet below requires the `chromadb` and `huggingface_hub` packages.

	```python
	import os

	import chromadb
	from huggingface_hub import snapshot_download

	# 1. Download the database folder from Hugging Face
	db_path = snapshot_download(
	    repo_id="Stephen-SMJ/RPKB",
	    repo_type="dataset",
	    allow_patterns="RPKB/*",
	)

	# 2. Connect to the local ChromaDB instance
	client = chromadb.PersistentClient(path=os.path.join(db_path, "RPKB"))

	# 3. Access the specific collection
	collection = client.get_collection(name="inference")

	print(f"✅ Loaded {collection.count()} R functions ready for conditional retrieval!")
	```

	### Retrieval with DARE
	```python
	from sentence_transformers import SentenceTransformer

	# 1. Load the DARE model
	model = SentenceTransformer("Stephen-SMJ/DARE-R-Retrieval")

	# 2. Define the exact input format: Query + Data Profile
	# (adjacent string literals are concatenated into one query string)
	query = (
	    "I have a high-dimensional genomic dataset named hidra_ex_1_2000.csv in my environment. "
	    "I need to identify driver elements by estimating regulatory scores based on the counts provided "
	    "in the data. Please set the random seed to 123 at the start. "
	    "I need to filter for fragment lengths between 150 and 600 bp and use a DNA count filter of 5. "
	    "For my evaluation, please print the first value of the estimated scores (est_a) "
	    "for the very first region identified."
	)

	# 3. Generate embedding
	query_embedding = model.encode(query).tolist()

	# 4. Search in the database
	results = collection.query(
	    query_embeddings=[query_embedding],
	    n_results=3,
	    include=["metadatas", "distances", "documents"],
	)

	# Display Top-1 Result
	print("Top-1 Function:", results["metadatas"][0][0]["package_name"], "::", results["metadatas"][0][0]["function_name"])
	```
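The snippet above prints only the top-1 match, although the query asks for three results. As a rough sketch, a helper like the hypothetical `top_hits` below could list every returned function with its distance. It relies on ChromaDB returning one inner list per query, and on the `package_name` and `function_name` metadata keys used above. The mock dictionary and its package names are made up so the example runs without the database.

```python
from typing import Any, Dict, List, Tuple


def top_hits(results: Dict[str, Any], query_index: int = 0) -> List[Tuple[str, float]]:
    """Return (package::function, distance) pairs for one query, in returned order."""
    metadatas = results["metadatas"][query_index]
    distances = results["distances"][query_index]
    return [
        (f'{meta["package_name"]}::{meta["function_name"]}', float(dist))
        for meta, dist in zip(metadatas, distances)
    ]


# Hypothetical mock shaped like a ChromaDB query result for a single query.
mock_results = {
    "metadatas": [[
        {"package_name": "pkgA", "function_name": "fit_model"},
        {"package_name": "pkgB", "function_name": "plot_scores"},
    ]],
    "distances": [[0.12, 0.34]],
}

hits = top_hits(mock_results)
for name, distance in hits:
    print(f"{name}  (distance={distance:.2f})")
```

With the real database, call `top_hits(results)` on the output of `collection.query(...)`. Smaller distances indicate closer matches; the exact distance metric depends on how the collection was built.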

	## Citation

	If you find DARE, RPKB, or RCodingAgent useful in your research, please cite:

	```bibtex
	@article{sun2026dare,
	  title={DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval},
	  author={Maojun Sun and Yue Wu and Yifei Xie and Ruijian Han and Binyan Jiang and Defeng Sun and Yancheng Yuan and Jian Huang},
	  year={2026},
	  eprint={2603.04743},
	  archivePrefix={arXiv},
	  primaryClass={cs.IR},
	  url={https://arxiv.org/abs/2603.04743},
	}
	```