AI & ML interests
NLP, LLM Alignment, Preference Data generation, Active Learning
This repository accompanies the paper: ActiveUltraFeedback — arXiv:2603.09692.
ActiveUltraFeedback is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages uncertainty quantification and active learning to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.
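As a toy illustration of the core idea (not the paper's exact method), epistemic uncertainty can be estimated from the disagreement of an ensemble of reward heads; all scores and sizes below are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "ensemble" of K reward heads scoring the same n responses.
# In practice these would be learned heads; here the scores are synthetic.
K, n_responses = 8, 5
head_scores = rng.normal(loc=0.0, scale=1.0, size=(K, n_responses))

reward_mean = head_scores.mean(axis=0)  # predicted reward per response
reward_std = head_scores.std(axis=0)    # ensemble disagreement ~ uncertainty

# An active-learning strategy prioritizes annotating high-uncertainty items.
most_informative = int(np.argmax(reward_std))
```

Responses where the heads agree contribute little new signal; annotating where they disagree most is what keeps the labeling budget small.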
Repository Purpose: This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
🏆 Benchmark Results
Our experiments demonstrate that active learning strategies (specifically DRTS and DeltaUCB) consistently outperform the original ultrafeedback_binarized_cleaned and Tulu 3 preference-mixture datasets.
1. UltraFeedback Prompts (Only)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| Ours | |||||||
| DRTS | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| DeltaUCB | +0.423 | +0.553 | +0.132 | +0.080 | +0.435 | +0.408 | +0.339 |
| DTS | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| InfoMax | +0.463 | +0.287 | +0.096 | +0.129 | +0.509 | +0.296 | +0.297 |
| MaxMinLCB | +0.390 | -0.025 | +0.244 | +0.070 | +0.453 | +0.250 | +0.230 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| Ours | |||||
| DRTS | +0.055 | +0.050 | +0.143 | +0.259 | +0.127 |
| DeltaUCB | +0.065 | +0.039 | +0.113 | +0.254 | +0.117 |
| DTS | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| InfoMax | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| MaxMinLCB | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |
2. Skywork Prompts (Only)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | +0.462 | +0.172 | +0.055 | +0.531 | +0.319 | +0.325 |
| DeltaQwen | +0.238 | -0.023 | +0.011 | +0.108 | +0.306 | +0.132 | +0.129 |
| Ours | |||||||
| DRTS | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| DeltaUCB | +0.370 | +0.319 | +0.194 | +0.033 | +0.346 | +0.310 | +0.262 |
| DTS | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| InfoMax | +0.429 | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| MaxMinLCB | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | +0.054 | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | +0.058 | +0.002 | +0.152 | +0.384 | +0.149 |
| Ours | |||||
| DRTS | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| DeltaUCB | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| DTS | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| InfoMax | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| MaxMinLCB | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |
3. Skywork + UltraFeedback (Combined)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.455 | +0.216 | +0.205 | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | +0.467 | +0.194 | +0.083 | +0.412 | +0.380 | +0.325 |
| DeltaQwen | +0.242 | -0.007 | +0.009 | +0.151 | +0.279 | +0.241 | +0.153 |
| Ours | |||||||
| DRTS | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| DeltaUCB | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| DTS | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| InfoMax | +0.476 | +0.383 | +0.153 | +0.042 | +0.546 | +0.199 | +0.300 |
| MaxMinLCB | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| Ours | |||||
| DRTS | +0.055 | +0.015 | +0.108 | +0.177 | +0.088 |
| DeltaUCB | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| DTS | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| InfoMax | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| MaxMinLCB | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |
4. Tulu 3 Prompts
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.465 | +0.465 | +0.213 | +0.077 | +0.584 | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | +0.386 | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| Ours | |||||||
| DRTS | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| DeltaUCB | +0.455 | +0.537 | +0.189 | +0.148 | +0.580 | +0.390 | +0.383 |
| DTS | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| InfoMax | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| MaxMinLCB | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.055 | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | +0.067 | +0.188 | +0.279 | +0.138 |
| DeltaQwen | +0.049 | +0.034 | +0.124 | +0.291 | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| Ours | |||||
| DRTS | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| DeltaUCB | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| DTS | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| InfoMax | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| MaxMinLCB | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
🔁 Pipeline Overview (How it works)
Given a batch of prompts, the following steps are executed:
- Response Generation: For each prompt in the batch, query multiple LLMs, each producing one candidate response.
- Uncertainty-Aware Reward Prediction: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. (Note: this model is randomly initialized at the start of the pipeline.)
- Pair Selection (Acquisition Function): Using an acquisition function (e.g., Double Thompson Sampling), select which two responses should be annotated, based on their predicted rewards and uncertainties.
- Oracle Annotation: Annotate which response in the selected pair is preferred (via LLM or human).
- Reward Model Training: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
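The five steps above can be sketched as a toy loop. Every function here is a hypothetical stand-in (random rewards, string responses, a mean-based oracle), and the pair selection is a simplified Thompson-sampling rule, not the paper's exact acquisition functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_responses(prompt, n_models=4):
    # Stand-in for calling several source LLMs (step 1).
    return [f"{prompt} :: response from model {i}" for i in range(n_models)]

def predict_rewards(responses):
    # Stand-in for the uncertainty-aware reward model (step 2):
    # returns a predicted mean reward and an uncertainty per response.
    n = len(responses)
    return rng.normal(size=n), np.abs(rng.normal(0.5, 0.1, size=n))

def select_pair(means, stds):
    # Simplified Thompson-style acquisition (step 3): draw one posterior
    # sample, take its best arm, then the best remaining arm of a second draw.
    first = int(np.argmax(rng.normal(means, stds)))
    second_draw = rng.normal(means, stds)
    second_draw[first] = -np.inf  # force a distinct second index
    return first, int(np.argmax(second_draw))

def annotate(i, j, means):
    # Stand-in oracle (step 4): the higher predicted mean "wins" here;
    # in the real pipeline an LLM judge or a human decides.
    return (i, j) if means[i] >= means[j] else (j, i)

def active_loop(prompts):
    preferences = []
    for prompt in prompts:
        responses = generate_responses(prompt)
        means, stds = predict_rewards(responses)
        i, j = select_pair(means, stds)
        chosen, rejected = annotate(i, j, means)
        preferences.append({
            "prompt": prompt,
            "chosen": responses[chosen],
            "rejected": responses[rejected],
        })
        # Step 5 (omitted): retrain the reward model on `preferences`.
    return preferences

prefs = active_loop(["What is active learning?"])
```

The collected `prompt`/`chosen`/`rejected` records are exactly the shape of data that DPO-style training consumes; in the real pipeline the reward-model retraining step closes the loop before the next batch.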
🤖 Source Models & Licenses
| Model | Parameters (B) | License |
|---|---|---|
| Qwen | | |
| Qwen/Qwen2.5-0.5B-Instruct | 0.5 | Apache 2.0 |
| Qwen/Qwen2.5-72B-Instruct | 72 | Qwen |
| Qwen/Qwen3-0.6B | 0.6 | Apache 2.0 |
| Qwen/Qwen3-1.7B | 1.7 | Apache 2.0 |
| Qwen/Qwen3-14B | 14 | Apache 2.0 |
| Qwen/Qwen3-30B-A3B | 30 | Apache 2.0 |
| Qwen/Qwen3-32B | 32 | Apache 2.0 |
| Qwen/Qwen3-235B-A22B | 234 | Apache 2.0 |
| Llama | | |
| meta-llama/Llama-3.1-8B-Instruct | 8 | Llama 3 |
| meta-llama/Llama-3.2-1B-Instruct | 1 | Llama 3 |
| meta-llama/Llama-3.2-3B-Instruct | 3 | Llama 3 |
| meta-llama/Llama-3.3-70B-Instruct | 70 | Llama 3 |
| Microsoft | | |
| microsoft/Phi-4-mini-instruct | 4 | MIT |
| microsoft/phi-4 | 14 | MIT |
| Mistral | | |
| mistralai/Mistral-Small-24B-Instruct-2501 | 23 | Apache 2.0 |
| mistralai/Mistral-Large-Instruct-2411 | 123 | MRL |
| NVIDIA | | |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 70 | Llama 3 |
| nvidia/Llama-3_3-Nemotron-Super-49B-v1 | 49 | NVIDIA Open Model |
| nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 | 253 | NVIDIA Open Model |
| Gemma | | |
| google/gemma-3-1b-it | 1 | Gemma |
| google/gemma-3-4b-it | 4 | Gemma |
| google/gemma-3-12b-it | 12 | Gemma |
| google/gemma-3-27b-it | 27 | Gemma |
| AllenAI | | |
| allenai/OLMo-2-0325-32B-Instruct | 32 | Apache 2.0 |
| allenai/Llama-3.1-Tulu-3-70B | 70 | Llama 3 |
| allenai/Llama-3.1-Tulu-3-405B | 405 | Llama 3 |
| Other | | |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7 | Apache 2.0 |
| moonshotai/Moonlight-16B-A3B-Instruct | 16 | MIT |
| CohereLabs/c4ai-command-a-03-2025 | 111 | CC BY-NC 4.0 |
| deepseek-ai/DeepSeek-V3 | 671 | DeepSeek |
License: MIT
Disclaimer: The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.
Citation
If you use our work or the ActiveUltraFeedback datasets or models, please cite us:
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
year={2026},
eprint={2603.09692},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.09692},
}