AI & ML interests
NLP, LLM Alignment, Preference Data generation, Active Learning
This repository accompanies the paper: ActiveUltraFeedback — arXiv:2603.09692.
ActiveUltraFeedback is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages uncertainty quantification and active learning to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.
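As a toy illustration of the core idea (not the paper's exact method), epistemic uncertainty can be estimated from the disagreement of an ensemble of reward heads; all scores and sizes below are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "ensemble" of K reward heads scoring the same n responses.
# In practice these would be learned heads; here the scores are synthetic.
K, n_responses = 8, 5
head_scores = rng.normal(loc=0.0, scale=1.0, size=(K, n_responses))

reward_mean = head_scores.mean(axis=0)  # predicted reward per response
reward_std = head_scores.std(axis=0)    # ensemble disagreement ~ uncertainty

# An active-learning strategy prioritizes annotating high-uncertainty items.
most_informative = int(np.argmax(reward_std))
```

Responses where the heads agree contribute little new signal; annotating where they disagree most is what keeps the labeling budget small.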
Repository Purpose: This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
🏆 Benchmark Results
Our experiments demonstrate that active learning strategies (specifically DRTS and DeltaUCB) consistently outperform the original ultrafeedback_binarized_cleaned and Tulu 3 preference-mixture datasets.
1. UltraFeedback Prompts (Only)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| Ours | |||||||
| DRTS | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| DeltaUCB | +0.423 | +0.553 | +0.132 | +0.080 | +0.435 | +0.408 | +0.339 |
| DTS | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| InfoMax | +0.463 | +0.287 | +0.096 | +0.129 | +0.509 | +0.296 | +0.297 |
| MaxMinLCB | +0.390 | -0.025 | +0.244 | +0.070 | +0.453 | +0.250 | +0.230 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| Ours | |||||
| DRTS | +0.055 | +0.050 | +0.143 | +0.259 | +0.127 |
| DeltaUCB | +0.065 | +0.039 | +0.113 | +0.254 | +0.117 |
| DTS | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| InfoMax | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| MaxMinLCB | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |
2. Skywork Prompts (Only)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | +0.462 | +0.172 | +0.055 | +0.531 | +0.319 | +0.325 |
| DeltaQwen | +0.238 | -0.023 | +0.011 | +0.108 | +0.306 | +0.132 | +0.129 |
| Ours | |||||||
| DRTS | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| DeltaUCB | +0.370 | +0.319 | +0.194 | +0.033 | +0.346 | +0.310 | +0.262 |
| DTS | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| InfoMax | +0.429 | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| MaxMinLCB | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | +0.054 | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | +0.058 | +0.002 | +0.152 | +0.384 | +0.149 |
| Ours | |||||
| DRTS | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| DeltaUCB | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| DTS | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| InfoMax | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| MaxMinLCB | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |
3. Skywork + UltraFeedback (Combined)
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.455 | +0.216 | +0.205 | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | +0.467 | +0.194 | +0.083 | +0.412 | +0.380 | +0.325 |
| DeltaQwen | +0.242 | -0.007 | +0.009 | +0.151 | +0.279 | +0.241 | +0.153 |
| Ours | |||||||
| DRTS | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| DeltaUCB | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| DTS | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| InfoMax | +0.476 | +0.383 | +0.153 | +0.042 | +0.546 | +0.199 | +0.300 |
| MaxMinLCB | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| Ours | |||||
| DRTS | +0.055 | +0.015 | +0.108 | +0.177 | +0.088 |
| DeltaUCB | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| DTS | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| InfoMax | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| MaxMinLCB | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |
4. Tulu 3 Prompts
Reward Model (RM) Performance
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| Baselines | |||||||
| Random | +0.465 | +0.465 | +0.213 | +0.077 | +0.584 | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | +0.386 | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| Ours | |||||||
| DRTS | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| DeltaUCB | +0.455 | +0.537 | +0.189 | +0.148 | +0.580 | +0.390 | +0.383 |
| DTS | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| InfoMax | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| MaxMinLCB | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |
DPO Performance
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | Mean |
|---|---|---|---|---|---|
| Baselines | |||||
| Random | +0.055 | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | +0.067 | +0.188 | +0.279 | +0.138 |
| DeltaQwen | +0.049 | +0.034 | +0.124 | +0.291 | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| Ours | |||||
| DRTS | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| DeltaUCB | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| DTS | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| InfoMax | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| MaxMinLCB | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
🔁 Pipeline Overview (How it works)
Given a batch of prompts, the following steps are executed:
- Response Generation: For each prompt in the batch, query multiple LLMs, each producing one candidate response.
- Uncertainty-Aware Reward Prediction: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. (Note: this model is randomly initialized at the start of the pipeline.)
- Pair Selection (Acquisition Function): Using an acquisition function (e.g., Double Thompson Sampling), select which two responses should be annotated, based on their predicted rewards and uncertainties.
- Oracle Annotation: Annotate which response in the selected pair is preferred (via LLM or human).
- Reward Model Training: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
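The five steps above can be sketched as a toy loop. Every function here is a hypothetical stand-in (random rewards, string responses, a mean-based oracle), and the pair selection is a simplified Thompson-sampling rule, not the paper's exact acquisition functions:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_responses(prompt, n_models=4):
    # Stand-in for calling several source LLMs (step 1).
    return [f"{prompt} :: response from model {i}" for i in range(n_models)]

def predict_rewards(responses):
    # Stand-in for the uncertainty-aware reward model (step 2):
    # returns a predicted mean reward and an uncertainty per response.
    n = len(responses)
    return rng.normal(size=n), np.abs(rng.normal(0.5, 0.1, size=n))

def select_pair(means, stds):
    # Simplified Thompson-style acquisition (step 3): draw one posterior
    # sample, take its best arm, then the best remaining arm of a second draw.
    first = int(np.argmax(rng.normal(means, stds)))
    second_draw = rng.normal(means, stds)
    second_draw[first] = -np.inf  # force a distinct second index
    return first, int(np.argmax(second_draw))

def annotate(i, j, means):
    # Stand-in oracle (step 4): the higher predicted mean "wins" here;
    # in the real pipeline an LLM judge or a human decides.
    return (i, j) if means[i] >= means[j] else (j, i)

def active_loop(prompts):
    preferences = []
    for prompt in prompts:
        responses = generate_responses(prompt)
        means, stds = predict_rewards(responses)
        i, j = select_pair(means, stds)
        chosen, rejected = annotate(i, j, means)
        preferences.append({
            "prompt": prompt,
            "chosen": responses[chosen],
            "rejected": responses[rejected],
        })
        # Step 5 (omitted): retrain the reward model on `preferences`.
    return preferences

prefs = active_loop(["What is active learning?"])
```

The collected `prompt`/`chosen`/`rejected` records are exactly the shape of data that DPO-style training consumes; in the real pipeline the reward-model retraining step closes the loop before the next batch.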
🤖 Source Models & Licenses
| Model | Parameters (B) | License |
|---|---|---|
| Qwen | | |
| Qwen/Qwen2.5-0.5B-Instruct | 0.5 | Apache 2.0 |
| Qwen/Qwen2.5-72B-Instruct | 72 | Qwen |
| Qwen/Qwen3-0.6B | 0.6 | Apache 2.0 |
| Qwen/Qwen3-1.7B | 1.7 | Apache 2.0 |
| Qwen/Qwen3-14B | 14 | Apache 2.0 |
| Qwen/Qwen3-30B-A3B | 30 | Apache 2.0 |
| Qwen/Qwen3-32B | 32 | Apache 2.0 |
| Qwen/Qwen3-235B-A22B | 234 | Apache 2.0 |
| Llama | | |
| meta-llama/Llama-3.1-8B-Instruct | 8 | Llama 3 |
| meta-llama/Llama-3.2-1B-Instruct | 1 | Llama 3 |
| meta-llama/Llama-3.2-3B-Instruct | 3 | Llama 3 |
| meta-llama/Llama-3.3-70B-Instruct | 70 | Llama 3 |
| Microsoft | | |
| microsoft/Phi-4-mini-instruct | 4 | MIT |
| microsoft/phi-4 | 14 | MIT |
| Mistral | | |
| mistralai/Mistral-Small-24B-Instruct-2501 | 23 | Apache 2.0 |
| mistralai/Mistral-Large-Instruct-2411 | 123 | MRL |
| NVIDIA | | |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 70 | Llama 3 |
| nvidia/Llama-3_3-Nemotron-Super-49B-v1 | 49 | NVIDIA Open Model |
| nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 | 253 | NVIDIA Open Model |
| Gemma | | |
| google/gemma-3-1b-it | 1 | Gemma |
| google/gemma-3-4b-it | 4 | Gemma |
| google/gemma-3-12b-it | 12 | Gemma |
| google/gemma-3-27b-it | 27 | Gemma |
| AllenAI | | |
| allenai/OLMo-2-0325-32B-Instruct | 32 | Apache 2.0 |
| allenai/Llama-3.1-Tulu-3-70B | 70 | Llama 3 |
| allenai/Llama-3.1-Tulu-3-405B | 405 | Llama 3 |
| Other | | |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7 | Apache 2.0 |
| moonshotai/Moonlight-16B-A3B-Instruct | 16 | MIT |
| CohereLabs/c4ai-command-a-03-2025 | 111 | CC BY-NC 4.0 |
| deepseek-ai/DeepSeek-V3 | 671 | DeepSeek |
License: MIT
Disclaimer: The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.
Citation
If you use our work or the ActiveUltraFeedback datasets or models, please cite us:
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
year={2026},
eprint={2603.09692},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.09692},
}