AI & ML interests

NLP, LLM Alignment, Preference Data generation, Active Learning

This repo accompanies the paper ActiveUltraFeedback (arXiv:2603.09692).

ActiveUltraFeedback is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages uncertainty quantification and active learning to annotate only the most informative response pairs, drastically reducing annotation costs while outperforming standard baselines.
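As a rough illustration of the core idea (not the paper's actual implementation; the function name and thresholds below are invented for this sketch), spending an annotation budget on the samples with the highest predicted uncertainty looks like:

```python
import numpy as np

# Hypothetical sketch: route the annotation budget to the prompts whose
# reward estimates are most uncertain, i.e. the most informative samples.
def select_most_informative(uncertainties: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` samples with the highest uncertainty."""
    return np.argsort(uncertainties)[::-1][:budget]

uncertainties = np.array([0.10, 0.85, 0.30, 0.92, 0.05])
chosen = select_most_informative(uncertainties, budget=2)
print(chosen)  # indices of the two most uncertain samples
```

With a budget of 2, only the two highest-uncertainty samples are sent for annotation; the remaining three are skipped, which is where the cost savings come from.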

Repository Purpose: This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.

🏆 Benchmark Results

Our experiments demonstrate that active learning strategies (specifically DRTS and DeltaUCB) consistently outperform the original ultrafeedback_binarized_cleaned and Tulu 3 preference-mixture datasets.

1. UltraFeedback Prompts (Only)

Reward Model (RM) Performance

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |  |  |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| **Ours** |  |  |  |  |  |  |  |
| DRTS | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| DeltaUCB | +0.423 | +0.553 | +0.132 | +0.080 | +0.435 | +0.408 | +0.339 |
| DTS | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| InfoMax | +0.463 | +0.287 | +0.096 | +0.129 | +0.509 | +0.296 | +0.297 |
| MaxMinLCB | +0.390 | -0.025 | +0.244 | +0.070 | +0.453 | +0.250 | +0.230 |

DPO Performance

| Method | GSM8K | IFEval | TruthfulQA | AlpacaEval | Mean |
|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| **Ours** |  |  |  |  |  |
| DRTS | +0.055 | +0.050 | +0.143 | +0.259 | +0.127 |
| DeltaUCB | +0.065 | +0.039 | +0.113 | +0.254 | +0.117 |
| DTS | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| InfoMax | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| MaxMinLCB | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

2. Skywork Prompts (Only)

Reward Model (RM) Performance

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |  |  |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | +0.462 | +0.172 | +0.055 | +0.531 | +0.319 | +0.325 |
| DeltaQwen | +0.238 | -0.023 | +0.011 | +0.108 | +0.306 | +0.132 | +0.129 |
| **Ours** |  |  |  |  |  |  |  |
| DRTS | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| DeltaUCB | +0.370 | +0.319 | +0.194 | +0.033 | +0.346 | +0.310 | +0.262 |
| DTS | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| InfoMax | +0.429 | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| MaxMinLCB | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

DPO Performance

| Method | GSM8K | IFEval | TruthfulQA | AlpacaEval | Mean |
|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | +0.054 | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | +0.058 | +0.002 | +0.152 | +0.384 | +0.149 |
| **Ours** |  |  |  |  |  |
| DRTS | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| DeltaUCB | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| DTS | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| InfoMax | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| MaxMinLCB | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

3. Skywork + UltraFeedback (Combined)

Reward Model (RM) Performance

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |  |  |
| Random | +0.455 | +0.216 | +0.205 | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | +0.467 | +0.194 | +0.083 | +0.412 | +0.380 | +0.325 |
| DeltaQwen | +0.242 | -0.007 | +0.009 | +0.151 | +0.279 | +0.241 | +0.153 |
| **Ours** |  |  |  |  |  |  |  |
| DRTS | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| DeltaUCB | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| DTS | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| InfoMax | +0.476 | +0.383 | +0.153 | +0.042 | +0.546 | +0.199 | +0.300 |
| MaxMinLCB | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

DPO Performance

| Method | GSM8K | IFEval | TruthfulQA | AlpacaEval | Mean |
|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | +0.150 | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | +0.137 |
| **Ours** |  |  |  |  |  |
| DRTS | +0.055 | +0.015 | +0.108 | +0.177 | +0.088 |
| DeltaUCB | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| DTS | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| InfoMax | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| MaxMinLCB | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

4. Tulu 3 Prompts

Reward Model (RM) Performance

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | Mean |
|---|---|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |  |  |
| Random | +0.465 | +0.465 | +0.213 | +0.077 | +0.584 | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | +0.386 | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| **Ours** |  |  |  |  |  |  |  |
| DRTS | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| DeltaUCB | +0.455 | +0.537 | +0.189 | +0.148 | +0.580 | +0.390 | +0.383 |
| DTS | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| InfoMax | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| MaxMinLCB | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

DPO Performance

| Method | GSM8K | IFEval | TruthfulQA | AlpacaEval | Mean |
|---|---|---|---|---|---|
| **Baselines** |  |  |  |  |  |
| Random | +0.055 | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | +0.067 | +0.188 | +0.279 | +0.138 |
| DeltaQwen | +0.049 | +0.034 | +0.124 | +0.291 | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| **Ours** |  |  |  |  |  |
| DRTS | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| DeltaUCB | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| DTS | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| InfoMax | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| MaxMinLCB | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |

🔁 Pipeline Overview (How it works)

Given a batch of prompts, the following steps are executed:

  1. Response Generation: For each prompt in the batch, call multiple LLMs to each generate a response.
  2. Uncertainty-Aware Reward Prediction: An uncertainty-aware reward model predicts the reward and associated uncertainty of the responses. (Note: This model is initialized randomly at the start).
  3. Pair Selection (Acquisition Function): Select which two responses should get annotated based on rewards and uncertainties using an acquisition function (e.g., Double Thompson Sampling).
  4. Oracle Annotation: Annotate which response in the selected pair is preferred (via LLM or human).
  5. Reward Model Training: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
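The steps above can be sketched as a minimal loop. This is an illustrative stand-in, not the actual ActiveUltraFeedback code: every function here is a stub, and `select_pair` uses a crude greedy rule rather than a real acquisition function such as Double Thompson Sampling.

```python
import random

def generate_responses(prompt, n_models=4):
    # Step 1: each of several LLMs produces one response (stubbed).
    return [f"{prompt}::response_{i}" for i in range(n_models)]

def predict_reward_and_uncertainty(model, response):
    # Step 2: uncertainty-aware reward model (stubbed with random values).
    return random.random(), random.random()

def select_pair(stats):
    # Step 3: acquisition function. Here: the highest-reward response paired
    # with its most uncertain competitor -- a placeholder for the real strategy.
    first = max(range(len(stats)), key=lambda k: stats[k][0])
    second = max((k for k in range(len(stats)) if k != first),
                 key=lambda k: stats[k][1])
    return first, second

def annotate(pair):
    # Step 4: oracle (LLM or human) labels the preferred response (stubbed).
    return random.choice(pair)

def train_reward_model(model, preference):
    # Step 5: update the reward model on the new preference datum (stubbed).
    return model

reward_model = object()  # randomly initialized at the start
for prompt in ["prompt_a", "prompt_b"]:
    responses = generate_responses(prompt)
    stats = [predict_reward_and_uncertainty(reward_model, r) for r in responses]
    i, j = select_pair(stats)
    winner = annotate((i, j))
    reward_model = train_reward_model(
        reward_model, (prompt, responses[i], responses[j], winner))
```

Each pass through the loop adds exactly one annotated preference pair per prompt, so the annotation cost grows with the number of iterations rather than with the total number of generated responses.
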
🤖 Source Models & Licenses
| Model | Parameters (B) | License |
|---|---|---|
| **Qwen** |  |  |
| Qwen/Qwen2.5-0.5B-Instruct | 0.5 | Apache 2.0 |
| Qwen/Qwen2.5-72B-Instruct | 72 | Qwen |
| Qwen/Qwen3-0.6B | 0.6 | Apache 2.0 |
| Qwen/Qwen3-1.7B | 1.7 | Apache 2.0 |
| Qwen/Qwen3-14B | 14 | Apache 2.0 |
| Qwen/Qwen3-30B-A3B | 30 | Apache 2.0 |
| Qwen/Qwen3-32B | 32 | Apache 2.0 |
| Qwen/Qwen3-235B-A22B | 235 | Apache 2.0 |
| **Llama** |  |  |
| meta-llama/Llama-3.1-8B-Instruct | 8 | Llama 3 |
| meta-llama/Llama-3.2-1B-Instruct | 1 | Llama 3 |
| meta-llama/Llama-3.2-3B-Instruct | 3 | Llama 3 |
| meta-llama/Llama-3.3-70B-Instruct | 70 | Llama 3 |
| **Microsoft** |  |  |
| microsoft/Phi-4-mini-instruct | 4 | MIT |
| microsoft/phi-4 | 14 | MIT |
| **Mistral** |  |  |
| mistralai/Mistral-Small-24B-Instruct-2501 | 23 | Apache 2.0 |
| mistralai/Mistral-Large-Instruct-2411 | 123 | MRL |
| **NVIDIA** |  |  |
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | 70 | Llama 3 |
| nvidia/Llama-3_3-Nemotron-Super-49B-v1 | 49 | NVIDIA Open Model |
| nvidia/Llama-3_1-Nemotron-Ultra-253B-v1 | 253 | NVIDIA Open Model |
| **Gemma** |  |  |
| google/gemma-3-1b-it | 1 | Gemma |
| google/gemma-3-4b-it | 4 | Gemma |
| google/gemma-3-12b-it | 12 | Gemma |
| google/gemma-3-27b-it | 27 | Gemma |
| **AllenAI** |  |  |
| allenai/OLMo-2-0325-32B-Instruct | 32 | Apache 2.0 |
| allenai/Llama-3.1-Tulu-3-70B | 70 | Llama 3 |
| allenai/Llama-3.1-Tulu-3-405B | 405 | Llama 3 |
| **Other** |  |  |
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7 | Apache 2.0 |
| moonshotai/Moonlight-16B-A3B-Instruct | 16 | MIT |
| CohereLabs/c4ai-command-a-03-2025 | 111 | CC BY-NC 4.0 |
| deepseek-ai/DeepSeek-V3 | 671 | DeepSeek |

---

License: MIT
Disclaimer: The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.

Citation

If you use our work or the ActiveUltraFeedback datasets or models, please cite us:

@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
      title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning}, 
      author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
      year={2026},
      eprint={2603.09692},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.09692}, 
}
