# SDFT

Self-Distilled Fine-Tuning (SDFT) is described in [Self-Training with On-Policy Self-Distillation for Language Model Alignment](https://huggingface.co/papers/2601.19897).

The TRL implementation adapts SDFT to the experimental trainer API while reusing the shared self-distillation infrastructure also used by SDPO.

In the current TRL implementation:

- the teacher is the model itself (base weights with the adapter disabled for PEFT, or the same model under `no_grad` for non-PEFT); set `sync_ref_model=True` for an EMA teacher
- the dataset must provide both `prompt` and `privileged_context` columns
- `privileged_context` contains only the extra teacher-only information; the trainer combines it with `prompt` to build the teacher prompt
- `teacher_prompt_template` controls how `prompt` and `privileged_context` are combined into the teacher prompt (see the sketch below)
- on-policy generation can use either the student prompt or the teacher-conditioned prompt, controlled by `generate_from_teacher`
- `num_loss_tokens_to_skip` can exclude initial completion tokens from the distillation loss
- SDFT currently supports text-only training and does not support `use_vllm=True`

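The default `teacher_prompt_template` behaves like a plain Python format string. A minimal sketch of the combination the trainer performs internally:

```python
# Default combination of the student prompt and privileged context into the
# teacher prompt; this format() call illustrates what the trainer does internally.
teacher_prompt_template = "{prompt}\n\n{privileged_context}"

teacher_prompt = teacher_prompt_template.format(
    prompt="Solve 2+2.",
    privileged_context="Example answer: 4.",
)
print(teacher_prompt)
# Solve 2+2.
#
# Example answer: 4.
```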
## Usage

```python
from datasets import Dataset

from trl.experimental.sdft import SDFTConfig, SDFTTrainer

dataset = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
        "privileged_context": ["Example answer: 4."],
    }
)

training_args = SDFTConfig(
    output_dir="sdft-model",
    distillation_alpha=0.5,
    distillation_topk=5,
    max_completion_length=64,
)

trainer = SDFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

To generate from the teacher-conditioned prompt instead of the student prompt, set `generate_from_teacher=True`.
To customize how the teacher prompt is built, set `teacher_prompt_template` on `SDFTConfig`.
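For example, a variant of the configuration above that enables both (the hint-style template is illustrative):

```python
training_args = SDFTConfig(
    output_dir="sdft-model",
    generate_from_teacher=True,  # generate from the teacher-conditioned prompt
    teacher_prompt_template="{prompt}\n\nHint: {privileged_context}",
)
```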

## Expected dataset columns

Each example must provide:

- `prompt`: the student-facing prompt
- `privileged_context`: only the extra teacher-only information, such as a demonstration, hint, or privileged feedback

The trainer's prompt handling supports both standard text prompts and conversational (chat-format) prompts, as sketched below.
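A minimal sketch of both styles with placeholder contents:

```python
from datasets import Dataset

# Standard text prompt.
text_style = Dataset.from_dict(
    {
        "prompt": ["Solve 2+2."],
        "privileged_context": ["Example answer: 4."],
    }
)

# Conversational prompt (list of chat messages).
chat_style = Dataset.from_dict(
    {
        "prompt": [[{"role": "user", "content": "Solve 2+2."}]],
        "privileged_context": ["Example answer: 4."],
    }
)
```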

## Callbacks

The trainer emits a small set of callback hooks that are useful for debugging, observability, and testing; a minimal subscriber is sketched after the hook list below. These hooks are intended as practical integration points for experimental self-distillation workflows.

Shared self-distillation hooks:

- `on_self_distillation_batch_prepared`: fired when a self-distillation batch is ready. The payload includes `prompt_ids`, `completion_ids`, and `old_per_token_logps` when importance-sampling clipping inputs are available.
- `on_generation_batch_built`: fired when a new buffered generation batch is created. The payload includes `generate_every` and `steps_per_generation`.

SDFT-specific hook:

- `on_generation_prompts_selected`: fired when SDFT chooses the prompt source for on-policy generation. The payload includes the selected `generation_prompts` and the corresponding `generation_prompt_text`.

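A minimal sketch of a subscriber for these hooks, assuming they are dispatched through the standard `TrainerCallback` mechanism with the payload passed as keyword arguments (the method signatures here are an assumption of this sketch, not the documented API):

```python
from transformers import TrainerCallback

class SDFTLoggingCallback(TrainerCallback):
    # Assumed signature: hook payloads arrive as keyword arguments.
    def on_generation_prompts_selected(self, args, state, control, **kwargs):
        texts = kwargs.get("generation_prompt_text") or []
        if texts:
            print(f"[step {state.global_step}] generating from: {texts[0][:80]!r}")

    def on_self_distillation_batch_prepared(self, args, state, control, **kwargs):
        completion_ids = kwargs.get("completion_ids")
        if completion_ids is not None:
            print(f"[step {state.global_step}] batch with {len(completion_ids)} completions")
```

It can then be registered like any other callback, e.g. `SDFTTrainer(..., callbacks=[SDFTLoggingCallback()])`.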
## Example script

Use [`trl/experimental/sdft/sdft.py`](https://github.com/huggingface/trl/blob/main/trl/experimental/sdft/sdft.py) to launch SDFT training from the command line. The script supports any causal LM from the Hub, custom local datasets via `--dataset_path`, and PEFT/LoRA via the standard `ModelConfig` flags.

```bash
python trl/experimental/sdft/sdft.py \
    --model_name_or_path Qwen/Qwen3-0.6B \
    --dataset_name your-org/your-dataset \
    --output_dir outputs/sdft-qwen3-0.6b \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --max_prompt_length 1024 \
    --max_completion_length 512 \
    --generate_from_teacher \
    --sync_ref_model \
    --ref_model_sync_steps 1 \
    --ref_model_mixup_alpha 0.01 \
    --eval_strategy steps \
    --eval_steps 50 \
    --report_to wandb
```
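The `--sync_ref_model` flags above configure an EMA teacher. A sketch of the per-parameter update this implies, assuming the TR-DPO-style rule TRL uses for reference-model syncing (`teacher ← α·student + (1−α)·teacher`, with `α = ref_model_mixup_alpha`, applied every `ref_model_sync_steps` optimizer steps):

```python
import torch

alpha = 0.01  # --ref_model_mixup_alpha
teacher_param = torch.ones(3)
student_param = torch.full((3,), 2.0)

# Per parameter, every --ref_model_sync_steps optimizer steps:
# teacher <- alpha * student + (1 - alpha) * teacher
teacher_param.mul_(1 - alpha).add_(student_param, alpha=alpha)
print(teacher_param)  # tensor([1.0100, 1.0100, 1.0100])
```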

## SDFTConfig[[trl.experimental.sdft.SDFTConfig]]

#### trl.experimental.sdft.SDFTConfig[[trl.experimental.sdft.SDFTConfig]]

[Source](https://github.com/huggingface/trl/blob/v1.4.0/trl/experimental/sdft/sdft_config.py#L21)

Configuration class for `SDFTTrainer`.

This adapts the official SDFT implementation to the TRL trainer API while reusing the common self-distillation
configuration shared with SDPO.

**Parameters:**

disable_dropout (`bool`, *optional*, defaults to `True`) : Whether to disable dropout in the student and teacher models.

generate_from_teacher (`bool`, *optional*, defaults to `False`) : Whether on-policy generation should use the teacher-conditioned prompt instead of the student prompt.

teacher_prompt_template (`str`, *optional*, defaults to `"{prompt}\n\n{privileged_context}"`) : Template used to combine the student prompt and privileged context into the teacher prompt.

num_loss_tokens_to_skip (`int`, *optional*, defaults to `0`) : Number of initial completion tokens to exclude from the distillation loss.
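A sketch combining the four fields above, using their documented defaults or small illustrative values (`output_dir` comes from the underlying training arguments):

```python
from trl.experimental.sdft import SDFTConfig

config = SDFTConfig(
    output_dir="sdft-model",
    disable_dropout=True,
    generate_from_teacher=False,
    teacher_prompt_template="{prompt}\n\n{privileged_context}",
    num_loss_tokens_to_skip=2,  # exclude the first 2 completion tokens from the loss
)
```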

## SDFTTrainer[[trl.experimental.sdft.SDFTTrainer]]

#### trl.experimental.sdft.SDFTTrainer[[trl.experimental.sdft.SDFTTrainer]]

[Source](https://github.com/huggingface/trl/blob/v1.4.0/trl/experimental/sdft/sdft_trainer.py#L141)

Trainer for SDFT-style on-policy self-distillation with explicit teacher prompts.

#### train[[trl.experimental.sdft.SDFTTrainer.train]]

[Source](https://github.com/huggingface/trl/blob/v1.4.0/transformers/trainer.py#L1325)

Main training entry point.

**Parameters:**

resume_from_checkpoint (`str` or `bool`, *optional*) : If a `str`, local path to a saved checkpoint as saved by a previous instance of `Trainer`. If a `bool` and equals `True`, load the last checkpoint in *args.output_dir* as saved by a previous instance of `Trainer`. If present, training will resume from the model/optimizer/scheduler states loaded here.

trial (`optuna.Trial` or `dict[str, Any]`, *optional*) : The trial run or the hyperparameter dictionary for hyperparameter search.

ignore_keys_for_eval (`list[str]`, *optional*) : A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training.

**Returns:**

`~trainer_utils.TrainOutput`

Object containing the global step count, training loss, and metrics.
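For example, to resume from the most recent checkpoint in `args.output_dir`:

```python
trainer.train(resume_from_checkpoint=True)
```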

#### save_model[[trl.experimental.sdft.SDFTTrainer.save_model]]

[Source](https://github.com/huggingface/trl/blob/v1.4.0/transformers/trainer.py#L3752)

Will save the model, so you can reload it using `from_pretrained()`.

Will only save from the main process.
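For example, after training:

```python
# Saves the model to args.output_dir by default.
trainer.save_model()
```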

#### push_to_hub[[trl.experimental.sdft.SDFTTrainer.push_to_hub]]

[Source](https://github.com/huggingface/trl/blob/v1.4.0/transformers/trainer.py#L3999)

Upload `self.model` and `self.processing_class` to the 🤗 model hub on the repo `self.args.hub_model_id`.

**Parameters:**

commit_message (`str`, *optional*, defaults to `"End of training"`) : Message to commit while pushing.

blocking (`bool`, *optional*, defaults to `True`) : Whether the function should return only when the `git push` has finished.

token (`str`, *optional*, defaults to `None`) : Token with write permission to overwrite Trainer's original args.

revision (`str`, *optional*) : The git revision to commit from. Defaults to the head of the "main" branch.

kwargs (`dict[str, Any]`, *optional*) : Additional keyword arguments passed along to `~Trainer.create_model_card`.

**Returns:**

The URL of the repository where the model was pushed if `blocking=True`, or a `Future` object tracking the
progress of the commit if `blocking=False`.
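For example, a blocking push with a custom commit message:

```python
url = trainer.push_to_hub(commit_message="End of SDFT training")
print(url)
```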

