Transformers documentation

DDP

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

DDP

DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.

                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚  training data  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚ shard 0          β”‚ shard 1          β”‚ shard 2
               β–Ό                  β–Ό                  β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚   model     β”‚    β”‚   model     β”‚    β”‚   model     β”‚
        β”‚  (copy 0)   β”‚    β”‚  (copy 1)   β”‚    β”‚  (copy 2)   β”‚
        β”‚   GPU 0     β”‚    β”‚   GPU 1     β”‚    β”‚   GPU 2     β”‚
        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
               β”‚ grads            β”‚ grads            β”‚ grads
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               all-reduce
                          (average gradients)
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β–Ό                  β–Ό                  β–Ό
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  optimizer  β”‚    β”‚  optimizer  β”‚    β”‚  optimizer  β”‚
        β”‚    step     β”‚    β”‚    step     β”‚    β”‚    step     β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          (identical)        (identical)        (identical)

DDP activates automatically when you launch with a multi-process launcher like Accelerate.

# 4 GPUs on one machine
accelerate launch --num_processes 4 train.py

Configure DDP

Pass these TrainingArguments to control DDP behavior.

  • gradient_accumulation_steps() determines when to perform the all-reduce. Trainer skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with gradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.
  • ~TrainingArguments.ddp_find_unused_parameters traverses the autograd graph at the end of the forward pass for parameters that won’t receive a gradient and marks them as ready so they don’t block the all-reduce. Don’t use with gradient_checkpointing() because gradient checkpointing discards intermediate activations and recomputes them on the fly.
  • ~TrainingArguments.ddp_bucket_cap_mb is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
  • ~TrainingArguments.ddp_broadcast_buffers synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don’t use with gradient_checkpointing().
  • ~TrainingArguments.ddp_backend sets the communication backend. Use "nccl" for NVIDIA GPUs (default and fastest), "gloo" for CPU training or debugging, and "xccl", "hccl", or "cncl" for other hardware.
  • ddp_timeout() sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.
from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    gradient_accumulation_steps=4,
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=25,
    ddp_broadcast_buffers=True,
    ddp_timeout=1800,
)

Next steps

  • See FSDP for training models too large to fit on a single GPU.
  • See DeepSpeed for ZeRO optimization and offloading.
  • Read the Data Parallelism chapter from The Ultra-Scale Playbook for more information about how DDP works.
Update on GitHub