DDP

DistributedDataParallel (DDP) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.

                         ┌─────────────────┐
                         │  training data  │
                         └────────┬────────┘
               ┌──────────────────┼──────────────────┐
               │ shard 0          │ shard 1          │ shard 2
               ▼                  ▼                  ▼
        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
        │   model     │    │   model     │    │   model     │
        │  (copy 0)   │    │  (copy 1)   │    │  (copy 2)   │
        │   GPU 0     │    │   GPU 1     │    │   GPU 2     │
        └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
               │ grads            │ grads            │ grads
               └──────────────────┼──────────────────┘
                               all-reduce
                          (average gradients)
               ┌──────────────────┼──────────────────┐
               ▼                  ▼                  ▼
        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
        │  optimizer  │    │  optimizer  │    │  optimizer  │
        │    step     │    │    step     │    │    step     │
        └─────────────┘    └─────────────┘    └─────────────┘
          (identical)        (identical)        (identical)

DDP activates automatically when you launch with a multi-process launcher like Accelerate.

# 4 GPUs on one machine
accelerate launch --num_processes 4 train.py

Configure DDP

Pass these TrainingArguments to control DDP behavior.

gradient_accumulation_steps() determines when to perform the all-reduce. Trainer skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with gradient_accumulation_steps=4, the all-reduce runs every 4 backward passes.
~TrainingArguments.ddp_find_unused_parameters traverses the autograd graph at the end of the forward pass for parameters that won’t receive a gradient and marks them as ready so they don’t block the all-reduce. Don’t use with gradient_checkpointing() because gradient checkpointing discards intermediate activations and recomputes them on the fly.
~TrainingArguments.ddp_bucket_cap_mb is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
~TrainingArguments.ddp_broadcast_buffers synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don’t use with gradient_checkpointing().
~TrainingArguments.ddp_backend sets the communication backend. Use "nccl" for NVIDIA GPUs (default and fastest), "gloo" for CPU training or debugging, and "xccl", "hccl", or "cncl" for other hardware.
ddp_timeout() sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.

from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    gradient_accumulation_steps=4,
    ddp_backend="nccl",
    ddp_find_unused_parameters=False,
    ddp_bucket_cap_mb=25,
    ddp_broadcast_buffers=True,
    ddp_timeout=1800,
)

Next steps

See FSDP for training models too large to fit on a single GPU.
See DeepSpeed for ZeRO optimization and offloading.
Read the Data Parallelism chapter from The Ultra-Scale Playbook for more information about how DDP works.

Update on GitHub

Transformers

DDP

Configure DDP

Next steps