Jaisidh Singh

The Ultra Scale Playbook vol-2: Data Parallelism

Data Parallelism (DP)

TLDR: Instead of accumulating gradients for k steps on a single GPU, we run k replicas of the model on k GPUs, each computing gradients on its own micro-batch in parallel, and average the gradients across GPUs (via an all-reduce) before the optimiser step.
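A minimal sketch of plain DP (assuming torch.distributed is already initialised and each rank receives its own micro-batch; the helper name is illustrative):

```python
import torch
import torch.distributed as dist

def dp_step(model, batch, loss_fn, optimizer, world_size):
    # Each of the k ranks runs this on its own micro-batch.
    optimizer.zero_grad()
    x, y = batch
    loss = loss_fn(model(x), y)
    loss.backward()
    # Naive version: one all-reduce per parameter so every replica ends up
    # with the same averaged gradient and takes an identical optimiser step.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()
```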

Core challenge of DP

The main issue with DP is the communication overhead of all-reducing gradients across all k GPUs at every step. Overlapping gradient computation with communication addresses this well, provided the number of GPUs we perform DP over stays below roughly 512.
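As a sketch of what this overlap can look like (helper names are illustrative; assumes torch.distributed is initialised and PyTorch ≥ 2.1 for register_post_accumulate_grad_hook), a per-parameter hook launches an asynchronous all-reduce as soon as that parameter's gradient is ready, instead of waiting for the whole backward pass to finish:

```python
import torch.distributed as dist

def attach_overlap_hooks(model, pending):
    # Fires once a parameter's gradient has been accumulated, launching a
    # non-blocking all-reduce for just that gradient while backward continues.
    def hook(param):
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, param))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

def finish_grad_sync(pending, world_size):
    # Call after loss.backward() and before optimizer.step(): wait for the
    # outstanding all-reduces, then average.
    for work, param in pending:
        work.wait()
        param.grad.div_(world_size)
    pending.clear()
```

In a training loop this would be: register the hooks once, call loss.backward(), then finish_grad_sync(pending, dist.get_world_size()) before optimizer.step().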

Let's understand this in more detail:

Bucketed-DP

GPU operations on large tensors are more efficient than many operations on small tensors. So, we can bucket layer-wise gradients and launch a single all-reduce for all gradients within one bucket.
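A minimal sketch of bucketing (helper names are illustrative; assumes all gradients share a dtype and device, and the 25 MB default mirrors PyTorch DDP's bucket_cap_mb):

```python
import torch
import torch.distributed as dist

def bucketed_all_reduce(params, world_size, bucket_size_mb=25):
    limit = bucket_size_mb * 1024 * 1024

    def flush(bucket):
        if not bucket:
            return
        grads = [p.grad for p in bucket]
        flat = torch.cat([g.reshape(-1) for g in grads])  # one large tensor
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)        # single collective
        flat.div_(world_size)
        # Copy the reduced values back into the individual .grad tensors.
        offset = 0
        for g in grads:
            g.copy_(flat[offset:offset + g.numel()].view_as(g))
            offset += g.numel()

    bucket, bucket_bytes = [], 0
    for p in params:
        if p.grad is None:
            continue
        bucket.append(p)
        bucket_bytes += p.grad.numel() * p.grad.element_size()
        if bucket_bytes >= limit:
            flush(bucket)
            bucket, bucket_bytes = [], 0
    flush(bucket)  # flush the last, partially filled bucket
```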

Optimising GPU syncs

By default, the GPUs synchronise (all-reduce) gradients after every backward pass. However, when accumulating gradients, a single reduction after the final accumulation step has the same effect as reducing after each of the gradacc steps. This gives us room for further optimisation.

Core idea: skip the gradient all-reduce for the first gradacc − 1 backward passes and only synchronise on the final micro-batch (in PyTorch DDP, this is what the no_sync() context manager does).
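A minimal sketch, assuming ddp_model is already wrapped in torch.nn.parallel.DistributedDataParallel and micro_batches yields gradacc micro-batches (function and variable names are illustrative):

```python
import contextlib

def accumulate_and_step(ddp_model, micro_batches, loss_fn, optimizer, grad_acc):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(micro_batches):
        last = (i == grad_acc - 1)
        # no_sync() suppresses DDP's gradient all-reduce for this backward
        # pass; only the final micro-batch triggers a single synchronisation.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / grad_acc
            loss.backward()
    optimizer.step()
```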

DDP (the full package)

The complete package of DP with these 3 optimisations is sometimes referred to as DDP (distributed data parallelism), where our final global batch size becomes bs = mbs × gradacc × k, with mbs the per-GPU micro-batch size, gradacc the number of gradient accumulation steps, and k the number of GPUs. Note that:

Typically, we prefer to increase k rather than gradacc, since DP across GPUs runs in parallel whereas gradient accumulation is sequential. In practice, gradient accumulation is then added on top of DP to reach a target global batch size.
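As an illustrative example (numbers chosen only for the arithmetic): with mbs = 4 samples per GPU, gradacc = 8 accumulation steps, and k = 16 GPUs, the global batch size is bs = 4 × 8 × 16 = 512 samples per optimiser step.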