Jaisidh Singh

The Ultra Scale Playbook vol-2: Data Parallelism

Data Parallelism (DP)

TLDR: Instead of accumulating gradients for k steps on a single GPU, we run k replicas of the model on k GPUs, each computing gradients on its own micro-batch in parallel, and average the gradients across GPUs (via an all-reduce) before the optimiser step.
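A minimal sketch of plain DP (assuming torch.distributed is already initialised and each rank receives its own micro-batch; the helper name is illustrative):

```python
import torch
import torch.distributed as dist

def dp_step(model, batch, loss_fn, optimizer, world_size):
    # Each of the k ranks runs this on its own micro-batch.
    optimizer.zero_grad()
    x, y = batch
    loss = loss_fn(model(x), y)
    loss.backward()
    # Naive version: one all-reduce per parameter so every replica ends up
    # with the same averaged gradient and takes an identical optimiser step.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()
```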

Core challenge of DP

The main issue with DP is the communication overhead of all-reducing gradients across all k GPUs at every step. Overlapping gradient computation with communication addresses this well, provided the number of GPUs we perform DP over stays below roughly 512.
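As a sketch of what this overlap can look like (helper names are illustrative; assumes torch.distributed is initialised and PyTorch ≥ 2.1 for register_post_accumulate_grad_hook), a per-parameter hook launches an asynchronous all-reduce as soon as that parameter's gradient is ready, instead of waiting for the whole backward pass to finish:

```python
import torch.distributed as dist

def attach_overlap_hooks(model, pending):
    # Fires once a parameter's gradient has been accumulated, launching a
    # non-blocking all-reduce for just that gradient while backward continues.
    def hook(param):
        work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, param))

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

def finish_grad_sync(pending, world_size):
    # Call after loss.backward() and before optimizer.step(): wait for the
    # outstanding all-reduces, then average.
    for work, param in pending:
        work.wait()
        param.grad.div_(world_size)
    pending.clear()
```

In a training loop this would be: register the hooks once, call loss.backward(), then finish_grad_sync(pending, dist.get_world_size()) before optimizer.step().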

Let's understand this in more detail:

Bucketed-DP

GPU operations on large tensors are more efficient than many operations on small tensors. So, we can bucket layer-wise gradients and launch a single all-reduce for all gradients within one bucket.
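A minimal sketch of bucketing (helper names are illustrative; assumes all gradients share a dtype and device, and the 25 MB default mirrors PyTorch DDP's bucket_cap_mb):

```python
import torch
import torch.distributed as dist

def bucketed_all_reduce(params, world_size, bucket_size_mb=25):
    limit = bucket_size_mb * 1024 * 1024

    def flush(bucket):
        if not bucket:
            return
        grads = [p.grad for p in bucket]
        flat = torch.cat([g.reshape(-1) for g in grads])  # one large tensor
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)        # single collective
        flat.div_(world_size)
        # Copy the reduced values back into the individual .grad tensors.
        offset = 0
        for g in grads:
            g.copy_(flat[offset:offset + g.numel()].view_as(g))
            offset += g.numel()

    bucket, bucket_bytes = [], 0
    for p in params:
        if p.grad is None:
            continue
        bucket.append(p)
        bucket_bytes += p.grad.numel() * p.grad.element_size()
        if bucket_bytes >= limit:
            flush(bucket)
            bucket, bucket_bytes = [], 0
    flush(bucket)  # flush the last, partially filled bucket
```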

Optimising GPU syncs

By default, the GPUs synchronise (all-reduce) gradients after every backward pass. However, when accumulating gradients, a single reduction after the final accumulation step has the same effect as reducing after each of the gradacc steps. This gives us room for further optimisation.

Core idea: skip the gradient all-reduce for the first gradacc − 1 backward passes and only synchronise on the final micro-batch (in PyTorch DDP, this is what the no_sync() context manager does).
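A minimal sketch, assuming ddp_model is already wrapped in torch.nn.parallel.DistributedDataParallel and micro_batches yields gradacc micro-batches (function and variable names are illustrative):

```python
import contextlib

def accumulate_and_step(ddp_model, micro_batches, loss_fn, optimizer, grad_acc):
    optimizer.zero_grad()
    for i, (x, y) in enumerate(micro_batches):
        last = (i == grad_acc - 1)
        # no_sync() suppresses DDP's gradient all-reduce for this backward
        # pass; only the final micro-batch triggers a single synchronisation.
        ctx = contextlib.nullcontext() if last else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y) / grad_acc
            loss.backward()
    optimizer.step()
```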

DDP (the full package)

The complete package of DP with these 3 optimisations is sometimes referred to as DDP (distributed data parallelism), where our final global batch size becomes bs = mbs × gradacc × k, with mbs the per-GPU micro-batch size, gradacc the number of gradient accumulation steps, and k the number of GPUs. Note that:

Typically, we prefer to increase k rather than gradacc, since DP across GPUs runs in parallel whereas gradient accumulation is sequential. In practice, gradient accumulation is then added on top of DP to reach a target global batch size.
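As an illustrative example (numbers chosen only for the arithmetic): with mbs = 4 samples per GPU, gradacc = 8 accumulation steps, and k = 16 GPUs, the global batch size is bs = 4 × 8 × 16 = 512 samples per optimiser step.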