Gradient Checkpointing, Activation Offloading, and Layer Offloading
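Gradient checkpointing, activation offloading, and layer offloading all reduce GPU memory usage during training. Gradient checkpointing recomputes activations in the backward pass instead of storing them, while the offloading techniques move tensors off the GPU until they are needed. Each trades some extra compute or transfer time for a smaller memory footprint.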
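To make the recompute-for-memory trade concrete, here is a minimal sketch using PyTorch's `torch.utils.checkpoint`. The small block stack and the helper names (`forward`, `input_grad`) are illustrative assumptions, not the code this library runs when `gradient_checkpointing: true` is set.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# A small stack of blocks standing in for decoder layers (illustrative only).
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(16, 16), nn.GELU()) for _ in range(4)]
)
x0 = torch.randn(8, 16)


def forward(x, use_checkpointing):
    for block in blocks:
        if use_checkpointing:
            # Keep only the block input; intermediates inside the block are
            # recomputed when backward reaches this block.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x.sum()


def input_grad(use_checkpointing):
    # Fresh leaf tensor so each run produces its own gradient.
    x = x0.clone().requires_grad_(True)
    forward(x, use_checkpointing).backward()
    return x.grad


grad_plain = input_grad(use_checkpointing=False)
grad_ckpt = input_grad(use_checkpointing=True)

# Checkpointing changes memory use and compute time, not the result.
grads_match = torch.allclose(grad_plain, grad_ckpt)
print("gradients match:", grads_match)
```

The checkpointed run stores fewer intermediate tensors at the cost of a second forward pass through each block during backward.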
Enabling Gradient Checkpointing
gradient_checkpointing: true

Enabling Activation Offloading
gradient_checkpointing: true # required for activation offloading
activation_offloading: true

Activation offloading variants:
The default activation_offloading: true offloads activations to CPU and uses CUDA streams
to overlap the communications and computations when offloading.
The activation_offloading: legacy mode naively offloads activations to CPU, without additional optimizations.
For resource-constrained environments with limited CPU memory, activation_offloading: disk offloads
activations to disk instead of CPU RAM so that much larger context lengths can be trained with minimal memory.
The activation_offloading: hidden_states mode (ALST-style) is gradient checkpointing that offloads only
the per-layer input (the checkpoint’s hidden_states) to CPU — one tensor per layer rather than every
activation — overlapped with compute on a side CUDA stream. It replaces torch’s reentrant checkpoint
(so gradient_checkpointing_kwargs.use_reentrant is forced to true) and is framework-agnostic
(works under FSDP2, DTensor-aware for sequence/context parallelism).
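The simplest idea among the variants above, copying every tensor saved for backward to CPU with no stream overlap, can be approximated with PyTorch's built-in `torch.autograd.graph.save_on_cpu` context manager. This is a sketch of the general mechanism only; it is an assumption that it resembles the `legacy` mode, and the toy model is illustrative.

```python
import torch
from torch import nn
from torch.autograd.graph import save_on_cpu

torch.manual_seed(0)

# Falls back to CPU when no GPU is present; the hooks still run there,
# although nothing is actually moved between devices in that case.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 1)).to(device)
x = torch.randn(4, 32, device=device)

# Reference gradients with saved tensors kept on the compute device.
model(x).sum().backward()
reference = [p.grad.clone() for p in model.parameters()]
model.zero_grad()

# Tensors saved for backward inside this context are copied to CPU during
# the forward pass and copied back when backward needs them.
with save_on_cpu(pin_memory=torch.cuda.is_available()):
    loss = model(x).sum()
loss.backward()

offloaded = [p.grad for p in model.parameters()]
same_grads = all(torch.allclose(a, b) for a, b in zip(reference, offloaded))
print("gradients match:", same_grads)
```

Because the copies here are issued on the main stream, compute waits for each transfer. Overlapping those copies with compute on a separate CUDA stream is what the default mode is described as adding.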
Choosing a mode
| Mode | What moves | Best for |
|---|---|---|
| true | All activations → CPU, stream-overlapped | General use; LoRA/QLoRA (offloads instead of recomputing) |
| legacy | All activations → CPU, synchronous | Lowest resident GPU memory; when the streamed path’s in-flight buffering inflates peak |
| disk | Activations → disk | Severely CPU-RAM-constrained hosts |
| hidden_states | Per-layer input only → CPU, stream-overlapped | Long-context full-parameter finetuning; reaches contexts where plain gradient checkpointing OOMs |
hidden_states is designed for full-parameter training. With LoRA/QLoRA the frozen base can break reentrant
checkpointing, so prefer activation_offloading: true for adapters.
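The guidance above can be restated as a small decision helper. The function `suggest_activation_offloading`, its parameters, and the order in which the conditions are checked are all hypothetical, written here only to summarize the table; it is not part of any library.

```python
from typing import Union


def suggest_activation_offloading(
    full_parameter: bool,
    long_context: bool = False,
    cpu_ram_limited: bool = False,
    streamed_peak_too_high: bool = False,
) -> Union[bool, str]:
    """Hypothetical helper: map a training setup to an activation_offloading value."""
    if cpu_ram_limited:
        # Host RAM is the bottleneck, so spill activations to disk.
        return "disk"
    if full_parameter and long_context:
        # Only full-parameter runs; adapters should avoid this mode.
        return "hidden_states"
    if streamed_peak_too_high:
        # Synchronous path avoids the streamed path's in-flight buffers.
        return "legacy"
    return True


print(suggest_activation_offloading(full_parameter=False, long_context=True))
print(suggest_activation_offloading(full_parameter=True, long_context=True))
```

Note that an adapter run with a long context still maps to `true`, reflecting the recommendation to avoid `hidden_states` with LoRA/QLoRA.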
Enabling Layer Offloading
layer_offloading: true

Layer offloading reduces GPU memory usage by moving frozen (non-trainable) decoder layer parameters to CPU and streaming them back to GPU one layer at a time during the forward and backward passes. This is particularly useful for LoRA/QLoRA training, where most of the model’s parameters are frozen: only the trainable adapter weights stay on GPU permanently.
During training, forward and backward hooks on each decoder layer handle the transfer automatically:
- Forward pass: Before a layer executes, its frozen params are loaded to GPU. The next layer is prefetched asynchronously on a separate CUDA stream for overlap.
- Backward pass: Same pattern in reverse — the current layer’s frozen params are loaded and the previous layer is prefetched.
After each layer finishes, its frozen params are offloaded back to CPU pinned memory.
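The ordering described above can be sketched as a plain schedule. This models only which layer runs and which is prefetched at each step; the function name `layer_schedule` is hypothetical, and the real hooks, pinned memory, and CUDA streams are not reproduced here.

```python
from typing import List, Optional, Tuple


def layer_schedule(
    num_layers: int, backward: bool = False
) -> List[Tuple[int, Optional[int]]]:
    """Return (layer_to_run, layer_to_prefetch) pairs in execution order."""
    order = list(range(num_layers))
    if backward:
        # The backward pass visits layers last to first.
        order.reverse()
    schedule = []
    for position, layer in enumerate(order):
        # Prefetch whichever layer runs next, if there is one.
        upcoming = order[position + 1] if position + 1 < len(order) else None
        schedule.append((layer, upcoming))
    return schedule


for phase, is_backward in (("forward", False), ("backward", True)):
    for layer, upcoming in layer_schedule(3, backward=is_backward):
        print(f"{phase}: run layer {layer}, prefetch {upcoming}, then offload layer {layer}")
```

In the forward direction the prefetch target is the next layer, and in the backward direction it is the previous one, matching the two bullet points above.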
This approach trades some CPU-GPU transfer overhead for significant GPU memory savings — the freed memory is roughly equal to the size of all frozen parameters across all decoder layers, minus one layer’s worth that is kept on GPU at any given time.
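As a rough worked example of that estimate, the sketch below computes the freed memory from a per-layer frozen parameter count. The helper name `freed_gpu_bytes` and the figures used (32 layers, 200 million frozen parameters per layer, 2 bytes per parameter) are illustrative assumptions, not measurements of any particular model.

```python
def freed_gpu_bytes(
    frozen_params_per_layer: int, num_layers: int, bytes_per_param: int
) -> int:
    """Estimate freed GPU memory: all frozen layer weights minus one resident layer."""
    per_layer_bytes = frozen_params_per_layer * bytes_per_param
    return per_layer_bytes * (num_layers - 1)


# Illustrative numbers only: 32 decoder layers, 200M frozen params each,
# stored in a 2-byte dtype such as bf16.
estimate = freed_gpu_bytes(
    frozen_params_per_layer=200_000_000, num_layers=32, bytes_per_param=2
)
print(f"approximately {estimate / 1e9:.1f} GB freed")
```

The estimate ignores trainable adapter weights, which stay on the GPU, and any extra buffer used while the next layer is being prefetched.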
Requirements:
- CUDA GPU (CPU-only training is not supported for this feature)
- Works with any HuggingFace model architecture that uses decoder layers (Llama, Mistral, Qwen, etc.)
- Best combined with LoRA/QLoRA where most parameters are frozen