Gradient Checkpointing, Activation Offloading, and Layer Offloading

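Gradient checkpointing, activation offloading, and layer offloading are techniques for reducing GPU memory usage during training. Each trades some extra work for that saving: gradient checkpointing recomputes activations in the backward pass instead of storing them, while the offloading techniques move activations or frozen parameters off the GPU and pay for it in transfer overhead.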

Enabling Gradient Checkpointing

gradient_checkpointing: true

Enabling Activation Offloading

gradient_checkpointing: true  # required for activation offloading
activation_offloading: true

Activation offloading variants:

The default activation_offloading: true offloads activations to CPU and uses CUDA streams to overlap the data transfers with computation.

The activation_offloading: legacy mode offloads activations to CPU synchronously, without any additional optimizations.

For resource-constrained environments with limited CPU memory, activation_offloading: disk offloads activations to disk instead of CPU RAM, so that much longer context lengths can be trained when CPU RAM would otherwise be the limit.

The activation_offloading: hidden_states mode (ALST-style) is a form of gradient checkpointing that offloads only each layer’s input (the checkpoint’s hidden_states) to CPU: one tensor per layer rather than every activation. The transfer is overlapped with compute on a side CUDA stream. This mode replaces torch’s reentrant checkpoint, so gradient_checkpointing_kwargs.use_reentrant is forced to true. It is framework-agnostic: it works under FSDP2 and is DTensor-aware for sequence/context parallelism.
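To get a feel for how much CPU memory the hidden_states mode needs, the following minimal sketch estimates the size of the per-layer inputs it offloads. It assumes each layer input is a dense tensor of shape (batch, sequence, hidden) stored in a 2-byte dtype such as bf16. The function names and the example model dimensions are hypothetical and are not part of any library.

```python
def hidden_state_bytes(batch_size: int, seq_len: int, hidden_size: int,
                       bytes_per_element: int = 2) -> int:
    """Size in bytes of one layer input of shape (batch, seq, hidden).

    Assumption: a dense tensor in a 2-byte dtype (e.g. bf16) by default.
    """
    return batch_size * seq_len * hidden_size * bytes_per_element


def offloaded_total_bytes(num_layers: int, batch_size: int, seq_len: int,
                          hidden_size: int, bytes_per_element: int = 2) -> int:
    """One checkpointed input per layer is held in CPU memory."""
    per_layer = hidden_state_bytes(batch_size, seq_len, hidden_size,
                                   bytes_per_element)
    return num_layers * per_layer


# Hypothetical example: 32 layers, hidden size 4096, one 131072-token sequence.
GIB = 2 ** 30
per_layer = hidden_state_bytes(batch_size=1, seq_len=131072, hidden_size=4096)
total = offloaded_total_bytes(num_layers=32, batch_size=1, seq_len=131072,
                              hidden_size=4096)
print(per_layer / GIB)  # 1.0 GiB per layer
print(total / GIB)      # 32.0 GiB across all layers
```

Under these assumptions the offloaded tensors grow linearly with sequence length, so the host needs enough free RAM for roughly one such tensor per layer.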

Choosing a mode

| Mode | What moves | Best for |
| --- | --- | --- |
| `true` | All activations → CPU, stream-overlapped | General use; LoRA/QLoRA (offloads instead of recomputing) |
| `legacy` | All activations → CPU, synchronous | Lowest resident GPU memory; when the streamed path’s in-flight buffering inflates peak |
| `disk` | Activations → disk | Severely CPU-RAM-constrained hosts |
| `hidden_states` | Per-layer input only → CPU, stream-overlapped | Long-context full-parameter finetuning; reaches contexts where plain gradient checkpointing OOMs |

hidden_states is designed for full-parameter training. With LoRA/QLoRA the frozen base can break reentrant checkpointing, so prefer activation_offloading: true for adapters.

Enabling Layer Offloading

layer_offloading: true

Layer offloading reduces GPU memory usage by moving frozen (non-trainable) decoder layer parameters to CPU and streaming them back to GPU one layer at a time during the forward and backward passes. This is particularly useful for LoRA/QLoRA training where most of the model’s parameters are frozen — only the trainable adapter weights stay on GPU permanently.

During training, forward and backward hooks on each decoder layer handle the transfer automatically:

Forward pass: Before a layer executes, its frozen params are loaded to GPU. The next layer is prefetched asynchronously on a separate CUDA stream for overlap.
Backward pass: Same pattern in reverse — the current layer’s frozen params are loaded and the previous layer is prefetched.

After each layer finishes, its frozen params are offloaded back to CPU pinned memory.
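The ordering of these hook actions can be sketched as a small pure-Python simulation. This is only an illustration of the schedule described above, not the actual hook implementation: the event names and the `layer_schedule` function are hypothetical, no tensors or CUDA streams are involved, and "load" for a layer that was already prefetched is assumed to amount to waiting for that transfer to finish.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (action, layer index)


def layer_schedule(num_layers: int, backward: bool = False) -> List[Event]:
    """Return the order of load/prefetch/compute/offload events.

    Forward visits layers 0..N-1 and prefetches the next layer;
    backward visits layers N-1..0 and prefetches the previous one.
    """
    step = -1 if backward else 1
    order = range(num_layers - 1, -1, -1) if backward else range(num_layers)
    events: List[Event] = []
    for i in order:
        events.append(("load", i))          # frozen params must be on GPU
        neighbor = i + step
        if 0 <= neighbor < num_layers:
            events.append(("prefetch", neighbor))  # async, overlaps compute
        events.append(("compute", i))
        events.append(("offload", i))       # back to pinned CPU memory
    return events


for event in layer_schedule(3):
    print(event)
for event in layer_schedule(3, backward=True):
    print(event)
```

In this sketch the prefetch is issued before the current layer's compute, which is what allows the transfer to overlap with it.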

This approach trades some CPU-GPU transfer overhead for significant GPU memory savings — the freed memory is roughly equal to the size of all frozen parameters across all decoder layers, minus one layer’s worth that is kept on GPU at any given time.
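As a rough worked example of that estimate, the sketch below computes the freed memory from a per-layer frozen parameter size. The function name, the `resident_layers` parameter, and the example numbers (32 decoder layers, 200 million frozen parameters per layer, 2 bytes per parameter) are hypothetical; it also assumes all decoder layers are the same size.

```python
def estimated_freed_bytes(frozen_bytes_per_layer: int, num_layers: int,
                          resident_layers: int = 1) -> int:
    """Approximate GPU memory freed by layer offloading.

    All frozen decoder-layer parameters leave the GPU except for
    `resident_layers` layers' worth that stay resident at any time.
    """
    offloaded_layers = max(num_layers - resident_layers, 0)
    return frozen_bytes_per_layer * offloaded_layers


# Hypothetical model: 200M frozen params per layer in a 2-byte dtype.
bytes_per_layer = 200_000_000 * 2
freed = estimated_freed_bytes(bytes_per_layer, num_layers=32)
print(f"{freed / 1e9:.1f} GB")  # 12.4 GB
```

Setting `resident_layers=2` gives a more conservative figure for the moments when a prefetched layer is on the GPU alongside the current one.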

Requirements:

CUDA GPU (CPU-only training is not supported for this feature)
Works with any HuggingFace model architecture that uses decoder layers (Llama, Mistral, Qwen, etc.)
Best combined with LoRA/QLoRA where most parameters are frozen