Why generic OOM-handling fails for >100GB training datasets
The first time an LLM fine-tune crashes with MemoryError halfway through epoch one, the
instinct is to reach for a familiar Python tool: catch the exception, halve the batch, try
again. That works for inference. It doesn’t work for the data pipeline of a 100 GB-class
training run, and the reason it doesn’t work is mostly architectural.
This post is about why the standard OOM patches fail at training scale, and why the fix that actually holds is the boring one: stream the data in fixed-size batches and never materialise the dataset.
What “OOM-handling” usually means
In a typical Python data application the OOM-handling pattern looks like this:
- Catch MemoryError or a Linux OOM-kill signal.
- Reduce some knob: batch size, prefetch depth, chunk size.
- Retry the failing operation.
- If three retries fail, fall back to disk-backed pandas or spill into swap.
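As a rough illustration, the steps above might be sketched like this. The names run_with_oom_retry and fragile_step are hypothetical, and the "workload" is a stand-in that only pretends to run out of memory; this is a sketch of the generic pattern, not code from any particular library.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_oom_retry(
    items: Sequence[T],
    process_batch: Callable[[Sequence[T]], R],
    batch_size: int = 32,
    max_retries: int = 3,
) -> list[R]:
    """Process items in batches, halving the batch size on MemoryError."""
    results: list[R] = []
    start = 0
    retries = 0
    while start < len(items):
        batch = items[start:start + batch_size]
        try:
            results.append(process_batch(batch))
        except MemoryError:
            retries += 1
            if retries > max_retries or batch_size == 1:
                # Out of knobs: a real application would fall back
                # to disk or swap at this point.
                raise
            batch_size = max(1, batch_size // 2)
            continue  # retry the same offset with the smaller batch
        start += len(batch)
        retries = 0  # a success resets the retry budget
    return results


def fragile_step(batch: Sequence[int]) -> int:
    # Hypothetical workload: pretends anything over 8 rows exhausts memory.
    if len(batch) > 8:
        raise MemoryError
    return len(batch)


# 20 rows: batches of 32 and 16 "fail", then 8 + 8 + 4 rows succeed.
batch_lengths = run_with_oom_retry(list(range(20)), fragile_step)
```

Note what the loop takes for granted: the MemoryError arrives as a catchable Python exception, and a smaller batch actually fits.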
This works because most Python data applications are doing one of two things: either (a) a single transformation on a dataset that already fits in memory, or (b) an inference loop where the bottleneck is one model call at a time. In both cases, the “memory pressure” arrives in discrete pulses you can react to.
LLM fine-tune data preparation is neither of those things.
Three things that break at training scale
The Axolotl-style fine-tune pipeline walks every row of the training corpus at least once per epoch, often with several derived passes (tokenisation, packing, dedupe, shuffling). At 100 GB of Parquet, that introduces three failure modes that the generic OOM patches simply weren’t designed for.
1. The dataset is several times bigger than the memory budget
A 100 GB raw Parquet corpus is 200–300 GB when materialised as Arrow tables, and more again once the rows are tokenised. The host has 64 or 128 GB of RAM. There is no “smaller batch” that fits. Halving the batch from 32 to 16 shrinks the per-step spike, not the baseline footprint of the materialised data. Halving again drops throughput more than it drops the working set. The OOM isn’t a tail event; it’s the steady state.
The HuggingFace datasets library has a streaming mode for this reason, and Axolotl
exposes it. But the default Python path still materialises substantial intermediate
state per shard, especially when compression and decoding are involved. There is
nothing left for the retry loop to shrink.
2. The OOM doesn’t fire from your Python code
When the kernel OOM-killer fires, it doesn’t ask your try/except block politely.
The whole process dies with SIGKILL. There is no Python frame to catch the
exception in. Even if you wrap the trainer entry point in a watchdog, you’ve lost
all of the in-process state — the shuffled order, the per-shard cursor, the partial
tokenisation cache. Restart is expensive and often non-deterministic.
A retry-loop pattern silently assumes the program survives its own memory pressure. At 100 GB it doesn’t.
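The point about SIGKILL can be checked with a small experiment. The sketch below is POSIX-only and uses only the standard library; it sends SIGKILL by hand rather than triggering the real OOM-killer, on the assumption that the child sees the same signal either way. The child wraps its work in try/except/finally, and none of it runs.

```python
import signal
import subprocess
import sys

# Child program: tries to catch everything and always clean up.
CHILD = r"""
import time
try:
    print("ready", flush=True)
    time.sleep(60)
except BaseException:
    print("caught", flush=True)
finally:
    print("cleanup", flush=True)
"""


def kill_child_like_oom_killer() -> tuple[int, str]:
    """Start the child, SIGKILL it, return (returncode, output after kill)."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
    ) as proc:
        # Wait until the child is inside its try block.
        assert proc.stdout.readline().strip() == "ready"
        proc.send_signal(signal.SIGKILL)
        proc.wait(timeout=10)
        leftover = proc.stdout.read()
    return proc.returncode, leftover


returncode, output_after_kill = kill_child_like_oom_killer()
# On POSIX, a negative returncode means "terminated by that signal number".
killed_by_sigkill = returncode == -signal.SIGKILL
```

Neither "caught" nor "cleanup" is ever printed: the except and finally blocks get no chance to run, which is exactly why an in-process retry loop cannot help here.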
3. Swap kills throughput long before it kills the process
The “swap-spillover” workaround (let the kernel push pages out, accept the slowdown) is a non-starter at training scale because the access pattern is wrong. Training data is read sequentially, then shuffled, then read again in a different order. Once shuffling pushes you into swap, every random access becomes a disk seek. The training step that used to take 200 ms now takes seconds. Total wall-clock time blows up by 10–50x. The job finishes, technically, but you’ve wasted an entire GPU-day.
This is the case where the OOM-handling code “succeeds” — the process didn’t crash! — and the team only finds out later when the per-step latency graph looks like a staircase.
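One way to find out sooner is to fail the job when step latency jumps, instead of reading the graph afterwards. The guard below is a minimal sketch: StepLatencyGuard is a hypothetical helper, and the window, warm-up length, and 5x threshold are illustrative assumptions, not tuned values.

```python
from collections import deque
from statistics import median


class StepLatencyGuard:
    """Raise when a step is far slower than the recent median step."""

    def __init__(self, window: int = 50, max_slowdown: float = 5.0, warmup: int = 10):
        self.durations: deque[float] = deque(maxlen=window)
        self.max_slowdown = max_slowdown
        self.warmup = warmup

    def observe(self, seconds: float) -> None:
        # Only judge once there are enough samples for a stable baseline.
        if len(self.durations) >= self.warmup:
            baseline = median(self.durations)
            if seconds > self.max_slowdown * baseline:
                raise RuntimeError(
                    f"step took {seconds:.2f}s against a median of {baseline:.2f}s; "
                    "the host may be swapping"
                )
        self.durations.append(seconds)


guard = StepLatencyGuard()
for _ in range(20):
    guard.observe(0.2)  # healthy 200 ms steps

try:
    guard.observe(2.0)  # a multi-second step, as when faulting to disk
    tripped = False
except RuntimeError:
    tripped = True
```

In a real trainer, observe would be called with the measured duration of each step. A slow step has other possible causes (checkpointing, evaluation), so the exception is a prompt to look at the host, not proof of swapping.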
What works instead
The honest fix is to stop trying to handle OOM. Eliminate the condition that produces it. That means the reader must never hold more than one batch of rows in memory at a time, and the format-decoding cost has to be paid as the data streams through, not on materialised tables.
This is the regime fast-axolotl’s streaming_dataset_reader was built for. From the
README:
    from fast_axolotl import streaming_dataset_reader

    for batch in streaming_dataset_reader(
        "/path/to/large_dataset.parquet",
        dataset_type="parquet",
        batch_size=1000,
        num_threads=4,
    ):
        process(batch)
A few things to notice:
- The function is a generator. It never holds the second batch and the first batch in memory at the same time. The memory ceiling is determined by batch_size: one knob, easy to reason about.
- num_threads=4 parallelises the decode, not the buffering. You get throughput without raising the high-water mark.
- ZSTD and Gzip are decompressed inline; you don’t pay the cost of writing a decompressed staging file to disk first.
The README benchmark on this reader is 77x faster than the HuggingFace-datasets Python streaming path at 50,000 rows (Linux x86_64, Python 3.11, 16 cores). At 100 GB+ the speedup matters less than the property that the reader simply doesn’t OOM, because the working set is bounded by the configured batch size.
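The bounded-working-set property is not specific to any one reader. As a minimal sketch of the idea, here is a standard-library generator over JSON Lines input. It is not fast-axolotl's implementation; iter_batches is a hypothetical name, and JSON Lines is chosen only so the example needs no third-party packages.

```python
import io
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most batch_size parsed rows from a line iterator."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    line_iter = iter(lines)
    while True:
        # islice pulls only the next batch_size lines; the rest of the
        # input is never read ahead or held in memory.
        batch = [json.loads(line) for line in islice(line_iter, batch_size)]
        if not batch:
            return
        yield batch


# Five rows in an in-memory "file"; an open file handle works the same way.
sample = io.StringIO("".join(json.dumps({"id": i}) + "\n" for i in range(5)))
batch_sizes = [len(batch) for batch in iter_batches(sample, batch_size=2)]
```

Peak memory here scales with batch_size and not with the size of the input, because each batch becomes garbage as soon as the consumer moves on to the next one.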
Where dedupe fits
OOM at scale is rarely just a reader problem. The second-largest source of memory
spikes in an Axolotl pipeline is deduplication: the canonical “load all row hashes
into a set, then filter” pattern allocates one Python str per row plus a Python
set entry, easily 200–400 bytes of overhead per row. At 100 GB / 100M rows that’s
a 20–40 GB live set before any training data is loaded.
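The arithmetic behind that estimate is simple enough to write down. The per-row figures are the ones quoted above; the getsizeof call is only there to show that a single hex digest string already accounts for a good share of the per-row cost (the exact number varies by Python version, so none is asserted here).

```python
import hashlib
import sys


def dedupe_live_set_gb(num_rows: int, bytes_per_row: int) -> float:
    """Estimate the live memory of a per-row hash set, in decimal GB."""
    return num_rows * bytes_per_row / 1e9


ROWS = 100_000_000  # 100M rows, as in the text

low_estimate_gb = dedupe_live_set_gb(ROWS, 200)   # 20.0
high_estimate_gb = dedupe_live_set_gb(ROWS, 400)  # 40.0

# Size of one SHA-256 hex digest as a Python str, before counting
# the set's own per-entry overhead.
digest_str_bytes = sys.getsizeof(hashlib.sha256(b"row").hexdigest())
```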
fast-axolotl’s deduplicate_indices flips the storage: the row hashes live in a
Rust-side buffer (~32 bytes each, no Python overhead), the function returns only the
indices to keep, and the hashing itself runs across cores. The README benchmark is
1.9x faster than a single-threaded hashlib loop at 100,000 rows. More importantly,
the live memory cost during the dedupe pass is roughly 1/10th of the equivalent
Python set.
That difference is the gap between “the dedupe pass fits alongside the trainer” and “we have to split deduplication into a separate offline job because it can’t share a node.”
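To make the "return only the indices to keep" interface concrete, here is a single-threaded pure-Python sketch. It is an assumption-laden illustration, not fast-axolotl's deduplicate_indices: first_occurrence_indices is a hypothetical name, the signature is invented for this example, and because the digests still live in a Python set it pays the very overhead the Rust-side buffer avoids.

```python
import hashlib
from typing import Iterable


def first_occurrence_indices(rows: Iterable[str]) -> list[int]:
    """Return the index of the first occurrence of each distinct row."""
    seen: set[bytes] = set()
    keep: list[int] = []
    for index, row in enumerate(rows):
        # 32 raw digest bytes per row instead of the row text itself.
        digest = hashlib.sha256(row.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            keep.append(index)
    return keep


rows = ["a", "b", "a", "c", "b"]
kept = first_occurrence_indices(rows)    # [0, 1, 3]
unique_rows = [rows[i] for i in kept]    # ["a", "b", "c"]
```

Returning indices instead of filtered rows means the caller decides when, and whether, to copy any data.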
The principles, restated
If you have a fine-tune that has flirted with OOM, the lesson isn’t “tune the exception handler.” It’s:
- Bound the working set explicitly. Stream, fix the batch size, and never materialise.
- Move per-row state out of Python. Rust-side buffers cost ~32 bytes per row, not 300.
- Pay the decode cost as you go. Inline decompression for ZSTD and Gzip beats any spill-to-disk dance.
- Treat swap as a failure mode, not a fallback. Once training is faulting to disk, you’ve lost most of the wall-clock budget already.
These aren’t fast-axolotl-specific principles, but they are the principles fast-axolotl is built on. The shim is a way to apply them without rewriting your Axolotl trainer. The honest version of OOM-handling at 100 GB is: don’t.
For the API and the benchmark details, see the README and BENCHMARK.md. If you have a corpus where this reasoning doesn’t hold — where retry-with-smaller-batch does save the job — we’d genuinely like to hear about it on the issue tracker.