Measuring Axolotl throughput on memory-bound workloads
The way most people benchmark an Axolotl fine-tune is to read tokens-per-second off the trainer log and stop there. That number is useful when the bottleneck is the GPU. It is misleading the moment the bottleneck moves to the data pipeline, which is exactly the regime fast-axolotl is trying to fix.
This post is about how to measure throughput honestly when memory is the constraint, and how to read the fast-axolotl README benchmark in that light.
Why tokens-per-second isn’t enough
tokens/sec measures the rate at which the model consumes batches. It’s a
reasonable proxy for end-to-end throughput when:
- the data loader can keep the GPU fed without stalling,
- the dataset fits comfortably in host memory plus prefetch buffer,
- the training step is the longest pole.
Take any of those assumptions away and the metric starts lying. If the loader
takes 2 seconds to assemble a batch and the forward pass takes 200 ms, that step
spends ten times longer waiting than computing, yet the averaged tokens/sec on
the log line can still look fine. Worse, the stall is intermittent (between
shard boundaries, during shuffling, after a dedupe pass), so it shows up as
variance rather than as a lower baseline.
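To make the averaging effect concrete, here is a small worked example. Every number in it is an illustrative assumption (batch size in tokens, step count, how many steps stall), not a measurement from any real run.

```python
# Illustrative numbers only: 100 steps of 8192 tokens each (assumed),
# a 200 ms compute step, and 5 steps that each wait 2 s on the loader.
TOKENS_PER_BATCH = 8192
STEP_SECONDS = 0.2
STALL_SECONDS = 2.0
TOTAL_STEPS = 100
STALLED_STEPS = 5

compute_time = TOTAL_STEPS * STEP_SECONDS
stall_time = STALLED_STEPS * STALL_SECONDS
wall_clock = compute_time + stall_time

# What a trainer log averaged over the whole window would report.
logged_tokens_per_sec = TOTAL_STEPS * TOKENS_PER_BATCH / wall_clock
# What the GPU could sustain if it never waited on the loader.
unstalled_tokens_per_sec = TOKENS_PER_BATCH / STEP_SECONDS
# Share of wall-clock spent waiting for data.
stall_fraction = stall_time / wall_clock

print(f"logged average : {logged_tokens_per_sec:,.0f} tokens/sec")
print(f"without stalls : {unstalled_tokens_per_sec:,.0f} tokens/sec")
print(f"stall fraction : {stall_fraction:.0%}")
```

With these assumed inputs, only 5 of 100 steps stall, but a third of the wall-clock goes to waiting. The logged average gives no hint of that split.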
For memory-bound workloads you need a different vocabulary.
Three measurements that actually matter
When the data pipeline is suspect, the three measurements we look at first are:
Reader throughput (rows/sec). How many input rows can your reader produce per
second, measured at the boundary between the reader and the trainer? This is the
ceiling on tokens/sec. The README’s 77x speedup is exactly this measurement:
streaming_dataset_reader produces 50,000 Parquet rows in 0.0094s; the
HuggingFace-datasets Python equivalent takes 0.7237s. Reader rate goes from ~69k
rows/sec to ~5.3M rows/sec.
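A minimal sketch of how one might take this measurement on an arbitrary reader, plus the arithmetic behind the two rates quoted above. The helper names time_reader and rows_per_second are ours for illustration; the sketch assumes the reader is a plain Python iterable that yields one row per item.

```python
import time
from typing import Iterable, Tuple


def rows_per_second(rows: int, seconds: float) -> float:
    """Convert a row count and an elapsed time into a rate."""
    return rows / seconds


def time_reader(reader: Iterable, max_rows: int) -> Tuple[int, float]:
    """Drain up to max_rows items and return (rows_read, elapsed_seconds).

    Nothing is done with the rows, so this isolates the reader from the trainer.
    """
    rows = 0
    start = time.perf_counter()
    for _ in reader:
        rows += 1
        if rows >= max_rows:
            break
    return rows, time.perf_counter() - start


# The README figures quoted above: 50,000 rows in 0.7237 s vs 0.0094 s.
python_rate = rows_per_second(50_000, 0.7237)
rust_rate = rows_per_second(50_000, 0.0094)
print(f"Python reader: {python_rate:,.0f} rows/sec")
print(f"Rust reader  : {rust_rate:,.0f} rows/sec")
print(f"ratio        : {rust_rate / python_rate:.1f}x")
```

The ratio of the rounded timings comes out just under 77x. The 77.26x in the benchmark table presumably comes from unrounded timings. If your reader yields batches rather than rows, multiply the count by the batch size before converting to a rate.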
Working-set size. What’s the peak resident memory of the data pipeline during
a steady-state batch fetch, measured as RSS (for example via psutil.Process().memory_info().rss)?
A reader that doesn’t materialise (like fast-axolotl’s streaming reader) holds a
bounded working set proportional to batch_size, not to the dataset. That bound
is the property that matters at 100 GB+; the throughput number is a consequence.
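One way to sample the working set is to wrap the loader and record RSS after every fetch. This is a rough sketch under stated assumptions: it is Linux-only, the RssTracker class is a hypothetical helper of ours, and it reads /proc/self/statm instead of using psutil so that it needs nothing beyond the standard library.

```python
import os
from typing import Callable, Iterable, Iterator


def current_rss_bytes() -> int:
    """Current resident set size of this process (Linux only).

    The second field of /proc/self/statm is the resident size in pages.
    """
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


class RssTracker:
    """Iterate over a loader unchanged while recording the highest RSS seen."""

    def __init__(self, loader: Iterable,
                 rss_fn: Callable[[], int] = current_rss_bytes) -> None:
        self.loader = loader
        self.rss_fn = rss_fn
        self.peak_bytes = 0
        self.samples = 0

    def __iter__(self) -> Iterator:
        for batch in self.loader:
            # Sample right after the fetch, when the new batch is resident.
            self.peak_bytes = max(self.peak_bytes, self.rss_fn())
            self.samples += 1
            yield batch


# Usage sketch:
#   tracked = RssTracker(loader)
#   for batch in tracked:
#       train_step(batch)
#   print(tracked.peak_bytes / 2**20, "MiB peak across", tracked.samples, "fetches")
```

Sampling between fetches can miss a spike that rises and falls inside a single fetch, so treat the result as a lower bound on the true peak.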
Stall fraction. What percentage of training-step wall-clock is spent waiting
on the next batch? torch.profiler will tell you this, but a poor man’s version
is time.monotonic() deltas around the for batch in loader line. If your stall
fraction is above 10%, your bottleneck is the reader regardless of what
tokens/sec says.
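The poor man’s version can be sketched as follows. The function name measure_stall_fraction and the step_fn callback are hypothetical; the clock parameter exists only so the timing source can be swapped out.

```python
import time
from typing import Any, Callable, Iterable


def measure_stall_fraction(loader: Iterable,
                           step_fn: Callable[[Any], None],
                           clock: Callable[[], float] = time.monotonic) -> float:
    """Return the share of wall-clock spent waiting for the next batch."""
    waited = 0.0
    batches = iter(loader)
    start = clock()
    while True:
        fetch_start = clock()
        try:
            batch = next(batches)   # time spent here is loader wait
        except StopIteration:
            break
        waited += clock() - fetch_start
        step_fn(batch)              # time spent here is the training step
    total = clock() - start
    return waited / total if total > 0 else 0.0


# Usage sketch:
#   fraction = measure_stall_fraction(loader, train_step)
#   print(f"stall fraction: {fraction:.1%}")
```

One caveat: on a GPU, kernels are launched asynchronously, so step_fn may return before the device has finished. In that case some of the measured wait overlaps with GPU work, and the figure can overstate how long the GPU actually sat idle.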
These three numbers, together, tell you whether the trainer is compute-bound or data-bound. Tokens-per-second on its own can’t.
What the README benchmark actually measures
Worth reading the fast-axolotl BENCHMARK.md carefully, because it’s specific
about scope:
Tested on Linux x86_64, Python 3.11.13, 16 CPU cores, 62 GB RAM.
Operation | Data size | Rust | Python | Speedup
Streaming Data Loading (Parquet) | 50,000 rows | 0.0094s | 0.7237s | 77.26x
That’s reader throughput, isolated. It is not an end-to-end fine-tune speedup claim. It’s the number you’d see if you replaced the Python streaming path with the Rust one and benchmarked the reader alone at 50k rows. End-to-end speedup depends on what fraction of your training-step wall-clock that reader was responsible for. If you were 80% stall-bound on the reader, the impact is huge. If you were 5% stall-bound, the impact is modest.
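The dependence on stall fraction is Amdahl’s law, and a few lines make the two cases concrete. This sketch assumes that the whole stall is attributable to the reader, that loading and compute do not overlap, and that the isolated 77x carries over to your data. All three are simplifications.

```python
def end_to_end_speedup(stall_fraction: float, reader_speedup: float) -> float:
    """Amdahl's law: only the stalled share of wall-clock gets faster."""
    if not 0.0 <= stall_fraction <= 1.0:
        raise ValueError("stall_fraction must be between 0 and 1")
    if reader_speedup <= 0:
        raise ValueError("reader_speedup must be positive")
    return 1.0 / ((1.0 - stall_fraction) + stall_fraction / reader_speedup)


for fraction in (0.80, 0.05):
    speedup = end_to_end_speedup(fraction, 77.0)
    print(f"{fraction:.0%} stall-bound -> {speedup:.2f}x end to end")
```

Under those assumptions the 80% case works out to roughly 4.75x and the 5% case to roughly 1.05x. Even a 77x reader cannot deliver more than 1 / (1 - stall fraction) end to end.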
Same caveat applies to the parallel-hashing number:
Parallel Hashing (SHA256) | 100,000 rows | 0.0273s | 0.0520s | 1.90x
That’s parallel_hash_rows at 100k rows, multi-threaded, against a single-threaded
hashlib loop. If your dedupe pass is a measurable chunk of your data-prep budget
(it often is at 100M+ rows), a 1.9x speedup matters. If you only dedupe once at
ingestion time, it doesn’t.
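If you want a baseline of your own, a single-threaded hashlib loop is easy to time. The sketch below uses synthetic rows of an assumed size (10,000 rows of 256 bytes); make_rows and time_hasher are hypothetical helper names. The call shape of parallel_hash_rows is not shown here, so wrap it in a small adapter before passing it to time_hasher.

```python
import hashlib
import os
import time
from typing import Callable, List, Sequence


def sha256_rows(rows: Sequence[bytes]) -> List[str]:
    """Single-threaded baseline: one SHA-256 hex digest per row."""
    return [hashlib.sha256(row).hexdigest() for row in rows]


def make_rows(count: int, row_bytes: int) -> List[bytes]:
    """Synthetic fixed-size rows; substitute a sample of your real data."""
    return [os.urandom(row_bytes) for _ in range(count)]


def time_hasher(hasher: Callable[[Sequence[bytes]], object],
                rows: Sequence[bytes],
                repeats: int = 3) -> float:
    """Best-of-N wall-clock seconds for hashing every row once."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hasher(rows)
        best = min(best, time.perf_counter() - start)
    return best


rows = make_rows(10_000, 256)   # assumed size; use your real row sizes
baseline_seconds = time_hasher(sha256_rows, rows)
print(f"hashlib baseline: {baseline_seconds:.4f}s for {len(rows):,} rows")
```

Row size matters as much as row count here: short rows are dominated by per-call overhead and long rows by the hash itself, so the ratio you see may differ from the README’s.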
And the two cases where the benchmark goes the other way:
Token Packing | 10,000 sequences | 0.0786s | 0.0327s | 0.42x
Batch Padding | 10,000 sequences | 0.1998s | 0.1051s | 0.53x
The README notes the cause: FFI overhead at small batch sizes. The Python implementations are doing a few list-comprehensions; the Rust implementations are crossing the Python/C boundary for each call. At 10k sequences, that boundary cost dominates. At LLM-training sizes — packing real datasets into 32k-token sequences with millions of training rows — the relative cost flips. The benchmark is honest about its scope; reading it as “fast-axolotl is slower at packing” is the wrong conclusion. Reading it as “fast-axolotl is slower at this exact benchmark configuration” is the right one.
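Rather than trusting a single configuration, it is cheap to sweep the input size and watch how the ratio moves. Below is a generic harness for doing that. The two padding functions are pure-Python stand-ins we made up so the sketch runs on its own; in practice you would pass your Python baseline and a wrapper around the accelerated call.

```python
import time
from typing import Callable, Dict, List, Sequence


def best_time(fn: Callable, data, repeats: int = 3) -> float:
    """Best-of-N wall-clock seconds for one call of fn(data)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best


def sweep(baseline: Callable, candidate: Callable,
          make_input: Callable[[int], object],
          sizes: Sequence[int], repeats: int = 3) -> List[Dict[str, float]]:
    """Time both implementations at each size and report baseline/candidate."""
    results = []
    for n in sizes:
        data = make_input(n)
        t_base = best_time(baseline, data, repeats)
        t_cand = best_time(candidate, data, repeats)
        speedup = t_base / t_cand if t_cand > 0 else float("inf")
        results.append({"size": n, "baseline_s": t_base,
                        "candidate_s": t_cand, "speedup": speedup})
    return results


# Stand-in implementations (hypothetical): right-pad with zeros to the longest row.
def pad_loop(seqs):
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        row = list(s)
        row.extend([0] * (width - len(row)))
        out.append(row)
    return out


def pad_comprehension(seqs):
    width = max(len(s) for s in seqs)
    return [s + [0] * (width - len(s)) for s in seqs]


def make_sequences(n):
    # Deterministic lengths between 1 and 64 tokens.
    return [[1] * (1 + i % 64) for i in range(n)]


results = sweep(pad_loop, pad_comprehension, make_sequences, [1_000, 10_000])
for row in results:
    print(f"{row['size']:>8,} sequences: {row['speedup']:.2f}x")
```

A speedup column that changes with size is the signature of a fixed per-call cost being amortised. A column that stays flat means the overhead scales with the data and bigger batches will not help.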
A measurement protocol that won’t lie
If you’re considering fast-axolotl (or any accelerator) for your specific Axolotl workload, the protocol we’d suggest is:
- Establish a baseline working set. Run the trainer for one epoch without fast-axolotl. Record peak RSS, average tokens/sec, and stall fraction measured around the loader.
- Apply the shim, no config changes. import fast_axolotl before import axolotl. Don’t touch the YAML yet. Rerun. Compare.
- Opt into streaming for large datasets. Set dataset_use_rust_streaming: true and bump sequence_len. Rerun. The README notes streaming auto-engages for files > 1 GB and sequences > 10,000 tokens.
- Watch the stall fraction, not just tokens/sec. If it drops, the data pipeline was the bottleneck and you got the speedup. If tokens/sec doesn’t move but stall fraction was already near zero, you were already compute-bound; fast-axolotl can’t help that.
- Measure dedupe separately if you do it. A standalone benchmark of parallel_hash_rows on your real row sizes is more informative than guessing from the README’s 100k-row number.
This is the protocol we use ourselves when answering “does this help?” questions on the tracker. It rarely produces a single headline number — it produces a matrix — but the matrix is the honest thing.
A small detour on system-level signals
One under-appreciated source of throughput information on memory-bound workloads is the kernel itself. Three numbers worth correlating against your stall-fraction graph:
- /proc/PID/status VmRSS and VmHWM. Steady-state resident-set size and its high-water mark. (VmPeak in the same file is the peak virtual size, not resident memory, so it is not the number to compare against.) If VmHWM is more than 2x VmRSS, you have a transient spike during data prep that’s hiding behind the average. fast-axolotl’s streaming reader is designed so VmHWM and VmRSS track each other closely; if you see a divergence after applying the shim, the spike is in your code, not ours.
- /proc/PID/io read_bytes. Total bytes read from disk by the process. Divide by wall-clock and compare to your disk’s measured throughput. If you’re at 90% of disk capacity, the bottleneck is I/O, and no CPU-side acceleration will move it. If you’re at 5%, decoding or transport overhead is eating the budget.
- vmstat 1 si/so columns. Swap-in and swap-out rates. The “swap kills throughput” point from a previous note has a quantitative form: any non-zero sustained si rate during training means random reads are hitting the swap device. The right reaction is to lower the working set, not to add more RAM.
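The first two signals are easy to collect programmatically. Here is a small sketch that parses the text of /proc/PID/status and /proc/PID/io; the function names are ours. Per proc(5), VmHWM is the peak resident set size, VmRSS is the current resident set size, and VmPeak is the peak virtual size. The disk throughput you compare against is something you have to measure yourself.

```python
from typing import Dict


def parse_status_kb(status_text: str) -> Dict[str, int]:
    """Extract the kB-valued fields (VmRSS, VmHWM, VmPeak, ...) from status text."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def parse_io(io_text: str) -> Dict[str, int]:
    """Extract the integer counters (read_bytes, write_bytes, ...) from io text."""
    fields = {}
    for line in io_text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            fields[key.strip()] = int(value.strip())
    return fields


def disk_utilisation(read_bytes_start: int, read_bytes_end: int,
                     elapsed_seconds: float, disk_bytes_per_sec: float) -> float:
    """Observed read rate as a fraction of the disk's measured throughput."""
    return (read_bytes_end - read_bytes_start) / elapsed_seconds / disk_bytes_per_sec


def read_proc(pid: int, name: str) -> str:
    """Read /proc/<pid>/<name> (Linux only), e.g. name='status' or name='io'."""
    with open(f"/proc/{pid}/{name}") as f:
        return f.read()


# Usage sketch (Linux):
#   status = parse_status_kb(read_proc(pid, "status"))
#   spike_ratio = status["VmHWM"] / status["VmRSS"]
#   before = parse_io(read_proc(pid, "io"))["read_bytes"]
#   ... wait elapsed seconds ...
#   after = parse_io(read_proc(pid, "io"))["read_bytes"]
#   print(disk_utilisation(before, after, elapsed, measured_disk_bytes_per_sec))
```

Take the two read_bytes samples far enough apart to cover several shard boundaries, otherwise a single burst can dominate the rate.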
These signals will tell you whether fast-axolotl is the right tool for your specific bottleneck. The 77x reader speedup is wonderful, but it only matters if your stall-fraction was reader-dominated to begin with. The measurement protocol is more important than the acceleration choice.
What we’d add to the README
Two open todos on our side:
- Per-format reader benchmarks. The current 77x number is Parquet-specific. Arrow, JSONL, CSV, and plain text all live in the same reader, and the relative speedup almost certainly varies by format and compression.
- Larger packing/padding benchmarks. The 10k-sequence number is honest but not representative of training-time sizes. We want to publish numbers at 100k, 1M, and 10M sequences so the FFI-overhead caveat in the README can be made concrete instead of relegated to a footnote.
If you’ve already run any of these on your own data and you’re willing to share the numbers, the issue tracker is open. The faster the benchmark matrix gets filled in by real workloads, the less anyone has to take a 77x on faith.
For now: read the benchmark, measure the three things, and let the matrix speak.