// about
The fastest patch is the one you don't have to apply.
fast-axolotl is a Rust extension for the Axolotl LLM fine-tuning toolkit. Its job is to make the slow parts faster without asking you to change a single line of training code. You install it. You import it. The shim does the rest.
Why drop-in matters
Most accelerators ask you to fork your codebase. They publish a faster Trainer, a faster
collator, a faster DataLoader — and now you have a private branch that needs to be
reconciled every time upstream Axolotl moves. The "fast" path becomes a maintenance cost.
fast-axolotl takes the opposite shape. It installs a sys.modules shim at import time. When
your training script does from axolotl.utils.data.rust_streaming import streaming_dataset_reader,
the shim returns our Rust implementation. The Axolotl source on disk is untouched. When Axolotl
releases a new version, you upgrade Axolotl normally; the shim continues to bind the same module
paths.
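The mechanism can be illustrated with a minimal sketch. This is not fast-axolotl's source: the package name demo_toolkit, the helper install_shim, and the stand-in fast_reader are all hypothetical, chosen so the example runs without Axolotl installed. It only shows that an entry placed in sys.modules is what a later import statement resolves to.

```python
import sys
import types


def fast_reader(path):
    """Stand-in for an accelerated implementation (hypothetical)."""
    return f"fast reader opened {path}"


def install_shim(module_path, **attrs):
    """Register virtual modules in sys.modules so imports resolve to them."""
    parts = module_path.split(".")
    for i in range(1, len(parts) + 1):
        name = ".".join(parts[:i])
        if name not in sys.modules:
            module = types.ModuleType(name)
            module.__path__ = []  # mark as a package so dotted imports work
            sys.modules[name] = module
        if i > 1:
            # Expose the child as an attribute of its parent package.
            parent = sys.modules[".".join(parts[:i - 1])]
            setattr(parent, parts[i - 1], sys.modules[name])
    # Bind the replacement callables onto the leaf module.
    for key, value in attrs.items():
        setattr(sys.modules[module_path], key, value)


# "demo_toolkit" is a made-up package; nothing with that name exists on disk.
install_shim("demo_toolkit.utils.fast_io", dataset_reader=fast_reader)

# The import below is satisfied from sys.modules, not from the filesystem.
from demo_toolkit.utils.fast_io import dataset_reader

print(dataset_reader("train.jsonl"))
```

Because the import system consults sys.modules before searching the filesystem, the calling script needs no changes, which is the property the drop-in design relies on.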
What's actually accelerated
We currently ship four operations from a single Rust crate:
- Streaming data loading — Parquet, Arrow, JSON, JSONL, CSV, plain text, with transparent ZSTD and Gzip. Measured 77x faster than the HuggingFace-datasets path at 50,000 rows.
- Parallel hashing — multi-threaded SHA256 across rows for deduplication. 1.9x at 100,000 rows on a 16-core box.
- Token packing — replaces torch.cat() loops with a single Rust pass. At a 10,000-sequence benchmark, FFI overhead currently makes it slower than Python; the README documents this and notes the gain shows up on real LLM-training sizes.
- Batch padding — same story as packing at the benchmark size. The function is exposed and tested; the speedup is workload-dependent.
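For readers who want to see what two of these operations compute, here is a pure-Python reference sketch of hash-based deduplication and batch padding. The function names (reference_row_hashes, reference_dedupe_indices, reference_pad) and their signatures are hypothetical and are not the library's API; the sketch only pins down the semantics under the assumption that rows are strings and sequences are lists of token ids.

```python
import hashlib


def reference_row_hashes(rows):
    """SHA256 hex digest of each row's UTF-8 bytes."""
    return [hashlib.sha256(row.encode("utf-8")).hexdigest() for row in rows]


def reference_dedupe_indices(rows):
    """Indices of the first occurrence of each distinct row, in order."""
    seen = set()
    keep = []
    for index, digest in enumerate(reference_row_hashes(rows)):
        if digest not in seen:
            seen.add(digest)
            keep.append(index)
    return keep


def reference_pad(sequences, pad_id=0):
    """Right-pad token-id lists to the longest length and build a 1/0 mask."""
    width = max((len(seq) for seq in sequences), default=0)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return padded, mask


rows = ["hello", "world", "hello", "again"]
print(reference_dedupe_indices(rows))  # [0, 1, 3]

padded, mask = reference_pad([[5, 6, 7], [8]])
print(padded)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```

A native implementation would be expected to produce the same results while spreading the hashing across threads; the sketch makes no claim about how the Rust code is organized.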
We chose to publish the two slowdowns alongside the wins. A drop-in shim that lies in the README is a drop-in shim you eventually rip out at 3am. We'd rather you skim the benchmark table and decide which functions you actually want.
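If you want to check a function against your own workload rather than a published table, a small harness along these lines may help. It is a sketch only: pack_with_loop and pack_single_pass are pure-Python stand-ins invented for this example, not the Axolotl or fast-axolotl code paths, and the sizes are arbitrary. The point is the shape of the measurement, which times both candidates at several input sizes so that any fixed per-call cost becomes visible.

```python
import timeit


def pack_with_loop(sequences):
    """Baseline stand-in: grow the output by repeated concatenation."""
    packed = []
    for seq in sequences:
        packed = packed + seq
    return packed


def pack_single_pass(sequences):
    """Candidate stand-in: one pass that extends a single buffer in place."""
    packed = []
    for seq in sequences:
        packed.extend(seq)
    return packed


def compare(sizes, seq_len=16, repeats=3):
    """Return (n, baseline_seconds, candidate_seconds) for each size."""
    results = []
    for n in sizes:
        data = [list(range(seq_len)) for _ in range(n)]
        # Both implementations must agree before timing means anything.
        assert pack_with_loop(data) == pack_single_pass(data)
        base = min(timeit.repeat(lambda: pack_with_loop(data),
                                 number=1, repeat=repeats))
        cand = min(timeit.repeat(lambda: pack_single_pass(data),
                                 number=1, repeat=repeats))
        results.append((n, base, cand))
    return results


for n, base, cand in compare([10, 100, 1000]):
    print(f"{n:>6} sequences: baseline {base:.6f}s, candidate {cand:.6f}s")
```

Swapping the two stand-ins for the real baseline and the accelerated function, and the sizes for ones that match your datasets, gives a per-function answer for your own hardware.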
How the shim is built
fast-axolotl is one Rust crate compiled with PyO3 + maturin into a Python wheel.
Importing the package runs an install() routine that:
- Loads the native extension.
- Creates virtual modules at axolotl.utils.data.rust_streaming, axolotl.utils.data, and axolotl.utils.collators in sys.modules.
- Binds streaming_dataset_reader, fast_parallel_hash_rows, fast_deduplicate_indices, fast_pad_sequences, and fast_create_padding_mask onto those modules.
You can opt out at runtime with fast_axolotl.uninstall() and back in with
install(). is_available() returns True if the Rust extension is currently bound.
What's tested
The repo runs two CI workflows:
ci.yml (a build matrix across Linux, macOS, and Windows for Python 3.10 through 3.13) and
compatibility-tests.yml (the shim exercised against a real Axolotl install). At the time of writing
both badges are green. COMPATIBILITY.md in the repo marks each of the following as
Tested: extension loading, module shimming, all four streaming formats, packing, hashing,
padding, and end-to-end Axolotl integration.
Who builds it
fast-axolotl is authored by Dipankar Sarkar and maintained by Neul Labs. It is MIT-licensed and
lives at github.com/neul-labs/fast-axolotl.
Bug reports go to the Issues tracker; design conversations go to Discussions.
What's next
The four operations above are the shipping surface. The honest todos: closing the small-batch FFI gap on packing and padding, more streaming formats with zero-copy where possible, and a benchmark mode that exercises real-world dataset sizes (instead of the synthetic 10k-sequence numbers in the README). When those land, they land in the same wheel — the import contract does not change.