// about

The fastest patch is the one you don't have to apply.

fast-axolotl is a Rust extension for the Axolotl LLM fine-tuning toolkit. Its job is to make the slow parts faster without asking you to change a single line of training code. You install it. You import it. The shim does the rest.

Why drop-in matters

Most accelerators ask you to fork your codebase. They publish a faster Trainer, a faster collator, a faster DataLoader — and now you have a private branch that needs to be reconciled every time upstream Axolotl moves. The "fast" path becomes a maintenance cost.

fast-axolotl takes the opposite approach. It installs a sys.modules shim at import time. When your training script runs from axolotl.utils.data.rust_streaming import streaming_dataset_reader, the import resolves to our Rust implementation. The Axolotl source on disk is untouched. When Axolotl releases a new version, you upgrade Axolotl normally; the shim continues to bind the same module paths.
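For readers who have not met the technique, here is a minimal sketch of how a sys.modules shim can serve an import without any file on disk. It is not fast-axolotl's actual code: the module path demo_toolkit.data.fast_reader and the function fast_reader are hypothetical names chosen for illustration, and the "native" function is a plain Python stand-in.

```python
import sys
import types

# Hypothetical module path and function, used only for illustration.
SHIM_PATH = "demo_toolkit.data.fast_reader"


def fast_reader(path):
    """Stand-in for a native implementation; returns a label only."""
    return f"native reader for {path}"


def install_shim():
    """Register a virtual module (and its parent packages) in sys.modules."""
    parts = SHIM_PATH.split(".")
    for depth in range(1, len(parts) + 1):
        name = ".".join(parts[:depth])
        if name not in sys.modules:
            module = types.ModuleType(name)
            if depth < len(parts):
                module.__path__ = []  # mark parent entries as packages
            sys.modules[name] = module
        if depth > 1:
            # Make the child reachable as an attribute of its parent.
            parent = sys.modules[".".join(parts[: depth - 1])]
            setattr(parent, parts[depth - 1], sys.modules[name])
    # Bind the replacement function onto the virtual leaf module.
    sys.modules[SHIM_PATH].fast_reader = fast_reader


install_shim()

# No demo_toolkit package exists on disk; the import is served from sys.modules.
from demo_toolkit.data.fast_reader import fast_reader as imported_reader

print(imported_reader("train.parquet"))
```

Python consults sys.modules before searching the filesystem, which is why the calling script's import line can stay exactly as written.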

What's actually accelerated

We currently ship four operations from a single Rust crate:

Streaming data loading: Parquet, Arrow, JSON, JSONL, CSV, plain text, with transparent ZSTD and Gzip decompression. Measured 77x faster than the Hugging Face datasets path at 50,000 rows.
Parallel hashing: multi-threaded SHA-256 across rows for deduplication. 1.9x faster at 100,000 rows on a 16-core box.
Token packing: replaces torch.cat() loops with a single Rust pass. In the 10,000-sequence benchmark, FFI overhead currently makes it slower than Python; the README documents this and notes that the gain shows up at real LLM-training sizes.
Batch padding: same story as packing at the benchmark size. The function is exposed and tested; the speedup is workload-dependent.
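To make the hashing and padding operations concrete, here is a pure-Python sketch of what they compute. It is a reference for the semantics only, under assumptions: the function names and signatures below are hypothetical, not the Rust extension's API, and a Python thread pool will not reproduce the multi-threaded speedup quoted above.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_row(row):
    """SHA-256 hex digest of one row's UTF-8 bytes."""
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def parallel_hash_rows(rows, workers=4):
    """Hash rows across a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hash_row, rows))


def deduplicate_indices(rows, workers=4):
    """Indices of the first occurrence of each distinct row."""
    seen = set()
    keep = []
    for index, digest in enumerate(parallel_hash_rows(rows, workers)):
        if digest not in seen:
            seen.add(digest)
            keep.append(index)
    return keep


def pad_sequences(sequences, pad_id=0):
    """Right-pad token lists to the longest length; return (padded, mask)."""
    width = max((len(s) for s in sequences), default=0)
    padded = [s + [pad_id] * (width - len(s)) for s in sequences]
    # Mask is 1 for real tokens and 0 for padding positions.
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return padded, mask


rows = ["a b c", "d e", "a b c", "f"]
print(deduplicate_indices(rows))        # [0, 1, 3]
print(pad_sequences([[5, 6, 7], [8]]))  # ([[5, 6, 7], [8, 0, 0]], [[1, 1, 1], [1, 0, 0]])
```

Both functions are small enough in Python that per-call overhead can dominate on short inputs, which is consistent with the benchmark caveat above for packing and padding.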

We chose to publish the two slowdowns alongside the wins. A drop-in shim that lies in the README is a drop-in shim you eventually rip out at 3am. We'd rather you skim the benchmark table and decide which functions you actually want.

How the shim is built

fast-axolotl is one Rust crate compiled with PyO3 + maturin into a Python wheel. Importing the package runs an install() routine that:

Loads the native extension.
Creates virtual modules at axolotl.utils.data.rust_streaming, axolotl.utils.data, and axolotl.utils.collators in sys.modules.
Binds streaming_dataset_reader, fast_parallel_hash_rows, fast_deduplicate_indices, fast_pad_sequences, and fast_create_padding_mask onto those modules.

You can opt out at runtime with fast_axolotl.uninstall() and back in with fast_axolotl.install(). fast_axolotl.is_available() returns True if the Rust extension is currently bound.
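The three steps and the opt-out toggle can be sketched as a reversible install routine. This is an illustration under stated assumptions, not fast-axolotl's source: the module paths under demo_pkg, the bound function names, and the _load_native stand-in are all hypothetical.

```python
import sys
import types

# Hypothetical module paths and function names, for illustration only.
BINDINGS = {
    "demo_pkg.data": ["hash_rows"],
    "demo_pkg.collators": ["pad_sequences"],
}

_MISSING = object()
_saved = {}  # module path -> whatever sys.modules held before install()


def _load_native():
    """Stand-in for importing a compiled extension module."""
    native = types.ModuleType("_demo_native")
    native.hash_rows = lambda rows: [hash(r) for r in rows]
    native.pad_sequences = lambda seqs: seqs
    return native


def install():
    """Load the backend, create virtual modules, bind functions onto them."""
    if _saved:
        return  # already installed
    native = _load_native()
    for path, names in BINDINGS.items():
        _saved[path] = sys.modules.get(path, _MISSING)
        module = types.ModuleType(path)
        for name in names:
            setattr(module, name, getattr(native, name))
        sys.modules[path] = module


def uninstall():
    """Restore whatever sys.modules held before install()."""
    for path, previous in _saved.items():
        if previous is _MISSING:
            sys.modules.pop(path, None)
        else:
            sys.modules[path] = previous
    _saved.clear()


def is_available():
    """True while the virtual modules are bound."""
    return bool(_saved)


install()
print(is_available())
uninstall()
print(is_available())
```

Note one limitation of this sketch: it replaces the whole sys.modules entry, so if a real module were already loaded at one of those paths, its other names would be hidden until uninstall() restores it.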

What's tested

The repo runs two CI workflows: ci.yml (a build matrix across Linux, macOS, and Windows for Python 3.10 through 3.13) and compatibility-tests.yml (the shim exercised against a real Axolotl install). At the time of writing both badges are green. COMPATIBILITY.md in the repo marks each of the following as Tested: extension loading, module shimming, the streaming formats, packing, hashing, padding, and end-to-end Axolotl integration.

Who builds it

fast-axolotl is authored by Dipankar Sarkar and maintained by Neul Labs. It is MIT-licensed and lives at github.com/neul-labs/fast-axolotl. Bug reports go to the Issues tracker; design conversations go to Discussions.

What's next

The four operations above are the shipping surface. The honest todos: closing the small-batch FFI gap on packing and padding, more streaming formats with zero-copy where possible, and a benchmark mode that exercises real-world dataset sizes (instead of the synthetic 10k-sequence numbers in the README). When those land, they land in the same wheel — the import contract does not change.