fast-axolotl is a Python package that swaps Axolotl's Python data-pipeline hot paths for Rust implementations. The README reports a 77x speedup on Parquet streaming and 1.9x on parallel SHA256 deduplication. We also document the cases where the trade-off goes the other way.
$ uv add fast-axolotl
Resolved 1 package in 0.4s
Installed fast_axolotl v0.2.0
$ python -c "
import fast_axolotl
import axolotl
print(fast_axolotl.is_available())
"
True
$ grep -E "rust_streaming|sequence_len|dedupe" config.yml
dataset_use_rust_streaming: true
sequence_len: 32768
dedupe: true
Auto-shimmed. No source edits to Axolotl.
streaming · 50k rows: 77x vs HuggingFace datasets baseline
sha256 · 100k rows: 1.9x, multi-threaded vs hashlib loop
python wheels: 3.10–3.13 on linux · macos · windows
license: MIT, authored by Dipankar Sarkar
// what's accelerated
We publish all four numbers, including the two where Rust currently loses to Python at the benchmark size. Drop-in shouldn't mean "trust us" — it should mean you can read the matrix and decide.
| Operation | Measured speedup | What the README says |
|---|---|---|
| streaming_dataset_reader | 77x | Rust-based streaming for Parquet, Arrow, JSON, JSONL, CSV, and text (with ZSTD/Gzip). Measured at 50,000 rows on Linux x86_64 with 16 CPU cores. |
| parallel_hash_rows | 1.9x | Multi-threaded SHA256 over rows for deduplication. Measured at 100,000 rows, 16 cores. |
| pack_sequences | 0.4x | Slower than Python at this size: token packing shows overhead at 10,000 sequences due to FFI cost; the README notes that gains appear at larger, LLM-training-scale sizes. |
| pad_sequences | 0.5x | Same story as packing: small-batch overhead at 10,000 sequences; the function is exposed but is not yet a measured win at this size. |
System: Linux x86_64, Python 3.11.13, 16 CPU cores, 62 GB RAM (see BENCHMARK.md).
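
The speedup column reads as a ratio of wall-clock times, so a value below 1x means the Rust path was slower at that size. Here is a minimal sketch of how such a ratio could be computed for any pair of callables. The helper names `best_time` and `speedup` are hypothetical, and this is not the harness behind BENCHMARK.md.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time of fn() over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, candidate_seconds):
    """Ratio > 1 means the candidate is faster; < 1 means it is slower."""
    return baseline_seconds / candidate_seconds


# Toy arithmetic in the table's convention: a candidate that takes
# 2.5x as long as the baseline is reported as 0.4x.
print(f"{speedup(1.0, 2.5):.1f}x")  # 0.4x
```

Timing both paths at several input sizes, rather than one, is what would reveal where the FFI overhead stops dominating.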
// integration shape
uv add to faster epochs.
01
Install once
uv add fast-axolotl (or pip install). One PyPI package, prebuilt wheels for Linux, macOS, and Windows on Python 3.10 through 3.13.
02
import fast_axolotl auto-installs an acceleration shim into sys.modules. No code changes in your Axolotl config or training script.
03
Set dataset_use_rust_streaming: true in your Axolotl YAML for >1GB datasets or sequence_len > 10000. Deduplication uses parallel hashing automatically.
04
fast_axolotl.is_available() returns True when the Rust extension is linked. Call uninstall() and install() to toggle the shim per-process.
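
A minimal sketch of a guarded import built on step 04. It assumes `is_available()`, `install()`, and `uninstall()` are top-level functions of `fast_axolotl`, as the step implies; the helper name `load_fast_axolotl` is hypothetical. The code falls back cleanly when the package is not installed.

```python
import importlib
import importlib.util


def load_fast_axolotl():
    """Return the fast_axolotl module if it is installed and its Rust
    extension is linked, otherwise None."""
    if importlib.util.find_spec("fast_axolotl") is None:
        return None  # package not installed; stock Python paths are used
    module = importlib.import_module("fast_axolotl")  # import installs the shim
    return module if module.is_available() else None


fa = load_fast_axolotl()
if fa is None:
    print("fast-axolotl not active; using stock Python paths")
else:
    # Assumed top-level toggles, per step 04.
    fa.uninstall()  # remove the shim for this process
    fa.install()    # and put it back
```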
// API surface
Every API on this list is exported from the top-level fast_axolotl module. The shim
also exposes them through the standard axolotl.utils.* import paths so existing code
gets accelerated without an edit.
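
To make the shim idea concrete, here is a toy illustration of rebinding a function on a module already registered in `sys.modules`. It is not fast-axolotl's actual implementation; `toy_pkg`, `slow_pad`, and `fast_pad` are hypothetical names.

```python
import importlib
import sys
import types


def slow_pad(seqs):
    return "python"


def fast_pad(seqs):
    return "accelerated"


# Build a throwaway package in memory (hypothetical names, illustration only).
pkg = types.ModuleType("toy_pkg")
pkg.__path__ = []  # mark it as a package
utils = types.ModuleType("toy_pkg.utils")
utils.pad = slow_pad
pkg.utils = utils
sys.modules["toy_pkg"] = pkg
sys.modules["toy_pkg.utils"] = utils


def install_shim():
    """Rebind the attribute on the module object held in sys.modules."""
    sys.modules["toy_pkg.utils"].pad = fast_pad


def uninstall_shim():
    sys.modules["toy_pkg.utils"].pad = slow_pad


install_shim()
mod = importlib.import_module("toy_pkg.utils")  # the existing import path
print(mod.pad([]))  # accelerated
uninstall_shim()
print(mod.pad([]))  # python
```

One caveat of this general pattern: code that ran `from toy_pkg.utils import pad` before the shim was installed keeps its reference to the original function, so import order matters.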
streaming_dataset_reader iterates a file in fixed-size batches, never materializing the whole dataset. Format-detected by extension and decompressed transparently.
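
The fixed-size batching idea can be sketched in plain Python. This is a reference for the concept only, not the Rust reader; `iter_batches` is a hypothetical helper, and it handles JSONL lines alone.

```python
import json
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of parsed JSONL rows, at most batch_size rows at a time,
    without ever holding the whole input in memory."""
    it = iter(lines)
    while True:
        batch = [json.loads(line) for line in islice(it, batch_size)]
        if not batch:
            return
        yield batch


rows = ['{"id": %d}' % i for i in range(5)]
for batch in iter_batches(rows, batch_size=2):
    print([r["id"] for r in batch])
# [0, 1]
# [2, 3]
# [4]
```

Passing an open file handle instead of a list gives the same lazy behaviour, since file objects iterate line by line.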
from fast_axolotl import streaming_dataset_reader
for batch in streaming_dataset_reader(
    "data.parquet",
    dataset_type="parquet",
    batch_size=1000,
    num_threads=4,
):
    process(batch)

parallel_hash_rows fans rows out across cores for SHA256. deduplicate_indices returns the unique row indices plus their new hashes, optionally filtered against a previously-seen set.
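
A single-threaded reference for those semantics, as a sketch under assumptions. It assumes the first occurrence of a row wins and that rows are JSON-serializable; the library's row serialization is not documented here and may differ. `dedupe_reference` is a hypothetical name.

```python
import hashlib
import json


def dedupe_reference(rows, existing_hashes=None):
    """Return (indices of first-seen rows, their SHA256 hex digests),
    skipping any row whose digest is already in existing_hashes."""
    seen = set(existing_hashes or ())
    unique_idx, new_hashes = [], []
    for i, row in enumerate(rows):
        # Assumption: sort_keys gives a stable byte string per row.
        payload = json.dumps(row, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_idx.append(i)
            new_hashes.append(digest)
    return unique_idx, new_hashes


rows = [{"text": "a"}, {"text": "b"}, {"text": "a"}]
idx, hashes = dedupe_reference(rows)
print(idx)  # [0, 1]

# Filtering against a previously seen set drops rows hashed earlier.
idx2, _ = dedupe_reference(rows, existing_hashes=hashes[:1])
print(idx2)  # [1]
```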
from fast_axolotl import deduplicate_indices
unique_idx, new_hashes = deduplicate_indices(
    rows,
    existing_hashes=previously_seen,
    num_threads=0,  # 0 = auto
)

pack_sequences replaces torch.cat() loops with a single Rust pass that emits input_ids, labels, and attention_mask at exactly max_length. Useful when you compose it manually; the small-batch FFI overhead is documented honestly in the benchmark.
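
For readers new to packing, here is a greedy pure-Python sketch of the general technique. The policy shown is an assumption and not confirmed behaviour of the Rust kernel: it appends EOS after each sequence, starts a new row when the next piece would not fit, and masks padding in labels with -100. `pack_reference` is a hypothetical name.

```python
def pack_reference(sequences, max_length, pad_token_id, eos_token_id,
                   ignore_index=-100):
    """Greedily pack token lists into rows of exactly max_length."""
    packed, current = [], []
    for seq in sequences:
        # Truncate so the piece plus its EOS always fits in one row.
        piece = list(seq[: max_length - 1]) + [eos_token_id]
        if current and len(current) + len(piece) > max_length:
            packed.append(current)
            current = []
        current = current + piece
    if current:
        packed.append(current)

    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for row in packed:
        pad = max_length - len(row)
        out["input_ids"].append(row + [pad_token_id] * pad)
        # Assumption: padded positions are excluded from the loss.
        out["labels"].append(row + [ignore_index] * pad)
        out["attention_mask"].append([1] * len(row) + [0] * pad)
    return out


out = pack_reference([[5, 6], [7], [8, 9, 10]],
                     max_length=6, pad_token_id=0, eos_token_id=2)
print(out["input_ids"])  # [[5, 6, 2, 7, 2, 0], [8, 9, 10, 2, 0, 0]]
```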
from fast_axolotl import pack_sequences
out = pack_sequences(
    sequences=batch_ids,
    max_length=2048,
    pad_token_id=0,
    eos_token_id=2,
)

pad_sequences takes a target length (or pad_to_multiple_of for hardware alignment), supports left- and right-padding, and is a thin wrapper around the same Rust kernels. Same FFI-overhead caveat at small sizes.
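
The padding options can be sketched in pure Python. `pad_reference` is a hypothetical stand-in, not the library function; falling back to the longest sequence when no target is given, and truncating over-long sequences, are both assumptions.

```python
def pad_reference(seqs, target_length=None, pad_value=0,
                  padding_side="right", pad_to_multiple_of=None):
    """Pad every sequence to one common length."""
    if target_length is not None:
        length = target_length
    else:
        length = max(len(s) for s in seqs)  # assumption: default to longest
    if pad_to_multiple_of:
        # Round up to the next multiple, e.g. 3 -> 4 for a multiple of 4.
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    padded = []
    for s in seqs:
        s = list(s)[:length]  # assumption: over-long sequences are truncated
        fill = [pad_value] * (length - len(s))
        padded.append(fill + s if padding_side == "left" else s + fill)
    return padded


print(pad_reference([[1, 2, 3], [4]], target_length=5))
# [[1, 2, 3, 0, 0], [4, 0, 0, 0, 0]]
print(pad_reference([[1, 2, 3], [4]], pad_to_multiple_of=4, padding_side="left"))
# [[0, 1, 2, 3], [0, 0, 0, 4]]
```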
from fast_axolotl import pad_sequences
pad_sequences(
    seqs,
    target_length=8,
    pad_value=0,
    padding_side="right",
)

// compatibility
The repo runs two CI workflows: the standard build matrix and a separate
compatibility-tests.yml that exercises the shim against a
live Axolotl install. Both are green at the time of writing.
// honest comparisons
Two side-by-sides. We only list a row when the README backs the claim on both sides — see the page itself for what's assumed.
The baseline. Same training script, same configs. fast-axolotl just swaps the data-pipeline hot paths for Rust at import time.
A different, broader Axolotl-acceleration approach focused on kernel fusion for training itself. We target the data pipeline; the two are largely complementary.
One line of dependency, one line of import, and the streaming reader on your next Parquet shuffle is Rust. Bring the benchmark back to us if it lies.