What does fast-axolotl actually accelerate?

Streaming dataset loading (Parquet, Arrow, JSON, JSONL, CSV, text — with ZSTD/Gzip), multi-threaded SHA256 hashing for deduplication, token packing, and batch padding. The README publishes a 77x streaming speedup and 1.9x parallel-hashing speedup on 16-core Linux. Packing and padding currently show FFI overhead at 10,000-sequence batches and are documented as such.

Do I have to change my Axolotl config?

No — the shim is drop-in. Just import fast_axolotl before axolotl. To explicitly use Rust streaming you can also set dataset_use_rust_streaming: true in your YAML and bump sequence_len.

Which Python and OS versions are tested?

Python 3.10 through 3.13, on Linux, macOS, and Windows. The repo runs CI and compatibility-tests workflows for all matrix cells.

v0.2.0 · MIT · Rust + maturin

$ import fast_axolotl
# drop-in Rust acceleration
# for Axolotl, no config diff.

fast-axolotl is a Python package that swaps Axolotl's Python data-pipeline hot paths for Rust implementations. The README publishes a 77x speedup on Parquet streaming and 1.9x on parallel SHA256 deduplication. We document the cases where the trade-off goes the other way too.

github.com/neul-labs/fast-axolotl Read the docs →

~/llm-finetune

$ uv add fast-axolotl
Resolved 1 package in 0.4s
Installed fast_axolotl v0.2.0

$ python -c "
import fast_axolotl
import axolotl
print(fast_axolotl.is_available())
"
True

$ grep -A2 rust_streaming config.yml
dataset_use_rust_streaming: true
sequence_len: 32768
dedupe: true

Auto-shimmed. No source edits to Axolotl.

streaming · 50k rows

77x

vs HuggingFace datasets baseline

sha256 · 100k rows

1.9x

multi-threaded vs hashlib loop

python wheels

3.10–3.13

linux · macos · windows

license

MIT

authored by Dipankar Sarkar

// what's accelerated

Four operations. Two clean wins.

We publish all four numbers, including the two where Rust currently loses to Python at the benchmark size. Drop-in shouldn't mean "trust us" — it should mean you can read the matrix and decide.

Operation	Measured speedup	What the README says
streaming_dataset_reader	77x	Rust-based streaming for Parquet, Arrow, JSON, JSONL, CSV, and text (with ZSTD/Gzip). Measured at 50,000 rows on Linux x86_64 with 16 CPU cores.
parallel_hash_rows	1.9x	Multi-threaded SHA256 over rows for deduplication. Measured at 100,000 rows, 16 cores.
pack_sequences	0.4x	Token packing currently shows overhead at 10,000 sequences due to FFI cost; the README notes that gains appear at larger LLM-training sizes.
pad_sequences	0.5x	Same story as packing — small-batch overhead at 10,000 sequences; the function is exposed but is not yet a measured win at this size.

System: Linux x86_64, Python 3.11.13, 16 CPU cores, 62 GB RAM (see BENCHMARK.md).
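Reading the matrix: we assume the usual convention that speedup is baseline time divided by accelerated time, so anything below 1.0x means the Rust path was slower at that size. A minimal sketch of that arithmetic using best-of-N wall-clock timing. `best_time` and `speedup` are hypothetical helper names for illustration; they are not part of fast-axolotl or of the harness behind BENCHMARK.md.

```python
import time
from typing import Callable


def best_time(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (seconds) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Speedup = baseline / accelerated; below 1.0 means the new path is slower."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return baseline_seconds / accelerated_seconds


# Hypothetical timings, for illustration only (not measured results):
# a 2.0 s baseline against a 5.0 s accelerated run reads as 0.4x.
print(f"{speedup(2.0, 5.0):.1f}x")
```

To compare on your own data, pass a zero-argument callable for each path to `best_time` and feed the two results to `speedup`.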

// integration shape

Four steps from `uv add` to faster epochs.

Install once

uv add fast-axolotl (or pip install). One PyPI package, prebuilt wheels for Linux, macOS, and Windows on Python 3.10 through 3.13.

Import before axolotl

import fast_axolotl auto-installs an acceleration shim into sys.modules. No code changes in your Axolotl config or training script.

Opt into streaming

Set dataset_use_rust_streaming: true in your Axolotl YAML for >1GB datasets or sequence_len > 10000. Deduplication uses parallel hashing automatically.

Verify it loaded

fast_axolotl.is_available() returns True when the Rust extension is linked. Call uninstall() and install() to toggle the shim per-process.
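The verification step can be wrapped in a small guard so a training script reports what it found instead of failing on import. A sketch, assuming only the `fast_axolotl.is_available()` call described above; `rust_acceleration_status` and its return strings are hypothetical names chosen for this example.

```python
import importlib.util


def rust_acceleration_status() -> str:
    """Report whether fast_axolotl is installed and its Rust extension is linked."""
    if importlib.util.find_spec("fast_axolotl") is None:
        return "not installed"
    import fast_axolotl  # import this before axolotl so the shim is in place

    if fast_axolotl.is_available():
        return "rust extension linked"
    return "installed, rust extension missing"


print(rust_acceleration_status())
```

Run it at the top of the entry point, before anything imports axolotl, so the check and the shim installation happen in the intended order.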

// API surface

Functions you'll actually call.

Every API on this list is exported from the top-level fast_axolotl module. The shim also exposes them through the standard axolotl.utils.* import paths so existing code gets accelerated without an edit.
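For readers who want to see the general mechanism, here is a minimal sketch of how a `sys.modules` shim can swap a function behind an existing import path. It illustrates the technique only and is not fast-axolotl's actual shim code; `install_shim`, `uninstall_shim`, and the `demo_pipeline_utils` module are hypothetical names invented for this example.

```python
import sys
import types


def install_shim(module_name: str, replacements: dict) -> dict:
    """Swap attributes on a module in sys.modules; return the originals."""
    module = sys.modules.get(module_name)
    if module is None:
        # Register a placeholder so a later `import module_name` finds it.
        module = types.ModuleType(module_name)
        sys.modules[module_name] = module
    originals = {}
    for name, replacement in replacements.items():
        originals[name] = getattr(module, name, None)
        setattr(module, name, replacement)
    return originals


def uninstall_shim(module_name: str, originals: dict) -> None:
    """Restore the attributes saved by install_shim."""
    module = sys.modules[module_name]
    for name, original in originals.items():
        if original is None:
            delattr(module, name)
        else:
            setattr(module, name, original)


# Hypothetical stand-in for a pure-Python utility module.
demo = types.ModuleType("demo_pipeline_utils")
demo.backend = lambda: "python"
sys.modules["demo_pipeline_utils"] = demo

saved = install_shim("demo_pipeline_utils", {"backend": lambda: "accelerated"})

import demo_pipeline_utils  # resolved from sys.modules, so it sees the swap

print(demo_pipeline_utils.backend())  # accelerated
uninstall_shim("demo_pipeline_utils", saved)
print(demo_pipeline_utils.backend())  # python
```

One consequence of this design: code that ran `from module import fn` before the swap keeps its reference to the old function, which is why the import order matters.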

Streaming data loading

streaming_dataset_reader iterates a file in fixed-size batches, never materializing the whole dataset. The format is detected from the file extension, and compressed input is decompressed transparently.

from fast_axolotl import streaming_dataset_reader

for batch in streaming_dataset_reader(
    "data.parquet",
    dataset_type="parquet",
    batch_size=1000,
    num_threads=4,
):
    process(batch)
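The idea behind fixed-size batch iteration can be shown in pure Python. This is a sketch of the concept for JSONL only, not the Rust implementation, and `iter_jsonl_batches` is a hypothetical name. It holds at most one batch of parsed rows in memory at a time.

```python
import itertools
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl_batches(path: str, batch_size: int) -> Iterator[list]:
    """Yield lists of parsed rows, holding at most `batch_size` rows in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        # Lazy generator: lines are parsed only as each batch is requested.
        rows = (json.loads(line) for line in handle if line.strip())
        while True:
            batch = list(itertools.islice(rows, batch_size))
            if not batch:
                return
            yield batch


# Demo on a small temporary file with 5 rows and batch_size=2.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as out:
        for i in range(5):
            out.write(json.dumps({"id": i}) + "\n")
    sizes = [len(batch) for batch in iter_jsonl_batches(demo_path, batch_size=2)]

print(sizes)  # [2, 2, 1]
```

The last batch is allowed to be short, so callers should not assume every batch has exactly `batch_size` rows.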

Parallel hashing & dedupe

parallel_hash_rows fans rows out across cores for SHA256. deduplicate_indices returns the unique row indices plus their new hashes, optionally filtered against a previously-seen set.

from fast_axolotl import deduplicate_indices

unique_idx, new_hashes = deduplicate_indices(
    rows,
    existing_hashes=previously_seen,
    num_threads=0,  # 0 = auto
)
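For a sense of what is being accelerated, here is a single-threaded hashlib loop of the kind the benchmark uses as its baseline. It is a sketch under assumptions: rows are strings, hashes are hex digests, the first occurrence of a duplicate wins, and the return shape mirrors the call above. `dedupe_indices_reference` is a hypothetical name, and the real function's exact types may differ.

```python
import hashlib
from typing import Iterable, List, Optional, Set, Tuple


def dedupe_indices_reference(
    rows: Iterable[str],
    existing_hashes: Optional[Set[str]] = None,
) -> Tuple[List[int], Set[str]]:
    """Keep the first index of each distinct row not already in existing_hashes."""
    seen = set(existing_hashes or ())
    unique_idx: List[int] = []
    new_hashes: Set[str] = set()
    for index, row in enumerate(rows):
        digest = hashlib.sha256(row.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate within this batch or of an earlier batch
        seen.add(digest)
        new_hashes.add(digest)
        unique_idx.append(index)
    return unique_idx, new_hashes


rows = ["alpha", "beta", "alpha", "gamma"]
previously_seen = {hashlib.sha256(b"beta").hexdigest()}
unique_idx, new_hashes = dedupe_indices_reference(rows, previously_seen)
print(unique_idx)       # [0, 3]
print(len(new_hashes))  # 2
```

Carrying `new_hashes` forward into the next call's `existing_hashes` is what lets deduplication span multiple batches.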

Token packing

pack_sequences replaces torch.cat() loops with a single Rust pass that emits input_ids, labels, and attention_mask at exactly max_length. Useful when you compose it manually; the small-batch FFI overhead is documented honestly in the benchmark.

from fast_axolotl import pack_sequences

out = pack_sequences(
    sequences=batch_ids,
    max_length=2048,
    pad_token_id=0,
    eos_token_id=2,
)
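To make the output shape concrete, here is a pure-Python sketch of greedy packing. Several details are our assumptions rather than documented behavior of pack_sequences: each sequence gets an EOS token appended, over-long sequences are truncated, a new row starts when the next sequence does not fit, and padded label positions use -100. `pack_sequences_reference` is a hypothetical name.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumption: common "ignore" label value for padded positions


def pack_sequences_reference(
    sequences: List[List[int]],
    max_length: int,
    pad_token_id: int,
    eos_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Greedily pack sequences (each followed by EOS) into rows of max_length."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    packed: List[List[int]] = []
    current: List[int] = []
    for seq in sequences:
        # Truncate so the sequence plus its EOS token fits in one row.
        tokens = list(seq[: max_length - 1]) + [eos_token_id]
        if current and len(current) + len(tokens) > max_length:
            packed.append(current)
            current = []
        current = current + tokens
    if current:
        packed.append(current)

    out: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for row in packed:
        pad = max_length - len(row)
        out["input_ids"].append(row + [pad_token_id] * pad)
        out["labels"].append(row + [IGNORE_INDEX] * pad)
        out["attention_mask"].append([1] * len(row) + [0] * pad)
    return out


out = pack_sequences_reference(
    sequences=[[5, 6, 7], [8, 9], [10, 11, 12, 13]],
    max_length=8,
    pad_token_id=0,
    eos_token_id=2,
)
print(out["input_ids"])       # [[5, 6, 7, 2, 8, 9, 2, 0], [10, 11, 12, 13, 2, 0, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 0, 0, 0]]
```

Every emitted row has exactly `max_length` entries, which is the property the description above promises.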

Batch padding

pad_sequences takes a target length (or pad_to_multiple_of for hardware alignment), supports left- and right-padding, and is a thin wrapper around the same Rust kernels. Same FFI-overhead caveat at small sizes.

from fast_axolotl import pad_sequences

pad_sequences(
    seqs,
    target_length=8,
    pad_value=0,
    padding_side="right",
)
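The padding rules are simple enough to state as a pure-Python reference. This sketch uses the parameter names from the call above plus pad_to_multiple_of. Raising on sequences longer than the target and defaulting to the longest sequence when no target is given are our assumptions, not documented behavior. `pad_sequences_reference` is a hypothetical name.

```python
from typing import List, Optional


def pad_sequences_reference(
    seqs: List[List[int]],
    target_length: Optional[int] = None,
    pad_value: int = 0,
    padding_side: str = "right",
    pad_to_multiple_of: Optional[int] = None,
) -> List[List[int]]:
    """Pad every sequence to a common length on the chosen side."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    length = target_length if target_length is not None else max(map(len, seqs), default=0)
    if pad_to_multiple_of:
        # Round up to the next multiple, e.g. 5 -> 8 when the multiple is 8.
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    padded = []
    for seq in seqs:
        if len(seq) > length:
            raise ValueError("sequence longer than target length")
        fill = [pad_value] * (length - len(seq))
        padded.append(fill + list(seq) if padding_side == "left" else list(seq) + fill)
    return padded


print(pad_sequences_reference([[1, 2, 3], [4, 5]], target_length=8))
# [[1, 2, 3, 0, 0, 0, 0, 0], [4, 5, 0, 0, 0, 0, 0, 0]]
print(pad_sequences_reference([[1, 2, 3], [4, 5]], padding_side="left", pad_to_multiple_of=4))
# [[0, 1, 2, 3], [0, 0, 4, 5]]
```

Rounding up to a multiple keeps batch shapes aligned for hardware that prefers sizes such as 8 or 64.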

// compatibility

Tested, not just shipped.

The repo runs two CI workflows: the standard build matrix and a separate compatibility-tests.yml that exercises the shim against a live Axolotl install. Both are green at the time of writing.

Read COMPATIBILITY.md →

ok Rust extension loading
ok Module shimming
ok Streaming (Parquet, JSON, CSV, Arrow)
ok Token packing
ok Parallel hashing
ok Batch padding
ok Axolotl integration

// honest comparisons

Where fast-axolotl fits.

Two side-by-sides. We only list a row when the README backs the claim on both sides — see the page itself for what's assumed. Have a specific question (Axolotl OOM, Unsloth, when it doesn't help)? The FAQ answers those directly.

fast-axolotl vs stock Axolotl

The baseline. Same training script, same configs. fast-axolotl just swaps the data-pipeline hot paths for Rust at import time.

fast-axolotl vs Unsloth

A different, broader Axolotl-acceleration approach focused on kernel fusion for training itself. We target the data pipeline; the two are largely complementary.

// explore

Everything on this site.

The full map — features, architecture, use cases, guides, comparisons, and engineering notes. Every section links straight to the page it names.

Features

The full acceleration surface — streaming reads, parallel dedup, packing, padding — with the two wins and two honest caveats.

Architecture

How importing fast_axolotl installs Rust implementations into sys.modules and swaps four hot paths across the Python↔Rust FFI boundary.

Use cases

Where drop-in acceleration pays off: fixing RAM OOM, faster data loading, high-throughput prep, and fast dataset validation in CI.

Guides

Step-by-step: install the shim into an existing Axolotl setup, fix RAM OOM with Rust streaming, and verify the speedup on your own dataset.

Compare

Honest side-by-sides against stock Axolotl (the pure-Python baseline) and Unsloth (a complementary training-phase accelerator).

Engineering notes

OOM at scale, the drop-in integration shape that rides upstream releases, and how to measure throughput on memory-bound workloads.

FAQ

Axolotl OOM (issue #2975), making data loading faster, using fast-axolotl alongside Unsloth, and when it does not help.

Glossary

Plain-language definitions: shim, FFI, streaming reader, token packing, batch padding, SHA256 dedup, wheels, hot paths, maturin.

About

The thesis behind drop-in shims, what is actually in the box, and the honest scope of the project.

$ uv add fast-axolotl

One line of dependency, one line of import, and the streaming reader on your next Parquet shuffle is Rust. Bring the benchmark back to us if it lies.

