fast-axolotl_
v0.2.0 · MIT · Rust + maturin

$ import fast_axolotl
# drop-in Rust acceleration
# for Axolotl, no config diff.

fast-axolotl is a Python package that swaps Axolotl's Python data-pipeline hot paths for Rust implementations. The README publishes a 77x speedup on Parquet streaming and 1.9x on parallel SHA256 deduplication. We also document the cases where Rust is currently slower than Python.

~/llm-finetune
$ uv add fast-axolotl
Resolved 1 package in 0.4s
Installed fast_axolotl v0.2.0

$ python -c "
import fast_axolotl
import axolotl
print(fast_axolotl.is_available())
"
True

$ grep -E "rust_streaming|sequence_len|dedupe" config.yml
dataset_use_rust_streaming: true
sequence_len: 32768
dedupe: true

Auto-shimmed. No source edits to Axolotl.

streaming · 50k rows

77x

vs HuggingFace datasets baseline

sha256 · 100k rows

1.9x

multi-threaded vs hashlib loop

python wheels

3.10 – 3.13

linux · macos · windows

license

MIT

authored by Dipankar Sarkar

// what's accelerated

Four operations. Two clean wins.

We publish all four numbers, including the two where Rust currently loses to Python at the benchmark size. Drop-in shouldn't mean "trust us" — it should mean you can read the matrix and decide.

| Operation | Measured speedup | What the README says |
| --- | --- | --- |
| streaming_dataset_reader | 77x | Rust-based streaming for Parquet, Arrow, JSON, JSONL, CSV, and text (with ZSTD/Gzip). Measured at 50,000 rows on Linux x86_64 with 16 CPU cores. |
| parallel_hash_rows | 1.9x | Multi-threaded SHA256 over rows for deduplication. Measured at 100,000 rows, 16 cores. |
| pack_sequences | 0.4x | Token packing currently shows overhead at 10,000 sequences due to FFI cost; README notes gains realize at larger LLM-training sizes. |
| pad_sequences | 0.5x | Same story as packing: small-batch overhead at 10,000 sequences; the function is exposed but is not yet a measured win at this size. |

System: Linux x86_64, Python 3.11.13, 16 CPU cores, 62 GB RAM (see BENCHMARK.md).
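
To check a ratio like the ones above on your own hardware, a small timing harness is enough. This is a minimal sketch, not the script behind BENCHMARK.md: best_of and speedup are helper names we made up, and the candidate below is the baseline itself as a placeholder, to be swapped for a call into the accelerated path you want to measure.

```python
import hashlib
import time


def best_of(fn, repeats=5):
    """Return the fastest wall-clock time of `repeats` calls to fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline, candidate, repeats=5):
    """Ratio > 1 means the candidate is faster than the baseline."""
    return best_of(baseline, repeats) / best_of(candidate, repeats)


# Same row count as the hashing benchmark quoted above.
rows = [f"row-{i}" for i in range(100_000)]


def hashlib_loop():
    """Single-threaded baseline: one SHA256 per row."""
    return [hashlib.sha256(r.encode("utf-8")).hexdigest() for r in rows]


# Placeholder candidate: pass the accelerated callable as the second
# argument to get a real comparison.
ratio = speedup(hashlib_loop, hashlib_loop)
print(f"{ratio:.2f}x")
```

Taking the best of several runs reduces noise from other processes. Rerunning with a different row count shows how the ratio moves with batch size, which is the effect behind the 0.4x and 0.5x rows.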

// integration shape

Four steps from uv add to faster epochs.

01

Install once

uv add fast-axolotl (or pip install). One PyPI package, prebuilt wheels for Linux, macOS, and Windows on Python 3.10 through 3.13.

02

Import before axolotl

import fast_axolotl auto-installs an acceleration shim into sys.modules, so run it before importing axolotl. Beyond that one import line, no edits to your training script or to Axolotl's source are needed.

03

Opt into streaming

Set dataset_use_rust_streaming: true in your Axolotl YAML for datasets larger than 1 GB or when sequence_len exceeds 10000. Deduplication uses parallel hashing automatically.

04

Verify it loaded

fast_axolotl.is_available() returns True when the Rust extension is linked. Call uninstall() and install() to toggle the shim per-process.
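
A guarded version of the check in step 04 keeps a training script usable on machines where the wheel is missing. A minimal sketch, assuming only the is_available() call named above; the rust_enabled flag is our own name.

```python
# Guarded check: fall back to plain Axolotl if fast_axolotl is absent.
try:
    import fast_axolotl  # keep this ahead of `import axolotl`

    rust_enabled = bool(fast_axolotl.is_available())
except ImportError:
    # Package not installed, so nothing has been shimmed.
    rust_enabled = False

print(f"Rust acceleration active: {rust_enabled}")
```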

// API surface

Functions you'll actually call.

Every API on this list is exported from the top-level fast_axolotl module. The shim also exposes them through the standard axolotl.utils.* import paths so existing code gets accelerated without an edit.

Streaming data loading

streaming_dataset_reader iterates a file in fixed-size batches, never materializing the whole dataset. The format is detected from the file extension, and compressed files are decompressed transparently.

from fast_axolotl import streaming_dataset_reader

for batch in streaming_dataset_reader(
    "data.parquet",
    dataset_type="parquet",
    batch_size=1000,
    num_threads=4,
):
    process(batch)
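
For intuition about what "fixed-size batches, never materializing the whole dataset" means, here is a pure-Python sketch of the same idea for JSONL. It is not the Rust reader: iter_jsonl_batches is a hypothetical helper, and it handles neither compression nor the other formats.

```python
import itertools
import json
import os
import tempfile


def iter_jsonl_batches(path, batch_size):
    """Yield lists of at most batch_size records, reading lazily."""
    with open(path, "r", encoding="utf-8") as fh:
        # Generator expression: lines are parsed only when requested.
        records = (json.loads(line) for line in fh if line.strip())
        while True:
            batch = list(itertools.islice(records, batch_size))
            if not batch:
                return
            yield batch


# Demo on a throwaway file with 2,500 rows.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for i in range(2500):
            fh.write(json.dumps({"id": i}) + "\n")
    sizes = [len(b) for b in iter_jsonl_batches(path, batch_size=1000)]

print(sizes)  # [1000, 1000, 500]
```

Only one batch is held in memory at a time, so peak memory depends on batch_size rather than on file size.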

Parallel hashing & dedupe

parallel_hash_rows fans rows out across cores for SHA256. deduplicate_indices returns the unique row indices plus their new hashes, optionally filtered against a previously-seen set.

from fast_axolotl import deduplicate_indices

unique_idx, new_hashes = deduplicate_indices(
    rows,
    existing_hashes=previously_seen,
    num_threads=0,  # 0 = auto
)
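
The contract described above can be written as a plain-Python reference, which is handy for checking results on a small sample. This is a sketch under assumptions: dedupe_reference and hash_row are our names, rows are taken to be strings, and keeping the first occurrence of each row is our guess at the semantics rather than something the README states.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_row(row: str) -> str:
    """SHA256 hex digest of one row (assumes rows are strings)."""
    return hashlib.sha256(row.encode("utf-8")).hexdigest()


def dedupe_reference(rows, existing_hashes=None, num_threads=0):
    """Return (indices of first-seen rows, their hashes)."""
    seen = set(existing_hashes or ())
    workers = num_threads or None  # None lets the executor pick a default
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hashes = list(pool.map(hash_row, rows))  # map preserves row order
    unique_idx, new_hashes = [], []
    for i, h in enumerate(hashes):
        if h not in seen:
            seen.add(h)
            unique_idx.append(i)
            new_hashes.append(h)
    return unique_idx, new_hashes


rows = ["a", "b", "a", "c"]
idx, hashes = dedupe_reference(rows, existing_hashes={hash_row("c")})
print(idx)  # [0, 1]
```

In CPython, threads give little parallelism for short rows because of the GIL, which is part of why moving the hashing loop into native code can pay off.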

Token packing

pack_sequences replaces torch.cat() loops with a single Rust pass that emits input_ids, labels, and attention_mask at exactly max_length. Useful when you compose it manually; the small-batch FFI overhead is documented honestly in the benchmark.

from fast_axolotl import pack_sequences

out = pack_sequences(
    sequences=batch_ids,
    max_length=2048,
    pad_token_id=0,
    eos_token_id=2,
)
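
To make "packing" concrete, here is a greedy reference in plain Python that produces the three outputs named above. Treat it as an illustration only: pack_reference is a hypothetical helper, the greedy strategy, the truncation of oversize sequences, and the -100 label value for padding (the usual PyTorch ignore index) are all assumptions, and the real function's return type may differ.

```python
def pack_reference(sequences, max_length, pad_token_id, eos_token_id,
                   ignore_index=-100):
    """Greedily concatenate sequences into rows of exactly max_length."""
    packs, current = [], []
    for seq in sequences:
        # Append EOS, then truncate anything longer than one row.
        item = (list(seq) + [eos_token_id])[:max_length]
        if len(current) + len(item) > max_length:
            packs.append(current)  # row is full, start a new one
            current = []
        current.extend(item)
    if current:
        packs.append(current)

    out = {"input_ids": [], "labels": [], "attention_mask": []}
    for ids in packs:
        pad = max_length - len(ids)
        out["input_ids"].append(ids + [pad_token_id] * pad)
        out["labels"].append(ids + [ignore_index] * pad)  # padding ignored in loss
        out["attention_mask"].append([1] * len(ids) + [0] * pad)
    return out


packed = pack_reference([[5, 6, 7], [8, 9], [10]], max_length=8,
                        pad_token_id=0, eos_token_id=2)
print(packed["input_ids"])  # [[5, 6, 7, 2, 8, 9, 2, 0], [10, 2, 0, 0, 0, 0, 0, 0]]
```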

Batch padding

pad_sequences takes a target length (or pad_to_multiple_of for hardware alignment), supports left- and right-padding, and is a thin wrapper around the same Rust kernels. Same FFI-overhead caveat at small sizes.

from fast_axolotl import pad_sequences

pad_sequences(
    seqs,
    target_length=8,
    pad_value=0,
    padding_side="right",
)
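
The options above map onto a few lines of plain Python, useful as a reference when comparing outputs. A sketch with assumptions: pad_reference is our name, truncating sequences longer than the target is our choice, and how the real function combines target_length with pad_to_multiple_of is not something the text above specifies.

```python
def pad_reference(seqs, target_length=None, pad_value=0,
                  padding_side="right", pad_to_multiple_of=None):
    """Pad every sequence to one common length."""
    if padding_side not in ("left", "right"):
        raise ValueError("padding_side must be 'left' or 'right'")
    length = target_length
    if length is None:
        length = max((len(s) for s in seqs), default=0)
    if pad_to_multiple_of:
        # Round up to the next multiple, e.g. 9 -> 16 for a multiple of 8.
        length = -(-length // pad_to_multiple_of) * pad_to_multiple_of
    padded = []
    for s in seqs:
        s = list(s)[:length]  # assumption: overlong sequences are truncated
        fill = [pad_value] * (length - len(s))
        padded.append(fill + s if padding_side == "left" else s + fill)
    return padded


print(pad_reference([[1, 2, 3], [4]], target_length=8))
# [[1, 2, 3, 0, 0, 0, 0, 0], [4, 0, 0, 0, 0, 0, 0, 0]]
print(pad_reference([[1, 2, 3], [4]], padding_side="left"))
# [[1, 2, 3], [0, 0, 4]]
```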

// compatibility

Tested, not just shipped.

The repo runs two CI workflows: the standard build matrix and a separate compatibility-tests.yml that exercises the shim against a live Axolotl install. Both are green at the time of writing.

Read COMPATIBILITY.md →
  • ok Rust extension loading
  • ok Module shimming
  • ok Streaming (Parquet, JSON, CSV, Arrow)
  • ok Token packing
  • ok Parallel hashing
  • ok Batch padding
  • ok Axolotl integration

$ uv add fast-axolotl

One line of dependency, one line of import, and the streaming reader on your next Parquet shuffle is Rust. Bring the benchmark back to us if it lies.