# fast-axolotl

> Drop-in Rust acceleration for the Axolotl LLM fine-tuning toolkit. A Python wheel that, when imported before axolotl, replaces four data-pipeline operations with Rust implementations: streaming dataset reading (Parquet, Arrow, JSON, JSONL, CSV, text with ZSTD/Gzip), multi-threaded SHA256 deduplication, token packing, and batch padding. MIT-licensed. Authored by Dipankar Sarkar at Neul Labs.

## What fast-axolotl is

fast-axolotl is one Rust crate compiled with PyO3 and maturin into a Python wheel. Importing the package runs an install() routine that:

1. Loads the native extension.
2. Creates virtual modules in sys.modules at axolotl.utils.data.rust_streaming, axolotl.utils.data, and axolotl.utils.collators.
3. Binds Rust functions (streaming_dataset_reader, fast_parallel_hash_rows, fast_deduplicate_indices, fast_pad_sequences, fast_create_padding_mask) onto those modules.

The upstream Axolotl package is unchanged on disk. When Axolotl ships a new version, the user upgrades Axolotl normally; the fast-axolotl shim re-applies on the next import.
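
The install steps above can be illustrated with a minimal sketch of the general sys.modules shim technique. This is not fast-axolotl's actual install() code: the `demo_pkg` module names, the `install_shim` helper, and the `fake_pad_sequences` stand-in are all hypothetical, chosen so the example runs without Axolotl or the native extension.

```python
import importlib
import sys
import types


def install_shim(module_name, **functions):
    """Register (or reuse) a module in sys.modules and bind functions onto it."""
    module = sys.modules.get(module_name)
    if module is None:
        # Create an in-memory ("virtual") module; nothing is written to disk.
        module = types.ModuleType(module_name)
        sys.modules[module_name] = module
    for name, func in functions.items():
        setattr(module, name, func)
    return module


def fake_pad_sequences(sequences, pad_value=0):
    # Hypothetical stand-in for a native function: right-pads to the longest row.
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]


# Hypothetical module path, standing in for something like axolotl.utils.collators.
install_shim("demo_pkg.collators", fast_pad_sequences=fake_pad_sequences)

# Later imports resolve to the shimmed module because sys.modules is checked first.
collators = importlib.import_module("demo_pkg.collators")
padded = collators.fast_pad_sequences([[1], [2, 3]])
```

Because the shim lives only in the running interpreter, removing it amounts to deleting the bound attributes or the sys.modules entries, which is consistent with the package files on disk staying untouched.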

## What the README claims (verbatim)

System: Linux x86_64, Python 3.11.13, 16 CPU cores, 62 GB RAM.

| Operation | Data Size | Rust | Python | Speedup |
|-----------|-----------|------|--------|---------|
| Streaming Data Loading (Parquet) | 50,000 rows | 0.0094s | 0.7237s | 77.26x |
| Parallel Hashing (SHA256) | 100,000 rows | 0.0273s | 0.0520s | 1.90x |
| Token Packing | 10,000 sequences | 0.0786s | 0.0327s | 0.42x |
| Batch Padding | 10,000 sequences | 0.1998s | 0.1051s | 0.53x |

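Two rows show a speedup below 1.0x, meaning the Rust path was slower than Python at that size: Token Packing (0.42x) and Batch Padding (0.53x), both at 10,000 sequences. The README documents this and attributes it to FFI overhead dominating at small batch sizes. The site does not invent any other numbers.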
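
As a reading aid, the Speedup column can be recomputed as Python time divided by Rust time. The sketch below uses only the rounded timings from the table; the `speedup` helper is a hypothetical name. Recomputing the streaming row from the rounded timings gives roughly 76.99x rather than the published 77.26x, presumably because the published figure was derived from unrounded measurements (an assumption, not something the README states).

```python
# (rust_seconds, python_seconds) copied from the benchmark table above.
timings = {
    "Streaming Data Loading (Parquet)": (0.0094, 0.7237),
    "Parallel Hashing (SHA256)": (0.0273, 0.0520),
    "Token Packing": (0.0786, 0.0327),
    "Batch Padding": (0.1998, 0.1051),
}


def speedup(rust_seconds, python_seconds):
    """Speedup > 1.0 means Rust was faster; < 1.0 means Rust was slower."""
    return python_seconds / rust_seconds


speedups = {name: round(speedup(r, p), 2) for name, (r, p) in timings.items()}

# Operations where the Rust path lost at the benchmark size.
slower_in_rust = sorted(name for name, value in speedups.items() if value < 1.0)
```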

## API surface (from the README and docs/usage.md)

- streaming_dataset_reader(file_path, dataset_type, batch_size, num_threads): generator that yields batches without materialising the dataset. Supports Parquet, Arrow, JSON, JSONL, CSV, text, all with ZSTD and Gzip.
- parallel_hash_rows(rows, num_threads=0): multi-threaded SHA256 across rows. 0 = auto-detect cores.
- deduplicate_indices(rows, existing_hashes=None, num_threads=0): returns (unique_indices, new_hashes).
- pack_sequences(sequences, max_length, pad_token_id, eos_token_id, label_pad_id=-100): single-pass Rust packing into fixed-length chunks.
- concatenate_and_pack(input_ids, labels, attention_masks, max_length, pad_token_id, label_pad_id): lower-level packing with explicit inputs.
- pad_sequences(sequences, target_length, pad_value, padding_side, pad_to_multiple_of): batch padding.
- list_supported_formats() / detect_format(path): introspection.
- install() / uninstall() / is_available(): shim control.
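
To make the (unique_indices, new_hashes) contract of deduplicate_indices concrete, here is a single-threaded, pure-Python reference sketch. It is an illustration of the described behaviour, not the Rust implementation: the function name `reference_deduplicate_indices` is hypothetical, and serialising each row with `str()` before hashing and returning hex digests are assumptions, since the source does not say how rows are encoded or what form the hashes take.

```python
import hashlib


def reference_deduplicate_indices(rows, existing_hashes=None):
    """Return indices of first-seen rows plus the hashes they contributed."""
    seen = set(existing_hashes or ())
    unique_indices = []
    new_hashes = []
    for index, row in enumerate(rows):
        # Assumption: rows are hashed via their string form, encoded as UTF-8.
        digest = hashlib.sha256(str(row).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # duplicate of an earlier row or of a prior batch
        seen.add(digest)
        unique_indices.append(index)
        new_hashes.append(digest)
    return unique_indices, new_hashes


first_indices, first_hashes = reference_deduplicate_indices(["a", "b", "a", "c"])

# Feeding the earlier hashes back in filters rows already seen in a prior batch.
second_indices, second_hashes = reference_deduplicate_indices(
    ["c", "d"], existing_hashes=first_hashes
)
```

Passing the hashes from one batch into the next is what allows deduplication across batches without holding every row in memory.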

## Axolotl configuration

Two YAML keys engage fast-axolotl explicitly:

- dataset_use_rust_streaming: true — use the Rust streaming reader. Auto-engages for files > 1 GB or sequence_len > 10,000.
- dedupe: true — uses parallel hashing automatically when the shim is installed.
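
The engagement rule for the streaming reader can be summarised as a small predicate. This is a sketch of the rule as stated above, not code from fast-axolotl or Axolotl: the `should_use_rust_streaming` helper is hypothetical, and treating 1 GB as 1024**3 bytes is an assumption, since the source does not say whether it means decimal or binary gigabytes.

```python
ONE_GB = 1024 ** 3  # assumption: binary gigabyte; the source only says "1 GB"
SEQUENCE_LEN_THRESHOLD = 10_000


def should_use_rust_streaming(config, file_size_bytes):
    """Hypothetical helper mirroring the documented rule for the Rust reader."""
    # An explicit opt-in in the YAML config always engages the reader.
    if config.get("dataset_use_rust_streaming"):
        return True
    # Otherwise engage automatically for large files or long sequences.
    return (
        file_size_bytes > ONE_GB
        or config.get("sequence_len", 0) > SEQUENCE_LEN_THRESHOLD
    )


explicit = should_use_rust_streaming({"dataset_use_rust_streaming": True}, 10)
large_file = should_use_rust_streaming({"sequence_len": 2048}, 2 * ONE_GB)
long_sequence = should_use_rust_streaming({"sequence_len": 32_768}, 10)
small_default = should_use_rust_streaming({"sequence_len": 2048}, 10)
```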

## Supported platforms

- Python 3.10, 3.11, 3.12, 3.13.
- Linux, macOS, Windows.
- CI matrix runs on every commit (ci.yml). Compatibility against a live Axolotl install runs as a separate workflow (compatibility-tests.yml).

## Honest scope

fast-axolotl accelerates the data pipeline (read, dedupe, pack, pad). It does not modify the trainer, the optimiser, the attention kernels, or the model code. Accelerators that target those layers (such as Unsloth) are complementary; the two can be used together.

The four operations the README publishes are the entire shipping surface. Streaming reads and parallel hashing are faster at the benchmark size; token packing and batch padding are slower at the benchmark size and, per the README footnote, become speedups only at training-scale sizes. The site reports both outcomes as measured.

## Honest comparisons

- vs stock Axolotl: Stock Axolotl uses pure-Python data pipelines (HuggingFace datasets streaming, hashlib loops, torch.cat for packing). fast-axolotl swaps the read and dedupe paths for Rust; the packing/padding swap is offered but is workload-dependent.
- vs Unsloth: Unsloth focuses on fused training kernels (attention, LoRA). fast-axolotl focuses on the data pipeline. Different layers of the stack; commonly stacked together.

## Pages

- /
- /about/
- /blog/
- /blog/oom-fails-for-large-training-datasets/
- /blog/drop-in-rust-extensions-integration-shape/
- /blog/measuring-axolotl-throughput-memory-bound/
- /compare/stock-axolotl/
- /compare/unsloth/
- /404
- /rss.xml
- /sitemap-index.xml
- https://docs.neullabs.com/fast-axolotl/ (external docs)
- https://github.com/neul-labs/fast-axolotl (source)
- https://pypi.org/project/fast-axolotl/ (PyPI)