Drop-in Rust extensions: the integration shape that works for OSS Python tools

Neul Labs
rust · python · integration

There are a lot of ways to “make a Python tool faster in Rust.” Most of them require the user to do something different: install a forked package, change their import paths, edit their config. Each of those changes raises the cost of adopting the accelerator and, more importantly, the cost of maintaining it once upstream moves.

This post is about the shape fast-axolotl ended up with. It’s worth writing down, because it’s a shape that generalises to other OSS Python projects with a stable import surface, and it has a few non-obvious properties.

The five shapes you could pick

Roughly, the design space for “Rust accelerator for a Python tool” looks like this:

  1. Fork the upstream package. Maintain a tool-fast repo that ships the Rust bits inline. Users pip install tool-fast instead of tool. Reliable but very invasive: users stay on a downstream fork forever.
  2. Replace specific modules with a wheel. Distribute a wheel that shadows specific module paths. The user installs it alongside the original. Less invasive, but their config and code still reference the original tool, so you have to think hard about resolution order.
  3. Provide an alternate API. Ship tool.fast as a sibling namespace. Users opt in explicitly: from tool.fast import .... Clean separation, but maximum user friction: every existing call site has to change.
  4. Monkeypatch at import time. Ship a small Python package whose import side-effect rewires upstream’s namespaces. Users add one import. This is the “drop-in shim” shape.
  5. Plugin via upstream’s plugin system. Use whatever extension hook the project provides. Lowest friction, only works if such a hook exists and covers the surface you want to accelerate.

Axolotl, like most Python ML tools, doesn’t have a plugin API broad enough to intercept the data pipeline. That ruled out option 5 and left options 1–4 on the table. fast-axolotl picked option 4, and the reasoning was about maintenance cost, not engineering taste.

Why option 4 won

Forking (1) means we run a parallel release cadence to Axolotl. Every upstream release becomes a merge job. We’d own the responsibility of “is Axolotl 0.x.y compatible with us?” indefinitely. For an OSS accelerator that benefits from being close to upstream, that’s the worst possible incentive structure.

Module replacement at install time (2) is plausible but operationally fragile: wheel resolution order varies across pip, uv, and conda; some projects pin to specific minor versions; and the resulting failure mode (“why is my Axolotl import returning a fast-axolotl class?”) is genuinely hard to debug from the outside.

Sibling-namespace APIs (3) put all of the friction on the user. Every example, every YAML config in the wild that references axolotl.utils.data has to change. That’s not “drop-in”; that’s a migration.

The monkeypatch-at-import-time shim (4) has a property the others don’t: the upstream package is unchanged. The user installs fast-axolotl alongside the existing Axolotl install. Adding import fast_axolotl to the entry point is the entire user-facing change. When Axolotl ships 0.x.y+1, the user upgrades Axolotl normally; our shim re-applies on the next import.

What the shim actually does

From the README and docs/usage.md:

import fast_axolotl  # Auto-installs acceleration shim

# Now use axolotl normally
import axolotl

Under the hood, importing the package runs an install() routine that:

  1. Loads the Rust extension (built with PyO3 + maturin).
  2. Creates virtual modules in sys.modules that shadow specific Axolotl module paths: axolotl.utils.data.rust_streaming, axolotl.utils.data, and axolotl.utils.collators.
  3. Binds the Rust functions onto those modules: streaming_dataset_reader, fast_parallel_hash_rows, fast_deduplicate_indices, fast_pad_sequences, fast_create_padding_mask.

The critical property is import order. fast_axolotl has to be imported before axolotl, because once Axolotl’s own modules land in sys.modules from the upstream package, future imports will return the cached object. Patching after the fact is possible but brittle (it doesn’t catch references that Axolotl modules have already bound by name during their own import).
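The brittleness of late patching is ordinary Python name binding, and a few lines are enough to reproduce it. The module name upstream_helpers and its pad function below are hypothetical stand-ins, not Axolotl code.

```python
import sys
import types

# Hypothetical upstream module, registered by hand so the example is self-contained.
helpers = types.ModuleType("upstream_helpers")
helpers.pad = lambda seqs: "slow"
sys.modules["upstream_helpers"] = helpers

# A consumer that binds the function by name during its own import.
from upstream_helpers import pad as bound_early

# Patch after the fact: rebind the attribute on the module object.
helpers.pad = lambda seqs: "fast"

# A lookup through the module sees the patch; the early binding does not.
import upstream_helpers

results = (bound_early([]), upstream_helpers.pad([]))
```

The early binding keeps pointing at the original function object, which is why a shim that installs itself first has a much easier job than one that patches afterwards.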

The README is explicit about this ordering, and install() / uninstall() / is_available() give you a runtime escape hatch:

import fast_axolotl

print(fast_axolotl.is_available())  # True if Rust extension loaded
fast_axolotl.uninstall()             # Revert to pure Python
fast_axolotl.install()               # Reapply the shim

That trio is the whole shim contract. It’s tiny on purpose.

The properties that fall out

A shim of this shape gets you four properties for free:

1. Upstream-version independence. As long as the shimmed module paths exist in Axolotl, the shim works. We don’t pin to a specific Axolotl version; we track the module paths we intercept. When Axolotl renames one of those modules, we ship a patch; when they refactor an unrelated trainer, we do nothing.
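One way to track module paths rather than version numbers is to probe for them directly. The helper below is a sketch of that idea using importlib.util.find_spec; the function name missing_paths is hypothetical, and standard-library paths are used in the demo so it runs anywhere. Whether fast-axolotl performs a check like this is an assumption, not something the README states.

```python
import importlib.util


def missing_paths(paths):
    """Return the dotted module paths that cannot be located in this environment."""
    missing = []
    for path in paths:
        try:
            spec = importlib.util.find_spec(path)
        except (ImportError, ValueError):
            # A missing parent package raises ModuleNotFoundError; a module
            # registered without a spec raises ValueError.
            spec = None
        if spec is None:
            missing.append(path)
    return missing


absent = missing_paths(["json.decoder", "json.no_such_module", "no_such_pkg_xyz.mod"])
```

Note that find_spec imports the parent packages of a dotted path, so a probe like this has to run at a point where importing the upstream package is acceptable.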

2. Per-process opt-out. A test suite that wants to benchmark the pure-Python path calls uninstall(). A production trainer that wants to A/B the two paths can do it in the same script. No environment-variable gymnastics, no separate venv.
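For A/B runs in a single script, the uninstall() / install() pair wraps naturally in a context manager. The sketch below assumes only that the shim object exposes those two calls; shim_disabled and RecordingShim are hypothetical names, and RecordingShim exists purely so the example runs without the real package.

```python
from contextlib import contextmanager


@contextmanager
def shim_disabled(shim):
    """Run the enclosed block on the pure-Python path, then reapply the shim."""
    shim.uninstall()
    try:
        yield
    finally:
        # Reapply even if the baseline run raises.
        shim.install()


class RecordingShim:
    """Hypothetical stand-in exposing the same install()/uninstall() pair."""

    def __init__(self):
        self.active = True
        self.calls = []

    def install(self):
        self.active = True
        self.calls.append("install")

    def uninstall(self):
        self.active = False
        self.calls.append("uninstall")


shim = RecordingShim()
with shim_disabled(shim):
    active_inside = shim.active
```

With the real package, the same wrapper would be used as with shim_disabled(fast_axolotl): around the baseline run. Code that has already bound an accelerated function by name keeps that reference for the duration of the block, so the baseline should look functions up through the module.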

3. Honest blast radius. The shim only rewires the modules listed above. If a bug appears in some unrelated Axolotl feature, the user can confirm in one minute that fast-axolotl isn’t involved: uninstall(), rerun, see if it reproduces. We’ve seen this triage step save real time on the issue tracker.

4. Distribution stays trivial. One PyPI package, wheels for Linux + macOS + Windows, Python 3.10 through 3.13. The user doesn’t need to know the shim is a shim. From their perspective, uv add fast-axolotl and an import are the entire contract.

Where this shape generalises

The shim shape works when the upstream Python project has:

  • A reasonably stable set of module paths at the layer you want to accelerate.
  • Code paths that are import-time resolved (not dynamically loaded after init).
  • A clear separation between “API surface” (where you intercept) and “internal helpers” (which you don’t touch).

LLM training tooling, data-loading libraries, and most CLI-shaped Python packages fit this profile. UI frameworks and dynamic plugin hosts typically don’t.

The one design discipline this shape imposes on the accelerator author: publish the list of intercepted paths. The user needs to be able to read usage.md and know exactly which functions they’re now running through Rust. We list them in a table in our docs; if the table grows, it’s a doc change, not a surprise.
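A cheap way to keep that published list honest is to render the docs table from the same registry the shim installs from. The sketch below is illustrative only: the INTERCEPTED registry, its toolpkg paths, and the function names are hypothetical, not fast-axolotl’s real mapping.

```python
# Hypothetical registry of (module path, function name) pairs a shim intercepts.
INTERCEPTED = [
    ("toolpkg.utils.data", "fast_hash_rows"),
    ("toolpkg.utils.collators", "fast_pad"),
]


def render_table(rows):
    """Render the registry as a plain-text, column-aligned table."""
    header = ("Module path", "Function")
    table = [header, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(2)]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]
    # Separator row of dashes directly under the header.
    lines.insert(1, "  ".join("-" * width for width in widths))
    return "\n".join(lines)


table_text = render_table(INTERCEPTED)
```

If the docs build fails whenever the rendered table differs from the checked-in one, a newly intercepted path cannot ship without the matching doc change.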

The honest limit

The shim shape doesn’t fix every problem. There are two cases where it doesn’t help:

  • Trainer-internal kernels. Accelerating the forward pass means replacing things Axolotl imports very early — kernels, optimisers, custom loss functions. That’s the territory other accelerators (Unsloth, for example) live in. Our shim doesn’t pretend to compete there.
  • Operations that are simply not faster in Rust at your batch size. The fast-axolotl README publishes both the wins (77x streaming, 1.9x parallel hashing) and the losses (0.42x token packing, 0.53x batch padding at 10,000 sequences). The shim happily binds the slower implementations too; the user decides which functions to call.
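The published ratios can be read as a small decision table. The snippet below uses only the four numbers quoted above; the dictionary keys and the split_matrix helper are illustrative names, and the ratios describe the README’s benchmark conditions, not any particular workload.

```python
# Speedup ratios as quoted above; values above 1.0 are wins for the Rust path,
# values below 1.0 mean the Rust path was slower in that benchmark.
PUBLISHED_SPEEDUPS = {
    "streaming": 77.0,
    "parallel hashing": 1.9,
    "token packing": 0.42,
    "batch padding": 0.53,
}


def split_matrix(speedups, threshold=1.0):
    """Split operations into those faster and those not faster than baseline."""
    wins = [name for name, ratio in speedups.items() if ratio > threshold]
    losses = [name for name, ratio in speedups.items() if ratio <= threshold]
    return wins, losses


wins, losses = split_matrix(PUBLISHED_SPEEDUPS)
```

A user whose batch sizes differ from the benchmark would rerun the measurement on their own data before choosing which functions to call.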

That last point is the integration discipline a drop-in shim ultimately rests on: ship the matrix, let the user decide, and never lie about which row they’re on. For a deeper read on the API surface itself, the usage guide is the canonical reference.