Arcwright

Arcwright is a Gemma 4 E4B-it model fine-tuned for modern Rust web and AI frameworks: Leptos, Axum, and Rig.

On RustWebBench-15, Arcwright scores 6.87 / 10 overall — beating not only its base model (Gemma 4 E4B, 5.00) but also:

  • Gemma 4 26B-A4B (6.59), the 6.5× larger model from the same family
  • Claude Haiku (6.73)
  • Gemini (5.96)
  • Qwen3-Coder 30B-A3B (5.75)

Leaderboard

Rank  Model                    Leptos  Axum  Rig   Overall
1     Arcwright                  8.40  8.28  3.92     6.87
2     Claude Haiku               7.16  8.04  5.00     6.73
3     Gemma 4 26B-A4B            7.72  8.20  3.84     6.59
4     Gemini                     6.96  7.40  3.52     5.96
5     Qwen3-Coder 30B-A3B        7.36  6.20  3.68     5.75
6     Gemma 4 E4B-it (base)      5.24  6.84  2.92     5.00
7     Qwen3 8B                   5.52  5.28  3.08     4.63
8     Qwen2.5-Coder 7B           4.28  4.68  1.64     3.53

All models evaluated on the same 15 prompts (5 per crate), judged on 5 dimensions (1-10): correctness, completeness, idiomatic, crate_knowledge, explanation.
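
The Overall column is consistent with a simple mean of the three per-crate scores, which are themselves prompt-level judge means. A minimal sketch of that aggregation (the judging scripts live in the OxideCoder repo; the function names here are illustrative, and only the crate-to-Overall step is directly confirmed by the table):

```python
# Illustrative aggregation: a prompt's score is the mean of its 5 judge
# dimensions, a crate's score is the mean over its 5 prompts, and Overall
# is the mean of the three crate scores.
def prompt_score(dims):
    """Mean of the per-dimension judge scores (each 1-10)."""
    return sum(dims) / len(dims)

def crate_score(prompts):
    """Mean prompt score over a crate's 5 eval prompts."""
    return sum(prompt_score(p) for p in prompts) / len(prompts)

def overall_score(crate_scores):
    """Mean of the per-crate scores."""
    return sum(crate_scores) / len(crate_scores)

# Reproducing Arcwright's Overall from its crate scores:
print(round(overall_score([8.40, 8.28, 3.92]), 2))  # 6.87
```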

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "y0sif/Arcwright", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("y0sif/Arcwright")

# Messages use typed content parts (Gemma multimodal chat format).
msgs = [{"role": "user", "content": [
    {"type": "text", "text": "Write a Leptos counter component with increment/decrement buttons."}
]}]
inputs = tokenizer.apply_chat_template(
    msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

A lightweight LoRA-only version is available at y0sif/Arcwright-LoRA — apply on top of unsloth/gemma-4-E4B-it.
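
With the base model already loaded, the adapter can be attached via peft. A minimal sketch, assuming the adapter repo is in standard PEFT format (repo ids are from this card; merging is optional):

```python
# Sketch: applying the LoRA-only release on top of the base model with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/gemma-4-E4B-it", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "y0sif/Arcwright-LoRA")
model = model.merge_and_unload()  # optional: fold adapter into base weights
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-4-E4B-it")
```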

Training details

  • Base: unsloth/gemma-4-E4B-it (4-bit)
  • Method: QLoRA, conservative settings (see below)
  • Dataset: y0sif/Arcwright-v4-Combined — 1,007 train / 109 test
    • Leptos: 334 curated pairs
    • Axum: 317 curated pairs
    • Rig: 158 compile-verified pairs (115 from examples/ + 43 compile-passing supplements)
    • General Rust: 307 pairs from Strandset-Rust-v1 (replay buffer to prevent catastrophic forgetting)
  • Training data pipeline: 3-gate quality pipeline — sub-agent generation → LLM judge (threshold 7.0) → cargo check compile verification. Only entries passing all three gates make it into training.
  • Hyperparameters: r=8, alpha=16, dropout=0, lr=5e-5, 1 epoch, cosine schedule, bf16, effective batch 32. See the hyperparameter rationale.
  • Hardware: Colab Pro, L4 GPU
  • Training runtime: ~15 minutes
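
The listed hyperparameters translate into a small peft/transformers configuration. A sketch with the card's values filled in; the target-module choice is left at the library default, and the 4×8 batch split is an assumption (only the effective batch of 32 is from the card):

```python
# Reported values: r=8, alpha=16, dropout=0, lr=5e-5, 1 epoch, cosine
# schedule, bf16, effective batch 32. Everything unlisted is an assumption.
from peft import LoraConfig
from transformers import TrainingArguments

lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
args = TrainingArguments(
    output_dir="arcwright-qlora",
    learning_rate=5e-5,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    bf16=True,
    per_device_train_batch_size=4,   # assumption
    gradient_accumulation_steps=8,   # 4 * 8 = effective batch 32
)
```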

Why it works

Three prior training runs (v1-v3) all regressed vs base. v4 fixed the root causes:

  1. Compile-verified data: every training entry compiles under cargo check. No hallucinated APIs.
  2. Conservative hyperparameters: low rank + low LR + single epoch — just enough drift to inject domain knowledge, not enough to overwrite base capabilities.
  3. General-Rust replay buffer: 28% of training mix is crate-agnostic Rust, preventing catastrophic forgetting.
  4. Proportional per-crate sizing: the ratio matches each crate's learnability vs the base model.
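
The compile-verification gate from the data pipeline can be sketched as a small function: drop each candidate's Rust code into a scratch crate and keep it only if cargo check exits cleanly. Names and crate layout here are illustrative, not the actual pipeline code:

```python
# Gates 2 and 3 of the quality pipeline, sketched. passes_judge mirrors the
# 7.0 LLM-judge threshold; compiles runs `cargo check` in a throwaway crate.
import subprocess
import tempfile
import textwrap
from pathlib import Path

JUDGE_THRESHOLD = 7.0  # gate 2 threshold, per the card

def passes_judge(score: float) -> bool:
    return score >= JUDGE_THRESHOLD

def compiles(rust_src: str) -> bool:
    """Gate 3: True iff the snippet passes `cargo check` in a scratch crate."""
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "src").mkdir()
        (root / "src" / "main.rs").write_text(rust_src)
        (root / "Cargo.toml").write_text(textwrap.dedent("""\
            [package]
            name = "gate"
            version = "0.1.0"
            edition = "2021"
        """))
        res = subprocess.run(["cargo", "check", "--quiet"],
                             cwd=root, capture_output=True)
        return res.returncode == 0
```

Snippets that use Leptos/Axum/Rig would additionally need those crates listed as dependencies in Cargo.toml.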

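The ~28% replay share in point 3 follows directly from the dataset counts listed under Training details, assuming it is computed over all 1,116 pairs (the per-crate counts sum to train + test):

```python
# Dataset counts from the card; general_rust is the replay buffer.
mix = {
    "leptos": 334,
    "axum": 317,
    "rig": 158,
    "general_rust": 307,
}
total = sum(mix.values())            # 1116 = 1007 train + 109 test
replay_share = mix["general_rust"] / total
print(f"{replay_share:.0%}")         # 28%
```
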
Limitations

  • Rig (AI agent framework) scores 3.92, below Claude Haiku's 5.00. Rig has the smallest training share (158 entries) and the most niche API surface; the model often uses correct import paths but invents method signatures.
  • Evaluated on n=5 per crate. Absolute scores have roughly ±0.3 judge variance.
  • Knowledge cutoff reflects the crate versions in the training data (Axum 0.8, Leptos 0.7, Rig 0.13 era).
  • Trained on full sequences (prompt + response), not completion-only — Unsloth/Gemma 4 VLM constraint.

Benchmark

Training data, eval prompts, judge rubric, and scripts: y0sif/OxideCoder.

License

Inherits Gemma Terms of Use from the base model.
