# Arcwright
Arcwright is a Gemma 4 E4B-it model fine-tuned for modern Rust web and AI frameworks: Leptos, Axum, and Rig.
On RustWebBench-15, Arcwright scores 6.87 / 10 overall — beating not only its base model (Gemma 4 E4B, 5.00) but also:
- Gemma 4 26B-A4B (6.59), the 6.5× larger model from the same family
- Claude Haiku (6.73)
- Gemini (5.96)
- Qwen3-Coder 30B-A3B (5.75)
## Leaderboard
| Rank | Model | Leptos | Axum | Rig | Overall |
|---|---|---|---|---|---|
| 1 | Arcwright | 8.40 | 8.28 | 3.92 | 6.87 |
| 2 | Claude Haiku | 7.16 | 8.04 | 5.00 | 6.73 |
| 3 | Gemma 4 26B-A4B | 7.72 | 8.20 | 3.84 | 6.59 |
| 4 | Gemini | 6.96 | 7.40 | 3.52 | 5.96 |
| 5 | Qwen3-Coder 30B-A3B | 7.36 | 6.20 | 3.68 | 5.75 |
| 6 | Gemma 4 E4B-it (base) | 5.24 | 6.84 | 2.92 | 5.00 |
| 7 | Qwen3 8B | 5.52 | 5.28 | 3.08 | 4.63 |
| 8 | Qwen2.5-Coder 7B | 4.28 | 4.68 | 1.64 | 3.53 |
All models were evaluated on the same 15 prompts (5 per crate) and judged on 5 dimensions (1-10): `correctness`, `completeness`, `idiomatic`, `crate_knowledge`, `explanation`.
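The Overall column appears to be the unweighted mean of the three per-crate scores. A minimal sketch of that aggregation, assuming a plain average (the averaging rule is inferred from the table, not stated explicitly):

```python
# Sketch: Overall as the unweighted mean of the per-crate scores.
# The plain-average rule is an inference from the leaderboard numbers.
def overall(per_crate: dict) -> float:
    return round(sum(per_crate.values()) / len(per_crate), 2)

print(overall({"Leptos": 8.40, "Axum": 8.28, "Rig": 3.92}))  # 6.87 (Arcwright row)
print(overall({"Leptos": 7.16, "Axum": 8.04, "Rig": 5.00}))  # 6.73 (Claude Haiku row)
```

Both printed values match the leaderboard, which supports the plain-average reading.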
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "y0sif/Arcwright", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("y0sif/Arcwright")

msgs = [{"role": "user", "content": [
    {"type": "text", "text": "Write a Leptos counter component with increment/decrement buttons."}
]}]

inputs = tokenizer.apply_chat_template(
    msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```
A lightweight LoRA-only version is available at `y0sif/Arcwright-LoRA`; apply it on top of `unsloth/gemma-4-E4B-it`.
## Training details

- Base: `unsloth/gemma-4-E4B-it` (4-bit)
- Method: QLoRA, conservative settings (see below)
- Dataset: `y0sif/Arcwright-v4-Combined` (1,007 train / 109 test)
  - Leptos: 334 curated pairs
  - Axum: 317 curated pairs
  - Rig: 158 compile-verified pairs (115 from `examples/` + 43 compile-passing supplements)
  - General Rust: 307 pairs from Strandset-Rust-v1 (replay buffer to prevent catastrophic forgetting)
- Training data pipeline: a 3-gate quality pipeline: sub-agent generation → LLM judge (threshold 7.0) → `cargo check` compile verification. Only entries passing all three gates make it into training.
- Hyperparameters: r=8, alpha=16, dropout=0, lr=5e-5, 1 epoch, cosine schedule, bf16, effective batch 32. See the hyperparameter rationale.
- Hardware: Colab Pro, L4 GPU
- Training runtime: ~15 minutes
## Why it works

Three prior training runs (v1-v3) all regressed vs. base. v4 fixed the root causes:

- Compile-verified data: every training entry compiles under `cargo check`. No hallucinated APIs.
- Conservative hyperparameters: low rank + low LR + a single epoch, just enough drift to inject domain knowledge without overwriting base capabilities.
- General-Rust replay buffer: 28% of the training mix is crate-agnostic Rust, preventing catastrophic forgetting.
- Proportional per-crate sizing: the ratio matches each crate's learnability vs. the base model.
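The 28% replay-buffer figure follows directly from the per-crate counts in the dataset. A quick arithmetic check (counts from this card; the share computation itself is just an illustration):

```python
# Dataset mix shares derived from the per-crate counts listed in the card.
counts = {"Leptos": 334, "Axum": 317, "Rig": 158, "General Rust": 307}
total = sum(counts.values())  # 1,116 = 1,007 train + 109 test

shares = {name: round(100 * n / total) for name, n in counts.items()}
print(shares["General Rust"])  # 28 (% crate-agnostic replay buffer)
```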
## Limitations

- Rig (AI agent framework) scores 3.92, below Claude Haiku (5.00). Rig has the smallest training share (158 entries) and its API surface is the most niche; the model often uses the correct import paths but invents method signatures.
- Evaluated on n=5 per crate. Absolute scores have roughly ±0.3 judge variance.
- Knowledge cutoff reflects the crate versions in the training data (Axum 0.8, Leptos 0.7, Rig 0.13 era).
- Trained on full sequences (prompt + response), not completion-only, due to an Unsloth/Gemma 4 VLM constraint.
## Benchmark

Training data, eval prompts, judge rubric, and scripts: `y0sif/OxideCoder`.
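As a rough illustration of the rubric, a per-response score could be the mean of the five judge dimensions named earlier. The plain-mean aggregation here is an assumption; only the dimension names come from this card:

```python
# Judge dimensions named in the card; the mean aggregation is an assumption.
DIMS = ["correctness", "completeness", "idiomatic", "crate_knowledge", "explanation"]

def response_score(ratings: dict) -> float:
    """Average the five 1-10 judge dimensions into a single score."""
    return sum(ratings[d] for d in DIMS) / len(DIMS)

print(response_score({"correctness": 9, "completeness": 8, "idiomatic": 8,
                      "crate_knowledge": 7, "explanation": 8}))  # 8.0
```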
## License
Inherits Gemma Terms of Use from the base model.