SPARK-Code · Condition A-v2 (Exec-only GRPO, full pool) · Qwen2.5-Coder-3B QLoRA
QLoRA adapter trained with execution-grounded GRPO on the full 311-problem MBPP pool over a 6-iteration schedule. Published weights are the iteration-4 checkpoint — the strongest HumanEval result in the entire SPARK-Code study (pass@1 0.816).
TL;DR
spark-code-A-3b-v2 is the scaled-up rerun of the exec-only GRPO baseline: the same recipe as spark-code-A-3b apart from kl_coeff (0.02 here, 0.01 there), run on the full 311-problem MBPP training pool with a longer 6-iteration schedule. HumanEval pass@1 peaks at 0.816 at iteration 4, the best score across all five adapters in the study, and the KL to the frozen reference stays at or below 2.4e-3 at every measured iteration (1–5). The run terminated at iteration 6 with a CUDA out-of-memory error (GPU contention, not a code fault), so no final/ adapter was auto-saved; the published weights are the iteration-4 checkpoint, chosen as the peak of the eval trajectory.
Training Setup
- Base model: Qwen/Qwen2.5-Coder-3B-Instruct
- Method: Execution-grounded GRPO. Per problem, sample a group of rollouts, score each by the fraction of unit tests it passes (penalties for syntax/runtime/timeout), normalize rewards within the group, apply a clipped PPO-style update against a frozen reference. No auxiliary SFT objective (this is the exec-only condition).
- Training data: MBPP-sanitized, 311 problems (full pool), 6 iterations intended (5 completed + eval; crash during iteration 6), K=4 adaptive rollouts (up to 8), partial per-test rewards with syntax_penalty=-0.2, runtime_penalty=-0.1, timeout_penalty=-0.3, wrong_answer_floor=0.0.
- LoRA: r=16, alpha=32, dropout=0.05, target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
- Quantization: 4-bit NF4 + double quant, bf16 compute.
- Optimizer: AdamW, lr=5e-6, grad_accum=4, clip_ratio=0.2, max_grad_norm=1.0.
- KL regularization: kl_coeff=0.02 against a frozen-reference policy (k=3 estimator, log-probs cached at rollout time).
- Auxiliary objective: none (this is Condition A).
- Seed: 42.
- Published checkpoint: condition_A/checkpoints/iter4 (the run crashed before a final/ was written; see Limitations).
Training script: run_experiment_with_mbpp_heldout.py in the GitHub repo.
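The method described above (per-rollout execution reward, within-group normalization, clipped update) can be sketched in plain Python. This is an illustrative sketch, not the code in the training script: the `ExecResult` record, the function names, and the way a flat penalty replaces the partial-credit score when the whole program fails are all assumptions. Only the penalty values and clip_ratio come from the setup list.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev

# Values taken from the training setup listed above.
SYNTAX_PENALTY = -0.2
RUNTIME_PENALTY = -0.1
TIMEOUT_PENALTY = -0.3
WRONG_ANSWER_FLOOR = 0.0
CLIP_RATIO = 0.2


@dataclass
class ExecResult:
    """Hypothetical record of how one sampled program behaved under the tests."""
    status: str            # "ok", "syntax_error", "runtime_error", or "timeout"
    tests_passed: int = 0
    tests_total: int = 0


_PENALTIES = {
    "syntax_error": SYNTAX_PENALTY,
    "runtime_error": RUNTIME_PENALTY,
    "timeout": TIMEOUT_PENALTY,
}


def rollout_reward(result: ExecResult) -> float:
    """Fraction of tests passed, or a flat penalty on failure (assumed combination)."""
    if result.status in _PENALTIES:
        return _PENALTIES[result.status]
    if result.tests_total == 0:
        return WRONG_ANSWER_FLOOR
    return max(WRONG_ANSWER_FLOOR, result.tests_passed / result.tests_total)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one problem's rollout group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new: float, logp_old: float, advantage: float,
                      clip: float = CLIP_RATIO) -> float:
    """PPO-style clipped surrogate for one token; training maximizes its mean."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# One group of K=4 rollouts for a single problem.
group = [
    ExecResult("ok", 3, 4),
    ExecResult("syntax_error"),
    ExecResult("timeout"),
    ExecResult("ok", 4, 4),
]
rewards = [rollout_reward(r) for r in group]
advantages = group_advantages(rewards)
```

Because advantages are centered within each group, a group whose rollouts all receive the same reward contributes no gradient signal in this sketch, which is one plausible motivation for sampling extra rollouts adaptively.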
Evaluation Results
HumanEval is evaluated with 5 samples per problem at temperature=0.2, top_p=0.95. Held-out MBPP uses 100 problems disjoint from the training pool with the same sampling settings. GRPO KL is the mean per-token KL from the frozen reference policy on training rollouts. Iterations 0–5 completed; iteration 6 crashed during the GRPO step before its eval.
| Iter | HumanEval pass@1 | HumanEval pass@5 | MBPP-held pass@1 | MBPP-held pass@5 | Train pass rate | GRPO KL |
|---|---|---|---|---|---|---|
| 0 | 0.796 | 0.854 | 0.634 | 0.680 | — | — |
| 1 | 0.806 | 0.872 | 0.628 | 0.680 | 0.593 | 0.0003 |
| 2 | 0.801 | 0.860 | 0.642 | 0.690 | 0.620 | 0.0007 |
| 3 | 0.793 | 0.872 | 0.618 | 0.680 | 0.633 | 0.0013 |
| 4 | 0.816 | 0.872 | 0.638 | 0.710 | 0.649 | 0.0023 |
| 5 | 0.796 | 0.854 | 0.636 | 0.690 | 0.672 | 0.0024 |
| 6 | n/a | n/a | n/a | n/a | 0.696 | n/a |
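The card does not say which pass@k estimator produced these numbers. A common choice is the unbiased estimator introduced with HumanEval, sketched below under that assumption; the per-problem correct counts in the example are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimate over the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples (out of n=5) for three problems.
counts = [5, 2, 0]
p1 = benchmark_pass_at_k(counts, n=5, k=1)
p5 = benchmark_pass_at_k(counts, n=5, k=5)
```

With n=5 samples per problem, pass@1 reduces to the mean fraction of correct samples and pass@5 to the fraction of problems with at least one correct sample.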
Trajectory. HumanEval pass@1 oscillates in a narrow band (0.793–0.816) and peaks at 0.816 at iteration 4 (+2.0 pp over the iteration-0 baseline), the highest of any adapter in the study; pass@5 reaches 0.872 at iterations 1, 3, and 4, with a dip to 0.860 at iteration 2. Held-out MBPP pass@5 also peaks at iteration 4 (0.710). Crucially, GRPO KL stays at or below 2.4e-3 at every measured iteration (1–5): exec-only GRPO shows no policy drift on the full pool, in sharp contrast to the regularized co-evolve run (C-reg2), whose KL climbed to ~0.096 over the same schedule. Mean tokens per GRPO sequence stay in the 179–186 range (no completion-length collapse). The published iteration-4 checkpoint captures the peak.
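For readers unfamiliar with the "k=3 estimator" named in the setup, the sketch below assumes it refers to the widely used k3 estimator of KL(policy || reference), computed per token from the two log-probabilities of each sampled token. The function names are hypothetical and the log-prob values are illustrative only.

```python
import math


def k3_per_token(logp_policy: float, logp_ref: float) -> float:
    """k3 estimate for one sampled token: (r - 1) - log r, with r = p_ref / p_policy."""
    log_ratio = logp_ref - logp_policy
    # expm1(x) = exp(x) - 1, which is more accurate when the two policies are close.
    return math.expm1(log_ratio) - log_ratio


def mean_k3(policy_logps: list[float], ref_logps: list[float]) -> float:
    """Mean per-token estimate over one sequence of sampled tokens."""
    terms = [k3_per_token(p, r) for p, r in zip(policy_logps, ref_logps)]
    return sum(terms) / len(terms)


# Illustrative log-probs for a three-token completion.
policy = [-0.50, -1.20, -0.05]
reference = [-0.55, -1.20, -0.10]
kl_estimate = mean_k3(policy, reference)
```

Each term is non-negative, and for a small log-prob gap d it is approximately d²/2, so values in the 1e-3 range correspond to per-token log-prob gaps of a few hundredths of a nat.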
Limitations
The training run hit torch.OutOfMemoryError during iteration 6's GRPO backward pass — the GPU was shared with another large process at the time, so this was resource contention rather than a fault in the recipe. No final/ adapter was written. The weights published here are the iteration-4 checkpoint, selected because it is both the eval peak and a fully-consistent post-iteration snapshot. Iteration 5 (pass@1 0.796) is also available in the source repo if a more-trained-but-lower checkpoint is preferred. Iteration 6 has no eval (the crash preceded it).
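The checkpoint choice described above amounts to taking the argmax of HumanEval pass@1 over the iterations that have an eval. A minimal sketch, using the values from the results table (the variable names are hypothetical):

```python
# HumanEval pass@1 per evaluated iteration, copied from the results table.
HUMANEVAL_PASS1 = {0: 0.796, 1: 0.806, 2: 0.801, 3: 0.793, 4: 0.816, 5: 0.796}

# Pick the iteration with the highest pass@1; iteration 6 has no eval, so it is absent.
best_iter = max(HUMANEVAL_PASS1, key=HUMANEVAL_PASS1.get)

# Gain over the untrained baseline (iteration 0), in percentage points.
gain_pp = round((HUMANEVAL_PASS1[best_iter] - HUMANEVAL_PASS1[0]) * 100, 1)
```

Selecting on the same benchmark that is then reported makes the peak score an optimistic estimate, which is worth keeping in mind when comparing against adapters whose published checkpoint is simply the last iteration.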
Usage
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "amarsaikhan/spark-code-A-3b-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-3B-Instruct")

# Build a chat-formatted prompt that ends with the assistant turn marker.
prompt = tok.apply_chat_template(
    [{"role": "system", "content": "You are an expert Python programmer. Return only correct Python code."},
     {"role": "user", "content": "Write a Python function is_palindrome(s) that returns True if s reads the same forwards and backwards."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Sample with the same temperature and top_p used for the evals above.
out = model.generate(**inputs, max_new_tokens=512, temperature=0.2, do_sample=True, top_p=0.95)

# Decode only the newly generated tokens, dropping the prompt.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
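The adapter was trained on a 4-bit NF4 base with double quantization and bf16 compute, so loading the base the same way may be useful on memory-constrained GPUs. The sketch below is one way to do that, assuming a CUDA GPU and the bitsandbytes package are available; the helper name `load_adapter_4bit` is hypothetical, and the card does not state that inference in 4-bit reproduces the reported scores.

```python
BASE_ID = "Qwen/Qwen2.5-Coder-3B-Instruct"
ADAPTER_ID = "amarsaikhan/spark-code-A-3b-v2"

# Quantization settings mirroring the training setup (4-bit NF4 + double quant).
BNB_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def load_adapter_4bit(base_id: str = BASE_ID, adapter_id: str = ADAPTER_ID):
    """Load the base model in 4-bit NF4 and attach the LoRA adapter (hypothetical helper)."""
    # Heavy imports live inside the function so the settings above can be
    # inspected on a machine without the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **BNB_KWARGS)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    return PeftModel.from_pretrained(base, adapter_id)
```

The returned model can be used in place of `model` in the usage snippet; the tokenizer and generation calls stay the same.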
Comparison to Other Conditions
All five adapters share the same base model and seed. The original three (A, C-reg, C-light) used a 200-problem pool over 3 iterations; the two full-pool adapters (A-v2 and C-reg2) use the 311-problem pool on a 6-iteration schedule. Each row reports that adapter's published checkpoint.
| Condition | Pool size / checkpoint iter | aux_loss_scale | kl_coeff | HumanEval pass@1 | MBPP-held pass@5 |
|---|---|---|---|---|---|
| A-v2 (exec-only, full) — this card | 311 / it 4 | 0.00 | 0.02 | 0.816 | 0.710 |
| A (exec-only) | 200 / it 3 | 0.00 | 0.01 | 0.805 | 0.690 |
| C-reg (regularized) | 200 / it 3 | 0.03 | 0.02 | 0.800 | 0.720 |
| C-light (naive) | 200 / it 3 | 0.10 | 0.01 | 0.773 | 0.680 |
| C-reg2 (regularized, full) | 311 / it 6 | 0.02 | 0.03 | 0.774 | 0.680 |
A-v2 has the strongest HumanEval pass@1 in the study (0.816) and the second-best held-out MBPP pass@5 (0.710, behind C-reg at 0.720).
Findings Summary
- The simplest method, scaled up, is still the strongest. Exec-only GRPO on the full pool produced the best HumanEval pass@1 (0.816) of any adapter — no auxiliary recycling required.
- Exec-only does not drift over the measured iterations. KL stays at or below 2.4e-3 through iteration 5. The matched-schedule regularized co-evolve run (C-reg2) drifted to KL ~0.096 and regressed on HumanEval over its six iterations, which suggests that the auxiliary objective, rather than the longer schedule, is what destabilizes the policy.
- Published checkpoint is the iteration-4 peak. The run crashed at iteration 6 (GPU OOM from contention); the weights here are iter4, the eval peak. This is a checkpoint-selection decision, not a completed-run "final."
Related Artifacts
- Sibling adapters: spark-code-A-3b · spark-code-C-light-3b · spark-code-C-reg-3b · spark-code-C-reg2-3b
- GitHub repository: https://github.com/amarsaikhanb/spark-code
- Full per-problem eval data (HumanEval and held-out MBPP JSONs, iters 0–5) lives under condition_A/eval/ in the repository
- Interactive demo Space: [SPACES_URL]
Citation
@misc{batjargal2026sparkcode,
title = {SPARK-Code: Co-Evolving Policy and Reward for Code Generation},
author = {Amarsaikhan Batjargal},
year = {2026},
}
License
The LoRA adapter weights in this repository are released under the Apache 2.0 license. The base model, Qwen/Qwen2.5-Coder-3B-Instruct, is distributed under the Tongyi Qianwen LICENSE; any downstream use must comply with its terms.