---
title: "I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked"
emoji: "🛠️"
colorFrom: green
colorTo: yellow
sdk: static
pinned: false
license: mit
tags:
- code-generation
- fine-tuning
- mlx
- lora
- laravel
- php
- apple-silicon
- experience-report
---
# I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked
**Florinel Chis** · March 2026
---
Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.
**The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.
Here's how I actually got there.
## The Hardware
- Apple M2 Pro, 16GB unified memory
- Qwen2.5-Coder-7B-Instruct, 4-bit quantized
- MLX framework with LoRA
- Target: Laravel 13.x PHP code generation
The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes.
## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)
I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect.
Except the generated code wasn't getting better. At all.
### The Root Cause
The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**.
MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).
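In numbers, a sketch of the failure mode (figures from above; not real tokenizer output):
```python
# Illustrative arithmetic only -- the real counts come from tokenizing each example.
system_prompt_tokens = 2380  # the bloated guidelines
max_seq_length = 1500        # the training window
completion_tokens_kept = max(0, max_seq_length - system_prompt_tokens)
print(completion_tokens_kept)  # 0 -- the loss never saw a single code token
```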
**Six sprints. Hundreds of examples. Zero code learning.**
### The Fix
```python
# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""
# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""
```
And the verification I should have done from the start:
```python
# Check that completions aren't truncated
for example in dataset:
    tokens = tokenizer.encode(example["text"])
    assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens"
```
**Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.**
## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)
After fixing the truncation bug, real training started. `val_loss`: 0.080 (not 0.000!).
I discovered that **every systematic bug can be fixed with 10-15 targeted examples**:
| Bug | Examples needed | Result |
|-----|:-:|---|
| `'optional'` validation rule (not a Laravel rule) | 10 | Fixed — generates `'nullable'` |
| `wasRecentlyCreated` in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing `HasFactory` trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |
The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.
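For illustration, a targeted example for the `'optional'` bug, written in mlx-lm's chat-format JSONL. The content here is hypothetical, not copied from the actual dataset (linked at the end):
```python
import json

# Hypothetical targeted example: the prompt says "optional", the gold
# completion uses 'nullable' -- the pattern we want the model to prefer.
example = {
    "messages": [
        {"role": "system", "content": "Laravel 13.x code generator. Output ONLY PHP."},
        {"role": "user", "content": "Create StoreEventRequest; venue_id is optional"},
        {"role": "assistant", "content": "<?php ... 'venue_id' => ['nullable', 'integer'], ..."},
    ]
}

# mlx-lm's chat format expects one {"messages": [...]} object per JSONL line.
with open("data/train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```
Vary the surrounding context (class names, fields, rule combinations) across the 10-15 copies so the model learns the rule, not the example.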
### The Eval Script Trap
I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring.
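A minimal reproduction, with a word-boundary anchor as one possible fix (the original checker's exact pattern is paraphrased here, so treat this as an assumption):
```python
import re

sig = "public function store(StoreBookRequest $request)"

naive = re.compile(r"Request \$request")    # bare substring match
fixed = re.compile(r"\bRequest \$request")  # \b rejects ...BookRequest

print(bool(naive.search(sig)))  # True  -- false positive on a form request
print(bool(fixed.search(sig)))  # False -- form requests no longer flagged
print(bool(fixed.search("public function index(Request $request)")))  # True
```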
**Test your eval script on correct code before trusting it.**
### Where I Hit the Wall
After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were **semantic hallucinations**:
- Model invents a `user()` relationship that doesn't exist
- Controller uses closure-based eager loading when the array format is correct
- Model generates `->withHttpStatus()` — a method that doesn't exist
Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.
## Phase 3: The Spec Pivot (The Real Breakthrough)
Instead of natural language:
> "Create a Post model with author relationship, fillable title and body, soft deletes"
I switched to structured JSON specs:
```json
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
```
### First test: 28 examples, 100 iterations
Result: **26/26 eval perfect. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)
The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.
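For orientation, generating from a spec with mlx-lm looks roughly like this. The model repo, file paths, and system prompt below are illustrative; the real pipeline is in the repo linked at the end:
```python
import json
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",  # assumed 4-bit base repo
    adapter_path="./adapters_spec_v4",               # trained LoRA adapter
)

with open("post_spec.json") as f:  # the JSON spec shown above
    spec = json.load(f)

messages = [
    {"role": "system", "content": "Laravel 13.x code generator. Output ONLY PHP."},
    {"role": "user", "content": json.dumps(spec)},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
php = generate(model, tokenizer, prompt=prompt, max_tokens=1024)
```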
### The Spec Compiler
I built a compiler that validates specs before generation:
```
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
```
Validation: <1ms. Generation: ~30s per file. Catch errors early.
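The compiler itself isn't reproduced in this post, but a sketch of the kind of check behind that error (token list and spec structure assumed) looks like:
```python
class SpecCompileError(Exception):
    pass

# Assumed token list -- mirrors the error message above, not the real compiler.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put", "required_on_patch"}

def validate_rules(spec: dict) -> None:
    """Reject conditional tokens smuggled into plain validation rules."""
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            if rule in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {rule!r}. "
                    "Use 'conditional_rules' dict instead."
                )
```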
### Final Results: adapters_spec_v4
| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|--------|:---:|:---:|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | **20/20** |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | **0** |
| Training time | ~30 min | ~15 min |
## The Debugging Checklist
Distilled from 34 steps of hitting walls:
**Before training:**
1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`.
2. Check `min(completion_tokens) > 0`. If it's zero, the system prompt is too long (see the sketch after this checklist).
3. Close all GPU-using processes. Check memory with `vm_stat`.
4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.
**After training:**
5. If `val_loss = 0.000`: training is broken, not perfect.
6. Generate 3-5 test files and inspect manually before full benchmark.
7. Run `php -l` on all output (syntax check).
**When bugs persist:**
8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.
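A pre-flight sketch covering checks 1 and 2, assuming a chat-format JSONL dataset and an HF-style tokenizer (mlx-lm's tokenizer wrapper exposes `apply_chat_template`); the counts are approximate, since mlx-lm's own masking may differ slightly:
```python
import json

def preflight(path: str, tokenizer, max_seq_length: int) -> None:
    """Checks 1-2: every example fits the window, every completion survives."""
    with open(path) as f:
        for i, line in enumerate(f):
            msgs = json.loads(line)["messages"]  # last message = completion
            total = len(tokenizer.apply_chat_template(msgs))
            prompt = len(tokenizer.apply_chat_template(
                msgs[:-1], add_generation_prompt=True))
            assert total <= max_seq_length, f"example {i}: {total} tokens > window"
            assert prompt < max_seq_length, f"example {i}: completion fully truncated"
```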
## What 7B Models Do Well vs Poorly
**Does well:**
- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model
**Does poorly:**
- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal
**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.
## Try It Yourself
Everything is open source:
- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)
```bash
pip install mlx-lm
# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"
# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
```
Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.
---
*This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.*