---
title: "I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked"
emoji: "🛠️"
colorFrom: green
colorTo: yellow
sdk: static
pinned: false
license: mit
tags:
- code-generation
- fine-tuning
- mlx
- lora
- laravel
- php
- apple-silicon
- experience-report
---
# I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked
**Florinel Chis** · March 2026
---
Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.
**The end result:** A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.
Here's how I actually got there.
## The Hardware
- Apple M2 Pro, 16GB unified memory
- Qwen2.5-Coder-7B-Instruct, 4-bit quantized
- MLX framework with LoRA
- Target: Laravel 13.x PHP code generation
The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with `max_seq_length=4096`. You close LM Studio before training or your machine crashes.
## Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)
I started with 90 training examples and grew to 261 across 6 sprints. `val_loss` kept dropping. By Sprint 6, it hit **0.000**. Perfect.
Except the generated code wasn't getting better. At all.
### The Root Cause
The system prompt (guidelines for the model) had grown organically across sprints to **2,380 tokens**. My `max_seq_length` was **1,500**.
MLX truncates training examples silently at `max_seq_length`. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).
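In numbers, a sketch of the failure mode (figures from above; not real tokenizer output):
```python
# Illustrative arithmetic only -- the real counts come from tokenizing each example.
system_prompt_tokens = 2380  # the bloated guidelines
max_seq_length = 1500        # the training window
completion_tokens_kept = max(0, max_seq_length - system_prompt_tokens)
print(completion_tokens_kept)  # 0 -- the loss never saw a single code token
```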
**Six sprints. Hundreds of examples. Zero code learning.**
### The Fix
```python
# BEFORE: 2380 tokens of verbose guidelines
SYSTEM = """You are an expert Laravel developer. When writing models,
always use the HasFactory trait. The HasFactory trait enables...
[2380 tokens of examples and explanations]"""
# AFTER: 843 tokens, compressed
SYSTEM = """Laravel 13.x code generator. Output ONLY PHP.
- model: use HasFactory, add relationships from spec
- controller: import Controller, destroy() returns noContent()
..."""
```
And the verification I should have done from the start:
```python
# Check that completions aren't truncated
for example in dataset:
    tokens = tokenizer.encode(example["text"])
    assert len(tokens) < max_seq_length, f"Truncated at {len(tokens)} tokens"
```
**Lesson: `val_loss=0.000` means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.**
## Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)
After fixing the truncation bug, real training started. `val_loss`: 0.080 (not 0.000!).
I discovered that **every systematic bug can be fixed with 10-15 targeted examples**:
| Bug | Examples needed | Result |
|-----|:-:|---|
| `'optional'` validation rule (not a Laravel rule) | 10 | Fixed — generates `'nullable'` |
| `wasRecentlyCreated` in resources | 5 | Fixed — uses correct timestamps |
| Cross-resource missing imports | 13 | Fixed — 12 bugs → 0 |
| Missing `HasFactory` trait | 20 (fixed existing) | Fixed — 5 bugs → 0 |
The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.
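For illustration, a targeted example for the `'optional'` bug, written in mlx-lm's chat-format JSONL. The content here is hypothetical, not copied from the actual dataset (linked at the end):
```python
import json

# Hypothetical targeted example: the prompt says "optional", the gold
# completion uses 'nullable' -- the pattern we want the model to prefer.
example = {
    "messages": [
        {"role": "system", "content": "Laravel 13.x code generator. Output ONLY PHP."},
        {"role": "user", "content": "Create StoreEventRequest; venue_id is optional"},
        {"role": "assistant", "content": "<?php ... 'venue_id' => ['nullable', 'integer'], ..."},
    ]
}

# mlx-lm's chat format expects one {"messages": [...]} object per JSONL line.
with open("data/train.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```
Vary the surrounding context (class names, fields, rule combinations) across the 10-15 copies so the model learns the rule, not the example.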
### The Eval Script Trap
I built an automated bug checker. It flagged `StoreBookRequest $request` as "missing `Illuminate\Http\Request` import" because the regex `'Request $request'` matched as a substring.
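A minimal reproduction, with a word-boundary anchor as one possible fix (the original checker's exact pattern is paraphrased here, so treat this as an assumption):
```python
import re

sig = "public function store(StoreBookRequest $request)"

naive = re.compile(r"Request \$request")    # bare substring match
fixed = re.compile(r"\bRequest \$request")  # \b rejects ...BookRequest

print(bool(naive.search(sig)))  # True  -- false positive on a form request
print(bool(fixed.search(sig)))  # False -- form requests no longer flagged
print(bool(fixed.search("public function index(Request $request)")))  # True
```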
**Test your eval script on correct code before trusting it.**
### Where I Hit the Wall
After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were **semantic hallucinations**:
- Model invents a `user()` relationship that doesn't exist
- Controller uses closure-based eager loading when the array format is correct
- Model generates `->withHttpStatus()` — a method that doesn't exist
Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.
## Phase 3: The Spec Pivot (The Real Breakthrough)
Instead of natural language:
> "Create a Post model with author relationship, fillable title and body, soft deletes"
I switched to structured JSON specs:
```json
{
  "artifact": "model",
  "class": "Post",
  "table": "posts",
  "has_factory": true,
  "soft_deletes": true,
  "fillable": ["title", "body", "user_id"],
  "relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
  ]
}
```
### First test: 28 examples, 100 iterations
Result: **26/26 eval perfect. Zero semantic hallucinations.** (Compare: 308 NL examples still had 5 hallucinations.)
The model can't invent a `user()` relationship if `relationships[]` explicitly lists only `author`. The spec removes the model's ability to hallucinate about *what* to generate. It only decides *how*.
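For orientation, generating from a spec with mlx-lm looks roughly like this. The model repo, file paths, and system prompt below are illustrative; the real pipeline is in the repo linked at the end:
```python
import json
from mlx_lm import load, generate

model, tokenizer = load(
    "mlx-community/Qwen2.5-Coder-7B-Instruct-4bit",  # assumed 4-bit base repo
    adapter_path="./adapters_spec_v4",               # trained LoRA adapter
)

with open("post_spec.json") as f:  # the JSON spec shown above
    spec = json.load(f)

messages = [
    {"role": "system", "content": "Laravel 13.x code generator. Output ONLY PHP."},
    {"role": "user", "content": json.dumps(spec)},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
php = generate(model, tokenizer, prompt=prompt, max_tokens=1024)
```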
### The Spec Compiler
I built a compiler that validates specs before generation:
```
$ python3 spec_compiler.py bad_spec.json
SpecCompileError: rules['venue_id'] contains conditional token
'required_on_post'. Use 'conditional_rules' dict instead.
```
Validation: <1ms. Generation: ~30s per file. Catch errors early.
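The compiler itself isn't reproduced in this post, but a sketch of the kind of check behind that error (token list and spec structure assumed) looks like:
```python
class SpecCompileError(Exception):
    pass

# Assumed token list -- mirrors the error message above, not the real compiler.
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put", "required_on_patch"}

def validate_rules(spec: dict) -> None:
    """Reject conditional tokens smuggled into plain validation rules."""
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            if rule in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {rule!r}. "
                    "Use 'conditional_rules' dict instead."
                )
```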
### Final Results: adapters_spec_v4
| Metric | NL Pipeline (308 ex) | Spec Pipeline (49 ex) |
|--------|:---:|:---:|
| PHP valid | 26/26 | 26/26 |
| Pest pass | 52/58 | **20/20** |
| Manual fixes | 5 | 4 |
| Semantic hallucinations | 5 | **0** |
| Training time | ~30 min | ~15 min |
## The Debugging Checklist
Distilled from 34 steps of hitting walls:
**Before training:**
1. Tokenize ALL examples. Check `max(total_tokens) < max_seq_length`.
2. Check `min(completion_tokens) > 0`. If it's zero, the system prompt is too long (see the sketch after this checklist).
3. Close all GPU-using processes. Check memory with `vm_stat`.
4. Use `--num-layers 8` (not `--lora-layers 8`) on 16GB machines.
**After training:**
5. If `val_loss = 0.000`: training is broken, not perfect.
6. Generate 3-5 test files and inspect manually before full benchmark.
7. Run `php -l` on all output (syntax check).
**When bugs persist:**
8. Classify: is it a training data gap or a model capability limit?
9. If data gap: write 10-15 targeted examples with diverse contexts.
10. If capability limit: change the input format (structured specs).
11. If hallucinations persist after targeted training: the problem is **ontological** — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.
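A pre-flight sketch covering checks 1 and 2, assuming a chat-format JSONL dataset and an HF-style tokenizer (mlx-lm's tokenizer wrapper exposes `apply_chat_template`); the counts are approximate, since mlx-lm's own masking may differ slightly:
```python
import json

def preflight(path: str, tokenizer, max_seq_length: int) -> None:
    """Checks 1-2: every example fits the window, every completion survives."""
    with open(path) as f:
        for i, line in enumerate(f):
            msgs = json.loads(line)["messages"]  # last message = completion
            total = len(tokenizer.apply_chat_template(msgs))
            prompt = len(tokenizer.apply_chat_template(
                msgs[:-1], add_generation_prompt=True))
            assert total <= max_seq_length, f"example {i}: {total} tokens > window"
            assert prompt < max_seq_length, f"example {i}: completion fully truncated"
```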
## What 7B Models Do Well vs Poorly
**Does well:**
- Individual class generation with clear patterns
- PHP syntax (very rare errors after basic fine-tuning)
- Following explicit rules in the system prompt
- CRUD operations with a single model
**Does poorly:**
- Multi-file consistency (imports across files)
- Knowing what NOT to add (hallucinated relationships)
- Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)
- Complex relationship traversal
**The key insight:** 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.
## Try It Yourself
Everything is open source:
- **Spec-trained model**: [fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec](https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec)
- **Training data**: [fchis/laravel-buildspec-training](https://huggingface.co/datasets/fchis/laravel-buildspec-training) (49 examples)
- **Full pipeline**: [github.com/florinel-chis/laravel-ai-gen](https://github.com/florinel-chis/laravel-ai-gen)
```bash
pip install mlx-lm
# Full pipeline: NL → specs → compile → PHP files
python3 pipeline_spec.py "Create a REST API for managing blog posts with tags"
# Or use a spec directly
python3 pipeline_spec.py --spec my_specs.json --output ./generated
```
Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.
---
*This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.*