<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>I Spent 34 Steps Building a Code Generator on My MacBook</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px 40px;
line-height: 1.7;
color: #1a1a2e;
background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.</p>
<p><strong>The end result:</strong> A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.</p>
<p>Here's how I actually got there.</p>
<h2>The Hardware</h2>
<ul>
<li>Apple M2 Pro, 16GB unified memory</li>
<li>Qwen2.5-Coder-7B-Instruct, 4-bit quantized</li>
<li>MLX framework with LoRA</li>
<li>Target: Laravel 13.x PHP code generation</li>
</ul>
<p>The 16GB constraint shaped every architectural decision. You can't load two 7B models at once. You can't train with <code>max_seq_length=4096</code>. You have to close LM Studio before training, or your machine crashes.</p>
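<p>A back-of-the-envelope budget shows why. This is my own rough sketch; the adapter size, activation estimate, and overhead figures are loose assumptions, not measured numbers:</p>

```python
# Rough memory budget for 4-bit LoRA fine-tuning of a 7B model on 16GB.
# All overhead figures below are loose assumptions, not measurements.
params = 7_000_000_000
weights_gb = params * 0.5 / 1e9        # 4-bit quantized ≈ 0.5 bytes/param
lora_params = 20_000_000               # adapter size: order-of-magnitude guess
lora_gb = lora_params * 16 / 1e9       # fp32 weights + grads + Adam moments
activations_gb = 4.0                   # grows with max_seq_length and batch size
total_gb = weights_gb + lora_gb + activations_gb
print(f"~{total_gb:.1f} GB before the OS and anything else you left open")
```

<p>Roughly half of the 16GB is spoken for before the OS takes its share, which is why a second resident model (LM Studio) tips the machine over.</p>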
<h2>Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)</h2>
<p>I started with 90 training examples and grew to 261 across 6 sprints. <code>val_loss</code> kept dropping. By Sprint 6, it hit <strong>0.000</strong>. Perfect.</p>
<p>Except the generated code wasn't getting better. At all.</p>
<h3>The Root Cause</h3>
<p>The system prompt (guidelines for the model) had grown organically across sprints to <strong>2,380 tokens</strong>. My <code>max_seq_length</code> was <strong>1,500</strong>.</p>
<p>MLX truncates training examples silently at <code>max_seq_length</code>. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).</p>
<p><strong>Six sprints. Hundreds of examples. Zero code learning.</strong></p>
<h3>The Fix</h3>
<div class="codehilite"><pre><span></span><code><span class="c1"># BEFORE: 2380 tokens of verbose guidelines</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""You are an expert Laravel developer. When writing models,</span>
<span class="s2">always use the HasFactory trait. The HasFactory trait enables...</span>
<span class="s2">[2380 tokens of examples and explanations]"""</span>

<span class="c1"># AFTER: 843 tokens, compressed</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""Laravel 13.x code generator. Output ONLY PHP.</span>
<span class="s2">- model: use HasFactory, add relationships from spec</span>
<span class="s2">- controller: import Controller, destroy() returns noContent()</span>
<span class="s2">..."""</span>
</code></pre></div>
<p>And the verification I should have done from the start:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># Check that completions aren't truncated</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="s2">"text"</span><span class="p">])</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_seq_length</span><span class="p">,</span> <span class="sa">f</span><span class="s2">"Truncated at </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s2"> tokens"</span>
</code></pre></div>
<p><strong>Lesson: <code>val_loss=0.000</code> means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.</strong></p>
<h2>Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)</h2>
<p>After fixing the truncation bug, real training started. <code>val_loss</code>: 0.080 (not 0.000!).</p>
<p>I discovered that <strong>every systematic bug can be fixed with 10-15 targeted examples</strong>:</p>
<table>
<thead>
<tr>
<th>Bug</th>
<th style="text-align: center;">Examples needed</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'optional'</code> validation rule (not a Laravel rule)</td>
<td style="text-align: center;">10</td>
<td>Fixed — generates <code>'nullable'</code></td>
</tr>
<tr>
<td><code>wasRecentlyCreated</code> in resources</td>
<td style="text-align: center;">5</td>
<td>Fixed — uses correct timestamps</td>
</tr>
<tr>
<td>Cross-resource missing imports</td>
<td style="text-align: center;">13</td>
<td>Fixed — 12 bugs → 0</td>
</tr>
<tr>
<td>Missing <code>HasFactory</code> trait</td>
<td style="text-align: center;">20 (fixed existing)</td>
<td>Fixed — 5 bugs → 0</td>
</tr>
</tbody>
</table>
<p>The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.</p>
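<p>What a "targeted" batch looks like in practice. A sketch in the MLX-LM chat JSONL format; the file name, prompts, and field list are illustrative, not the actual training data:</p>

```python
import json

# Hypothetical targeted batch for the 'optional' -> 'nullable' bug:
# the same correct pattern repeated across diverse field contexts.
fields = ["bio", "avatar_url", "middle_name", "cancel_reason", "notes"]
examples = [
    {"messages": [
        {"role": "user",
         "content": f"FormRequest validation rule for optional string field '{f}'"},
        {"role": "assistant",
         "content": f"'{f}' => ['nullable', 'string'],"},
    ]}
    for f in fields
]

with open("fix_nullable.jsonl", "w") as out:
    for ex in examples:
        out.write(json.dumps(ex) + "\n")
```

<p>The variety matters more than the volume: 10-15 distinct contexts, one consistent correct answer.</p>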
<h3>The Eval Script Trap</h3>
<p>I built an automated bug checker. It flagged <code>StoreBookRequest $request</code> as "missing <code>Illuminate\Http\Request</code> import" because the regex <code>'Request $request'</code> matched as a substring.</p>
<p><strong>Test your eval script on correct code before trusting it.</strong></p>
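<p>A minimal reproduction of that false positive (my own sketch, not the original eval script):</p>

```python
import re

line = "public function store(StoreBookRequest $request)"

# Naive substring check: fires, because "Request $request" is a suffix
# of "StoreBookRequest $request".
naive_hit = "Request $request" in line

# Word-boundary check: the 'k' before 'Request' is a word character, so
# \b refuses to match inside 'StoreBookRequest'.
strict_hit = re.search(r"\bRequest \$request\b", line) is not None

print(naive_hit, strict_hit)  # True False
```

<p>One <code>\b</code> is the difference between "12 real bugs" and "12 real bugs plus every FormRequest in the project".</p>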
<h3>Where I Hit the Wall</h3>
<p>After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were <strong>semantic hallucinations</strong>:</p>
<ul>
<li>The model invents a <code>user()</code> relationship that doesn't exist</li>
<li>A controller uses closure-based eager loading when the array format is correct</li>
<li>The model generates <code>->withHttpStatus()</code> — a method that doesn't exist</li>
</ul>
<p>Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.</p>
<h2>Phase 3: The Spec Pivot (The Real Breakthrough)</h2>
<p>Instead of natural language:</p>
<blockquote>
<p>"Create a Post model with author relationship, fillable title and body, soft deletes"</p>
</blockquote>
<p>I switched to structured JSON specs:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w">  </span><span class="nt">"artifact"</span><span class="p">:</span><span class="w"> </span><span class="s2">"model"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"class"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Post"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"table"</span><span class="p">:</span><span class="w"> </span><span class="s2">"posts"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"has_factory"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"soft_deletes"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"fillable"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"title"</span><span class="p">,</span><span class="w"> </span><span class="s2">"body"</span><span class="p">,</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">],</span>
<span class="w">  </span><span class="nt">"relationships"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w">    </span><span class="p">{</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BelongsTo"</span><span class="p">,</span><span class="w"> </span><span class="nt">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User"</span><span class="p">,</span><span class="w"> </span><span class="nt">"method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"author"</span><span class="p">,</span><span class="w"> </span><span class="nt">"foreign_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">}</span>
<span class="w">  </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<h3>First test: 28 examples, 100 iterations</h3>
<p>Result: <strong>26/26 eval perfect. Zero semantic hallucinations.</strong> (Compare: 308 NL examples still had 5 hallucinations.)</p>
<p>The model can't invent a <code>user()</code> relationship if <code>relationships[]</code> explicitly lists only <code>author</code>. The spec removes the model's ability to hallucinate about <em>what</em> to generate. It only decides <em>how</em>.</p>
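<p>That property is also mechanically checkable after generation. A simplified sketch of the idea; the regex and helper are mine, not part of the published pipeline:</p>

```python
import re

spec = {"relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
]}

generated_php = """
public function author(): BelongsTo
{
    return $this->belongsTo(User::class, 'user_id');
}
"""

# Every relationship method the model emits must be declared in the spec.
allowed = {r["method"] for r in spec["relationships"]}
emitted = set(re.findall(
    r"public function (\w+)\(\): (?:BelongsTo|HasMany|HasOne|BelongsToMany)",
    generated_php))
hallucinated = emitted - allowed
assert not hallucinated, f"hallucinated relationships: {hallucinated}"
```

<p>With an NL prompt there is nothing to diff the output against; with a spec, the diff is one set subtraction.</p>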
<h3>The Spec Compiler</h3>
<p>I built a compiler that validates specs before generation:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>bad_spec.json
SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">'venue_id'</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">'required_on_post'</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">'conditional_rules'</span><span class="w"> </span>dict<span class="w"> </span>instead.
</code></pre></div>
<p>Validation: &lt;1ms. Generation: ~30s per file. Catch errors early.</p>
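<p>The compiler itself doesn't need to be clever. A minimal sketch of that kind of pre-flight check; the key names mirror the error message above, but the exception class and rule set here are illustrative, not the real <code>spec_compiler.py</code>:</p>

```python
class SpecCompileError(Exception):
    pass

REQUIRED_KEYS = {"artifact", "class"}
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}

def compile_spec(spec: dict) -> dict:
    """Reject malformed specs before spending ~30s of generation on them."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise SpecCompileError(f"spec is missing keys: {sorted(missing)}")
    for field, rules in spec.get("rules", {}).items():
        # Conditional tokens belong in a separate dict, not the flat rule list.
        bad = CONDITIONAL_TOKENS.intersection(rules)
        if bad:
            raise SpecCompileError(
                f"rules[{field!r}] contains conditional token {bad.pop()!r}. "
                "Use 'conditional_rules' dict instead.")
    return spec
```

<p>Failing in microseconds on a bad spec beats discovering the problem in the generated PHP half a minute later.</p>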
<h3>Final Results: adapters_spec_v4</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th style="text-align: center;">NL Pipeline (308 ex)</th>
<th style="text-align: center;">Spec Pipeline (49 ex)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PHP valid</td>
<td style="text-align: center;">26/26</td>
<td style="text-align: center;">26/26</td>
</tr>
<tr>
<td>Pest pass</td>
<td style="text-align: center;">52/58</td>
<td style="text-align: center;"><strong>20/20</strong></td>
</tr>
<tr>
<td>Manual fixes</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td>Semantic hallucinations</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;"><strong>0</strong></td>
</tr>
<tr>
<td>Training time</td>
<td style="text-align: center;">~30 min</td>
<td style="text-align: center;">~15 min</td>
</tr>
</tbody>
</table>
<h2>The Debugging Checklist</h2>
<p>Distilled from 34 steps of hitting walls:</p>
<p><strong>Before training:</strong></p>
<ol>
<li>Tokenize ALL examples. Check <code>max(total_tokens) &lt; max_seq_length</code>.</li>
<li>Check <code>min(completion_tokens) > 0</code>. If it's zero, the system prompt is too long.</li>
<li>Close all GPU-using processes. Check memory with <code>vm_stat</code>.</li>
<li>Use <code>--num-layers 8</code> (not <code>--lora-layers 8</code>) on 16GB machines.</li>
</ol>
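<p>Item 3 is scriptable. A rough macOS-only sketch that parses <code>vm_stat</code> text; counting free plus inactive pages as "reclaimable" is my simplification:</p>

```python
import re

def free_memory_gb(vm_stat_output: str) -> float:
    """Estimate reclaimable memory from `vm_stat` text (free + inactive pages)."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    pages = 0
    for kind in ("Pages free", "Pages inactive"):
        m = re.search(rf"{kind}:\s+(\d+)", vm_stat_output)
        if m:
            pages += int(m.group(1))
    return pages * page_size / 1e9

# On a Mac, feed it live output:
#   import subprocess
#   text = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                      100000.
Pages inactive:                  200000.
"""
print(round(free_memory_gb(sample), 2))  # 4.92
```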
<p><strong>After training:</strong></p>
<ol start="5">
<li>If <code>val_loss = 0.000</code>: training is broken, not perfect.</li>
<li>Generate 3-5 test files and inspect them manually before running the full benchmark.</li>
<li>Run <code>php -l</code> on all output (syntax check).</li>
</ol>
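<p>Item 7 is a loop over the output directory. A sketch assuming a <code>php</code> binary on your PATH; the directory name is illustrative:</p>

```python
import subprocess
from pathlib import Path

def lint_php(directory: str) -> list[str]:
    """Run `php -l` on every .php file under `directory`; return failures."""
    failures = []
    for path in Path(directory).rglob("*.php"):
        result = subprocess.run(["php", "-l", str(path)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{path}: {result.stderr.strip()}")
    return failures

# e.g. lint_php("./generated") -> [] when every file parses
```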
<p><strong>When bugs persist:</strong></p>
<ol start="8">
<li>Classify: is it a training-data gap or a model capability limit?</li>
<li>If it's a data gap: write 10-15 targeted examples with diverse contexts.</li>
<li>If it's a capability limit: change the input format (structured specs).</li>
<li>If hallucinations persist after targeted training: the problem is <strong>ontological</strong> — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight it with more NL examples.</li>
</ol>
<h2>What 7B Models Do Well vs Poorly</h2>
<p><strong>Does well:</strong></p>
<ul>
<li>Individual class generation with clear patterns</li>
<li>PHP syntax (very rare errors after basic fine-tuning)</li>
<li>Following explicit rules in the system prompt</li>
<li>CRUD operations with a single model</li>
</ul>
<p><strong>Does poorly:</strong></p>
<ul>
<li>Multi-file consistency (imports across files)</li>
<li>Knowing what NOT to add (hallucinated relationships)</li>
<li>Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)</li>
<li>Complex relationship traversal</li>
</ul>
<p><strong>The key insight:</strong> 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.</p>
<h2>Try It Yourself</h2>
<p>Everything is open source:</p>
<ul>
<li><strong>Spec-trained model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Training data</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a> (49 examples)</li>
<li><strong>Full pipeline</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm
<span class="c1"># Full pipeline: NL → specs → compile → PHP files</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">"Create a REST API for managing blog posts with tags"</span>
<span class="c1"># Or use a spec directly</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span>--spec<span class="w"> </span>my_specs.json<span class="w"> </span>--output<span class="w"> </span>./generated
</code></pre></div>
<p>Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.</p>
<hr />
<p><em>This post is an abbreviated version of "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper, with detailed results, a bug taxonomy, and infrastructure lessons, is available as a preprint.</em></p>
</body>
</html>