<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>I Spent 34 Steps Building a Code Generator on My MacBook</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px 40px;
line-height: 1.7;
color: #1a1a2e;
background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.</p>
<p><strong>The end result:</strong> A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.</p>
<p>Here's how I actually got there.</p>
<h2>The Hardware</h2>
<ul>
<li>Apple M2 Pro, 16GB unified memory</li>
<li>Qwen2.5-Coder-7B-Instruct, 4-bit quantized</li>
<li>MLX framework with LoRA</li>
<li>Target: Laravel 13.x PHP code generation</li>
</ul>
<p>The 16GB constraint shaped every architectural decision. You can't load two 7B models at once. You can't train with <code>max_seq_length=4096</code>. You have to close LM Studio before training, or your machine crashes.</p>
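<p>A back-of-the-envelope budget shows why. This is my own rough sketch; the adapter size, activation estimate, and overhead figures are loose assumptions, not measured numbers:</p>

```python
# Rough memory budget for 4-bit LoRA fine-tuning of a 7B model on 16GB.
# All overhead figures below are loose assumptions, not measurements.
params = 7_000_000_000
weights_gb = params * 0.5 / 1e9        # 4-bit quantized ≈ 0.5 bytes/param
lora_params = 20_000_000               # adapter size: order-of-magnitude guess
lora_gb = lora_params * 16 / 1e9       # fp32 weights + grads + Adam moments
activations_gb = 4.0                   # grows with max_seq_length and batch size
total_gb = weights_gb + lora_gb + activations_gb
print(f"~{total_gb:.1f} GB before the OS and anything else you left open")
```

<p>Roughly half of the 16GB is spoken for before the OS takes its share, which is why a second resident model (LM Studio) tips the machine over.</p>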
<h2>Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)</h2>
<p>I started with 90 training examples and grew to 261 across 6 sprints. <code>val_loss</code> kept dropping. By Sprint 6, it hit <strong>0.000</strong>. Perfect.</p>
<p>Except the generated code wasn't getting better. At all.</p>
<h3>The Root Cause</h3>
<p>The system prompt (guidelines for the model) had grown organically across sprints to <strong>2,380 tokens</strong>. My <code>max_seq_length</code> was <strong>1,500</strong>.</p>
<p>MLX truncates training examples silently at <code>max_seq_length</code>. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).</p>
<p><strong>Six sprints. Hundreds of examples. Zero code learning.</strong></p>
<h3>The Fix</h3>
<div class="codehilite"><pre><span></span><code><span class="c1"># BEFORE: 2380 tokens of verbose guidelines</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""You are an expert Laravel developer. When writing models,</span>
<span class="s2">always use the HasFactory trait. The HasFactory trait enables...</span>
<span class="s2">[2380 tokens of examples and explanations]"""</span>

<span class="c1"># AFTER: 843 tokens, compressed</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">"""Laravel 13.x code generator. Output ONLY PHP.</span>
<span class="s2">- model: use HasFactory, add relationships from spec</span>
<span class="s2">- controller: import Controller, destroy() returns noContent()</span>
<span class="s2">..."""</span>
</code></pre></div>
<p>And the verification I should have done from the start:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># Check that completions aren't truncated</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
    <span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="s2">"text"</span><span class="p">])</span>
    <span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_seq_length</span><span class="p">,</span> <span class="sa">f</span><span class="s2">"Truncated at </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s2"> tokens"</span>
</code></pre></div>
<p><strong>Lesson: <code>val_loss=0.000</code> means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.</strong></p>
<h2>Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)</h2>
<p>After fixing the truncation bug, real training started. <code>val_loss</code>: 0.080 (not 0.000!).</p>
<p>I discovered that <strong>every systematic bug can be fixed with 10-15 targeted examples</strong>:</p>
<table>
<thead>
<tr>
<th>Bug</th>
<th style="text-align: center;">Examples needed</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'optional'</code> validation rule (not a Laravel rule)</td>
<td style="text-align: center;">10</td>
<td>Fixed — generates <code>'nullable'</code></td>
</tr>
<tr>
<td><code>wasRecentlyCreated</code> in resources</td>
<td style="text-align: center;">5</td>
<td>Fixed — uses correct timestamps</td>
</tr>
<tr>
<td>Cross-resource missing imports</td>
<td style="text-align: center;">13</td>
<td>Fixed — 12 bugs → 0</td>
</tr>
<tr>
<td>Missing <code>HasFactory</code> trait</td>
<td style="text-align: center;">20 (fixed existing)</td>
<td>Fixed — 5 bugs → 0</td>
</tr>
</tbody>
</table>
<p>The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.</p>
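<p>What a "targeted" batch looks like in practice. A sketch in the MLX-LM chat JSONL format; the file name, prompts, and field list are illustrative, not the actual training data:</p>

```python
import json

# Hypothetical targeted batch for the 'optional' -> 'nullable' bug:
# the same correct pattern repeated across diverse field contexts.
fields = ["bio", "avatar_url", "middle_name", "cancel_reason", "notes"]
examples = [
    {"messages": [
        {"role": "user",
         "content": f"FormRequest validation rule for optional string field '{f}'"},
        {"role": "assistant",
         "content": f"'{f}' => ['nullable', 'string'],"},
    ]}
    for f in fields
]

with open("fix_nullable.jsonl", "w") as out:
    for ex in examples:
        out.write(json.dumps(ex) + "\n")
```

<p>The variety matters more than the volume: 10-15 distinct contexts, one consistent correct answer.</p>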
<h3>The Eval Script Trap</h3>
<p>I built an automated bug checker. It flagged <code>StoreBookRequest $request</code> as "missing <code>Illuminate\Http\Request</code> import" because the regex <code>'Request $request'</code> matched as a substring.</p>
<p><strong>Test your eval script on correct code before trusting it.</strong></p>
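<p>A minimal reproduction of that false positive (my own sketch, not the original eval script):</p>

```python
import re

line = "public function store(StoreBookRequest $request)"

# Naive substring check: fires, because "Request $request" is a suffix
# of "StoreBookRequest $request".
naive_hit = "Request $request" in line

# Word-boundary check: the 'k' before 'Request' is a word character, so
# \b refuses to match inside 'StoreBookRequest'.
strict_hit = re.search(r"\bRequest \$request\b", line) is not None

print(naive_hit, strict_hit)  # True False
```

<p>One <code>\b</code> is the difference between "12 real bugs" and "12 real bugs plus every FormRequest in the project".</p>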
<h3>Where I Hit the Wall</h3>
<p>After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were <strong>semantic hallucinations</strong>:</p>
<ul>
<li>The model invents a <code>user()</code> relationship that doesn't exist</li>
<li>A controller uses closure-based eager loading when the array format is correct</li>
<li>The model generates <code>->withHttpStatus()</code> — a method that doesn't exist</li>
</ul>
<p>Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.</p>
<h2>Phase 3: The Spec Pivot (The Real Breakthrough)</h2>
<p>Instead of natural language:</p>
<blockquote>
<p>"Create a Post model with author relationship, fillable title and body, soft deletes"</p>
</blockquote>
<p>I switched to structured JSON specs:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w">  </span><span class="nt">"artifact"</span><span class="p">:</span><span class="w"> </span><span class="s2">"model"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"class"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Post"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"table"</span><span class="p">:</span><span class="w"> </span><span class="s2">"posts"</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"has_factory"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"soft_deletes"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w">  </span><span class="nt">"fillable"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"title"</span><span class="p">,</span><span class="w"> </span><span class="s2">"body"</span><span class="p">,</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">],</span>
<span class="w">  </span><span class="nt">"relationships"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w">    </span><span class="p">{</span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"BelongsTo"</span><span class="p">,</span><span class="w"> </span><span class="nt">"model"</span><span class="p">:</span><span class="w"> </span><span class="s2">"User"</span><span class="p">,</span><span class="w"> </span><span class="nt">"method"</span><span class="p">:</span><span class="w"> </span><span class="s2">"author"</span><span class="p">,</span><span class="w"> </span><span class="nt">"foreign_key"</span><span class="p">:</span><span class="w"> </span><span class="s2">"user_id"</span><span class="p">}</span>
<span class="w">  </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<h3>First test: 28 examples, 100 iterations</h3>
<p>Result: <strong>26/26 eval perfect. Zero semantic hallucinations.</strong> (Compare: 308 NL examples still had 5 hallucinations.)</p>
<p>The model can't invent a <code>user()</code> relationship if <code>relationships[]</code> explicitly lists only <code>author</code>. The spec removes the model's ability to hallucinate about <em>what</em> to generate. It only decides <em>how</em>.</p>
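<p>That property is also mechanically checkable after generation. A simplified sketch of the idea; the regex and helper are mine, not part of the published pipeline:</p>

```python
import re

spec = {"relationships": [
    {"type": "BelongsTo", "model": "User", "method": "author", "foreign_key": "user_id"}
]}

generated_php = """
public function author(): BelongsTo
{
    return $this->belongsTo(User::class, 'user_id');
}
"""

# Every relationship method the model emits must be declared in the spec.
allowed = {r["method"] for r in spec["relationships"]}
emitted = set(re.findall(
    r"public function (\w+)\(\): (?:BelongsTo|HasMany|HasOne|BelongsToMany)",
    generated_php))
hallucinated = emitted - allowed
assert not hallucinated, f"hallucinated relationships: {hallucinated}"
```

<p>With an NL prompt there is nothing to diff the output against; with a spec, the diff is one set subtraction.</p>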
<h3>The Spec Compiler</h3>
<p>I built a compiler that validates specs before generation:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>bad_spec.json
SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">'venue_id'</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">'required_on_post'</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">'conditional_rules'</span><span class="w"> </span>dict<span class="w"> </span>instead.
</code></pre></div>
<p>Validation: &lt;1ms. Generation: ~30s per file. Catch errors early.</p>
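<p>The compiler itself doesn't need to be clever. A minimal sketch of that kind of pre-flight check; the key names mirror the error message above, but the exception class and rule set here are illustrative, not the real <code>spec_compiler.py</code>:</p>

```python
class SpecCompileError(Exception):
    pass

REQUIRED_KEYS = {"artifact", "class"}
CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}

def compile_spec(spec: dict) -> dict:
    """Reject malformed specs before spending ~30s of generation on them."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise SpecCompileError(f"spec is missing keys: {sorted(missing)}")
    for field, rules in spec.get("rules", {}).items():
        # Conditional tokens belong in a separate dict, not the flat rule list.
        bad = CONDITIONAL_TOKENS.intersection(rules)
        if bad:
            raise SpecCompileError(
                f"rules[{field!r}] contains conditional token {bad.pop()!r}. "
                "Use 'conditional_rules' dict instead.")
    return spec
```

<p>Failing in microseconds on a bad spec beats discovering the problem in the generated PHP half a minute later.</p>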
<h3>Final Results: adapters_spec_v4</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th style="text-align: center;">NL Pipeline (308 ex)</th>
<th style="text-align: center;">Spec Pipeline (49 ex)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PHP valid</td>
<td style="text-align: center;">26/26</td>
<td style="text-align: center;">26/26</td>
</tr>
<tr>
<td>Pest pass</td>
<td style="text-align: center;">52/58</td>
<td style="text-align: center;"><strong>20/20</strong></td>
</tr>
<tr>
<td>Manual fixes</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td>Semantic hallucinations</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;"><strong>0</strong></td>
</tr>
<tr>
<td>Training time</td>
<td style="text-align: center;">~30 min</td>
<td style="text-align: center;">~15 min</td>
</tr>
</tbody>
</table>
<h2>The Debugging Checklist</h2>
<p>Distilled from 34 steps of hitting walls:</p>
<p><strong>Before training:</strong></p>
<ol>
<li>Tokenize ALL examples. Check <code>max(total_tokens) &lt; max_seq_length</code>.</li>
<li>Check <code>min(completion_tokens) > 0</code>. If it's zero, the system prompt is too long.</li>
<li>Close all GPU-using processes. Check memory with <code>vm_stat</code>.</li>
<li>Use <code>--num-layers 8</code> (not <code>--lora-layers 8</code>) on 16GB machines.</li>
</ol>
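<p>Item 3 is scriptable. A rough macOS-only sketch that parses <code>vm_stat</code> text; counting free plus inactive pages as "reclaimable" is my simplification:</p>

```python
import re

def free_memory_gb(vm_stat_output: str) -> float:
    """Estimate reclaimable memory from `vm_stat` text (free + inactive pages)."""
    page_size = int(re.search(r"page size of (\d+) bytes", vm_stat_output).group(1))
    pages = 0
    for kind in ("Pages free", "Pages inactive"):
        m = re.search(rf"{kind}:\s+(\d+)", vm_stat_output)
        if m:
            pages += int(m.group(1))
    return pages * page_size / 1e9

# On a Mac, feed it live output:
#   import subprocess
#   text = subprocess.run(["vm_stat"], capture_output=True, text=True).stdout
sample = """Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                      100000.
Pages inactive:                  200000.
"""
print(round(free_memory_gb(sample), 2))  # 4.92
```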
<p><strong>After training:</strong></p>
<ol start="5">
<li>If <code>val_loss = 0.000</code>: training is broken, not perfect.</li>
<li>Generate 3-5 test files and inspect them manually before running the full benchmark.</li>
<li>Run <code>php -l</code> on all output (syntax check).</li>
</ol>
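<p>Item 7 is a loop over the output directory. A sketch assuming a <code>php</code> binary on your PATH; the directory name is illustrative:</p>

```python
import subprocess
from pathlib import Path

def lint_php(directory: str) -> list[str]:
    """Run `php -l` on every .php file under `directory`; return failures."""
    failures = []
    for path in Path(directory).rglob("*.php"):
        result = subprocess.run(["php", "-l", str(path)],
                                capture_output=True, text=True)
        if result.returncode != 0:
            failures.append(f"{path}: {result.stderr.strip()}")
    return failures

# e.g. lint_php("./generated") -> [] when every file parses
```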
<p><strong>When bugs persist:</strong></p>
<ol start="8">
<li>Classify: is it a training-data gap or a model capability limit?</li>
<li>If it's a data gap: write 10-15 targeted examples with diverse contexts.</li>
<li>If it's a capability limit: change the input format (structured specs).</li>
<li>If hallucinations persist after targeted training: the problem is <strong>ontological</strong> — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec); don't fight it with more NL examples.</li>
</ol>
<h2>What 7B Models Do Well vs Poorly</h2>
<p><strong>Does well:</strong></p>
<ul>
<li>Individual class generation with clear patterns</li>
<li>PHP syntax (very rare errors after basic fine-tuning)</li>
<li>Following explicit rules in the system prompt</li>
<li>CRUD operations with a single model</li>
</ul>
<p><strong>Does poorly:</strong></p>
<ul>
<li>Multi-file consistency (imports across files)</li>
<li>Knowing what NOT to add (hallucinated relationships)</li>
<li>Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)</li>
<li>Complex relationship traversal</li>
</ul>
<p><strong>The key insight:</strong> 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.</p>
<h2>Try It Yourself</h2>
<p>Everything is open source:</p>
<ul>
<li><strong>Spec-trained model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Training data</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a> (49 examples)</li>
<li><strong>Full pipeline</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm
<span class="c1"># Full pipeline: NL → specs → compile → PHP files</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">"Create a REST API for managing blog posts with tags"</span>
<span class="c1"># Or use a spec directly</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span>--spec<span class="w"> </span>my_specs.json<span class="w"> </span>--output<span class="w"> </span>./generated
</code></pre></div>
<p>Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.</p>
<hr />
<p><em>This post is an abbreviated version of "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper, with detailed results, a bug taxonomy, and infrastructure lessons, is available as a preprint.</em></p>
</body>
</html>