<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>I Spent 34 Steps Building a Code Generator on My MacBook</title>
<style>
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
max-width: 800px;
margin: 0 auto;
padding: 20px 40px;
line-height: 1.7;
color: #1a1a2e;
background: #fafafa;
}
h1 { font-size: 2em; margin-top: 1em; color: #0f0f23; }
h2 { font-size: 1.5em; margin-top: 1.5em; color: #16213e; border-bottom: 2px solid #e2e8f0; padding-bottom: 0.3em; }
h3 { font-size: 1.2em; color: #1a1a4e; }
code { background: #e8ecf1; padding: 2px 6px; border-radius: 3px; font-size: 0.9em; }
pre { background: #1e1e2e; color: #cdd6f4; padding: 16px; border-radius: 8px; overflow-x: auto; }
pre code { background: none; color: inherit; padding: 0; }
table { border-collapse: collapse; width: 100%; margin: 1em 0; }
th, td { border: 1px solid #d1d5db; padding: 8px 12px; text-align: left; }
th { background: #e8ecf1; font-weight: 600; }
tr:nth-child(even) { background: #f3f4f6; }
blockquote { border-left: 4px solid #6366f1; margin: 1em 0; padding: 0.5em 1em; background: #eef2ff; color: #312e81; }
a { color: #4f46e5; text-decoration: none; }
a:hover { text-decoration: underline; }
strong { color: #0f172a; }
hr { border: none; border-top: 2px solid #e2e8f0; margin: 2em 0; }
</style>
</head>
<body>
<h1>I Spent 34 Steps Building a Code Generator on My MacBook — Here's What Actually Worked</h1>
<p><strong>Florinel Chis</strong> · March 2026</p>
<hr />
<p>Most fine-tuning tutorials show you the happy path. This is the full path — including 6 training rounds that taught the model absolutely nothing, OOM crashes that killed my machine, and the realization that the real problem was never about the model.</p>
<p><strong>The end result:</strong> A Laravel PHP code generator that produces 26/26 valid PHP files with 20/20 Pest tests passing. Trained on 49 examples. Runs on an Apple M2 Pro with 16GB RAM. Total cloud GPU cost: $0.</p>
<p>Here's how I actually got there.</p>
<h2>The Hardware</h2>
<ul>
<li>Apple M2 Pro, 16GB unified memory</li>
<li>Qwen2.5-Coder-7B-Instruct, 4-bit quantized</li>
<li>MLX framework with LoRA</li>
<li>Target: Laravel 13.x PHP code generation</li>
</ul>
<p>The 16GB constraint shaped every architectural decision. You can't load two 7B models. You can't train with <code>max_seq_length=4096</code>. You close LM Studio before training or your machine crashes.</p>
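<p>For intuition, here is a rough back-of-envelope memory budget (approximate figures for illustration, not measurements from the training runs):</p>

```python
# Rough memory budget for LoRA fine-tuning a 4-bit 7B model in 16GB of
# unified memory. Figures are approximations for illustration.
params = 7_000_000_000
weights_gb = params * 0.5 / 1e9        # ~0.5 bytes/param at 4-bit (plus scales)
lora_params = 20_000_000               # adapter weights are tiny by comparison
optimizer_gb = lora_params * 12 / 1e9  # fp32 Adam states for adapters only

print(f"base weights: {weights_gb:.1f} GB")    # 3.5 GB
print(f"optimizer:    {optimizer_gb:.2f} GB")  # 0.24 GB

# Activations scale with batch size and max_seq_length, and the OS plus a
# second resident model (LM Studio) compete for the same unified memory,
# which is why everything else has to be closed before training.
```

<p>The remaining headroom goes to activations, the KV cache during eval generations, and the OS itself, which is why <code>max_seq_length=4096</code> doesn't fit.</p>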
<h2>Phase 1: Six Sprints of Nothing (The Silent Truncation Bug)</h2>
<p>I started with 90 training examples and grew to 261 across 6 sprints. <code>val_loss</code> kept dropping. By Sprint 6, it hit <strong>0.000</strong>. Perfect.</p>
<p>Except the generated code wasn't getting better. At all.</p>
<h3>The Root Cause</h3>
<p>The system prompt (guidelines for the model) had grown organically across sprints to <strong>2,380 tokens</strong>. My <code>max_seq_length</code> was <strong>1,500</strong>.</p>
<p>MLX truncates training examples silently at <code>max_seq_length</code>. Every single training example was cut off before the code completion even started. The model was being trained to predict its own system prompt — and it got really good at that (hence val_loss=0.000).</p>
<p><strong>Six sprints. Hundreds of examples. Zero code learning.</strong></p>
<h3>The Fix</h3>
<div class="codehilite"><pre><span></span><code><span class="c1"># BEFORE: 2380 tokens of verbose guidelines</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;You are an expert Laravel developer. When writing models,</span>
<span class="s2">always use the HasFactory trait. The HasFactory trait enables...</span>
<span class="s2">[2380 tokens of examples and explanations]&quot;&quot;&quot;</span>
<span class="c1"># AFTER: 843 tokens, compressed</span>
<span class="n">SYSTEM</span> <span class="o">=</span> <span class="s2">&quot;&quot;&quot;Laravel 13.x code generator. Output ONLY PHP.</span>
<span class="s2">- model: use HasFactory, add relationships from spec</span>
<span class="s2">- controller: import Controller, destroy() returns noContent()</span>
<span class="s2">...&quot;&quot;&quot;</span>
</code></pre></div>
<p>And the verification I should have done from the start:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># Check that completions aren&#39;t truncated</span>
<span class="k">for</span> <span class="n">example</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">:</span>
<span class="w">    </span><span class="n">tokens</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="s2">&quot;text&quot;</span><span class="p">])</span>
<span class="w">    </span><span class="k">assert</span> <span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_seq_length</span><span class="p">,</span> <span class="sa">f</span><span class="s2">&quot;Truncated at </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">tokens</span><span class="p">)</span><span class="si">}</span><span class="s2"> tokens&quot;</span>
</code></pre></div>
<p><strong>Lesson: <code>val_loss=0.000</code> means nothing is being learned, not that everything is perfect. Always verify your training data reaches the completions.</strong></p>
<h2>Phase 2: Targeted Bug Fixing (The 10-15 Example Rule)</h2>
<p>After fixing the truncation bug, real training started. val_loss: 0.080 (not 0.000!).</p>
<p>I discovered that <strong>every systematic bug can be fixed with 10-15 targeted examples</strong>:</p>
<table>
<thead>
<tr>
<th>Bug</th>
<th style="text-align: center;">Examples needed</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>'optional'</code> validation rule (not a Laravel rule)</td>
<td style="text-align: center;">10</td>
<td>Fixed — generates <code>'nullable'</code></td>
</tr>
<tr>
<td><code>wasRecentlyCreated</code> in resources</td>
<td style="text-align: center;">5</td>
<td>Fixed — uses correct timestamps</td>
</tr>
<tr>
<td>Cross-resource missing imports</td>
<td style="text-align: center;">13</td>
<td>Fixed — 12 bugs → 0</td>
</tr>
<tr>
<td>Missing <code>HasFactory</code> trait</td>
<td style="text-align: center;">20 (fixed existing)</td>
<td>Fixed — 5 bugs → 0</td>
</tr>
</tbody>
</table>
<p>The model already knows PHP. You're nudging a trained distribution, not teaching from scratch. 10-15 diverse examples of the correct pattern are enough.</p>
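<p>A targeted fix example might look like this (field names are illustrative, not necessarily the exact training schema):</p>

```python
import json

# Sketch of one targeted fix example for the 'optional' bug: show the
# correct 'nullable' rule in a realistic context. Field names here are
# illustrative, not necessarily the exact training schema.
example = {
    "prompt": "Generate a StorePostRequest with a required title and an optional excerpt.",
    "completion": (
        "public function rules(): array\n"
        "{\n"
        "    return [\n"
        "        'title' => ['required', 'string', 'max:255'],\n"
        "        'excerpt' => ['nullable', 'string'],\n"
        "    ];\n"
        "}"
    ),
}

# Write 10-15 variations (different models, field names, rule mixes) so
# the model learns the pattern rather than memorizing one example.
with open("fix_optional_rule.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

<p>The diversity matters more than the count: the same fix repeated with one model name teaches a memorized string, not a rule.</p>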
<h3>The Eval Script Trap</h3>
<p>I built an automated bug checker. It flagged <code>StoreBookRequest $request</code> as "missing <code>Illuminate\Http\Request</code> import" because the regex <code>'Request $request'</code> matched as a substring.</p>
<p><strong>Test your eval script on correct code before trusting it.</strong></p>
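<p>The fix is a word-boundary-aware pattern. A minimal sketch of the false positive and the repaired check:</p>

```python
import re

# The buggy eval check: plain substring matching. A form-request
# type-hint contains 'Request $request' as a substring, so it gets
# flagged as a missing Illuminate\Http\Request import.
code = "public function store(StoreBookRequest $request)"
print("Request $request" in code)  # True: false positive

# Repaired check: a negative lookbehind so only the bare Request
# type-hint matches, never SomethingRequest.
pattern = re.compile(r"(?<!\w)Request \$request")
print(bool(pattern.search(code)))                                       # False
print(bool(pattern.search("public function index(Request $request)")))  # True
```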
<h3>Where I Hit the Wall</h3>
<p>After Sprint 9: 52/58 Pest tests passing. 6 failures remained. All were <strong>semantic hallucinations</strong>:</p>
<ul>
<li>Model invents a <code>user()</code> relationship that doesn't exist</li>
<li>Controller uses closure-based eager loading where the array format is correct</li>
<li>Model generates <code>-&gt;withHttpStatus()</code> — a method that doesn't exist</li>
</ul>
<p>Adding more natural-language (NL) training examples didn't help. The model was filling prompt ambiguity with its pretraining priors. The problem wasn't the model — it was the input format.</p>
<h2>Phase 3: The Spec Pivot (The Real Breakthrough)</h2>
<p>Instead of natural language:</p>
<blockquote>
<p>"Create a Post model with author relationship, fillable title and body, soft deletes"</p>
</blockquote>
<p>I switched to structured JSON specs:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">&quot;artifact&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;model&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;class&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;Post&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;table&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;posts&quot;</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;has_factory&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;soft_deletes&quot;</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">&quot;fillable&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">&quot;title&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;body&quot;</span><span class="p">,</span><span class="w"> </span><span class="s2">&quot;user_id&quot;</span><span class="p">],</span>
<span class="w"> </span><span class="nt">&quot;relationships&quot;</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="nt">&quot;type&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;BelongsTo&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;model&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;User&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;method&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;author&quot;</span><span class="p">,</span><span class="w"> </span><span class="nt">&quot;foreign_key&quot;</span><span class="p">:</span><span class="w"> </span><span class="s2">&quot;user_id&quot;</span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<h3>First test: 28 examples, 100 iterations</h3>
<p>Result: <strong>26/26 eval perfect. Zero semantic hallucinations.</strong> (Compare: 308 NL examples still had 5 hallucinations.)</p>
<p>The model can't invent a <code>user()</code> relationship if <code>relationships[]</code> explicitly lists only <code>author</code>. The spec removes the model's ability to hallucinate about <em>what</em> to generate. It only decides <em>how</em>.</p>
<h3>The Spec Compiler</h3>
<p>I built a compiler that validates specs before generation:</p>
<div class="codehilite"><pre><span></span><code>$<span class="w"> </span>python3<span class="w"> </span>spec_compiler.py<span class="w"> </span>bad_spec.json
SpecCompileError:<span class="w"> </span>rules<span class="o">[</span><span class="s1">&#39;venue_id&#39;</span><span class="o">]</span><span class="w"> </span>contains<span class="w"> </span>conditional<span class="w"> </span>token
<span class="s1">&#39;required_on_post&#39;</span>.<span class="w"> </span>Use<span class="w"> </span><span class="s1">&#39;conditional_rules&#39;</span><span class="w"> </span>dict<span class="w"> </span>instead.
</code></pre></div>
<p>Validation: &lt;1ms. Generation: ~30s per file. Catch errors early.</p>
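<p>A minimal sketch of this kind of check (the real <code>spec_compiler.py</code> runs more rules; the token list and error text here are illustrative):</p>

```python
# Minimal sketch of one spec-compiler check: reject conditional tokens
# inside plain validation rules. The real spec_compiler.py runs more
# rules; this token list and error text are illustrative.
class SpecCompileError(Exception):
    pass

CONDITIONAL_TOKENS = {"required_on_post", "required_on_put"}  # hypothetical list

def validate_rules(spec: dict) -> None:
    for field, rules in spec.get("rules", {}).items():
        for rule in rules:
            if rule in CONDITIONAL_TOKENS:
                raise SpecCompileError(
                    f"rules[{field!r}] contains conditional token {rule!r}. "
                    "Use 'conditional_rules' dict instead."
                )

bad_spec = {"rules": {"venue_id": ["integer", "required_on_post"]}}
try:
    validate_rules(bad_spec)
except SpecCompileError as e:
    print(e)
```

<p>Because specs are plain JSON, every check is a dictionary walk: no model in the loop, hence the sub-millisecond validation.</p>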
<h3>Final Results: adapters_spec_v4</h3>
<table>
<thead>
<tr>
<th>Metric</th>
<th style="text-align: center;">NL Pipeline (308 ex)</th>
<th style="text-align: center;">Spec Pipeline (49 ex)</th>
</tr>
</thead>
<tbody>
<tr>
<td>PHP valid</td>
<td style="text-align: center;">26/26</td>
<td style="text-align: center;">26/26</td>
</tr>
<tr>
<td>Pest pass</td>
<td style="text-align: center;">52/58</td>
<td style="text-align: center;"><strong>20/20</strong></td>
</tr>
<tr>
<td>Manual fixes</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">4</td>
</tr>
<tr>
<td>Semantic hallucinations</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;"><strong>0</strong></td>
</tr>
<tr>
<td>Training time</td>
<td style="text-align: center;">~30 min</td>
<td style="text-align: center;">~15 min</td>
</tr>
</tbody>
</table>
<h2>The Debugging Checklist</h2>
<p>Distilled from 34 steps of hitting walls:</p>
<p><strong>Before training:</strong></p>
<ol>
<li>Tokenize ALL examples. Check <code>max(total_tokens) &lt; max_seq_length</code>.</li>
<li>Check <code>min(completion_tokens) &gt; 0</code>. If zero, the system prompt is too long.</li>
<li>Close all GPU-using processes. Check memory with <code>vm_stat</code>.</li>
<li>Use <code>--num-layers 8</code> (not <code>--lora-layers 8</code>) on 16GB machines.</li>
</ol>
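<p>Checks 1 and 2 can be sketched with any tokenizer; the whitespace tokenizer below is a stand-in for the real one:</p>

```python
# Sketch of checklist items 1 and 2: every example must fit inside
# max_seq_length AND leave room for its completion. tokenize() is a
# whitespace stand-in; use your real tokenizer.encode() in practice.
MAX_SEQ_LENGTH = 1500

def tokenize(text: str) -> list:
    return text.split()

def check_dataset(dataset: list) -> None:
    for i, ex in enumerate(dataset):
        prompt_tokens = len(tokenize(ex["prompt"]))
        total = prompt_tokens + len(tokenize(ex["completion"]))
        # Check 1: no example gets truncated at all.
        assert total < MAX_SEQ_LENGTH, f"example {i}: {total} tokens, truncated"
        # Check 2: truncation must never eat the whole completion,
        # i.e. the prompt alone must not fill the window.
        assert prompt_tokens < MAX_SEQ_LENGTH, f"example {i}: prompt alone overflows"

check_dataset([{"prompt": "Generate a Post model", "completion": "class Post {}"}])
print("dataset ok")
```

<p>Run something like this before every training round; it would have caught the six wasted sprints on day one.</p>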
<p><strong>After training:</strong></p>
<ol start="5">
<li>If <code>val_loss = 0.000</code>: training is broken, not perfect.</li>
<li>Generate 3-5 test files and inspect them manually before the full benchmark.</li>
<li>Run <code>php -l</code> on all output (syntax check).</li>
</ol>
<p><strong>When bugs persist:</strong></p>
<ol start="8">
<li>Classify: is it a training data gap or a model capability limit?</li>
<li>If data gap: write 10-15 targeted examples with diverse contexts.</li>
<li>If capability limit: change the input format (structured specs).</li>
<li>If hallucinations persist after targeted training: the problem is <strong>ontological</strong> — the model's pretraining domain model diverges from yours. Give it an explicit ontology (structured spec), don't fight with more NL examples.</li>
</ol>
<h2>What 7B Models Do Well vs Poorly</h2>
<p><strong>Does well:</strong></p>
<ul>
<li>Individual class generation with clear patterns</li>
<li>PHP syntax (very rare errors after basic fine-tuning)</li>
<li>Following explicit rules in the system prompt</li>
<li>CRUD operations with a single model</li>
</ul>
<p><strong>Does poorly:</strong></p>
<ul>
<li>Multi-file consistency (imports across files)</li>
<li>Knowing what NOT to add (hallucinated relationships)</li>
<li>Distinguishing Laravel API versions (mixes 9.x and 13.x patterns)</li>
<li>Complex relationship traversal</li>
</ul>
<p><strong>The key insight:</strong> 7B models don't reason about code. They pattern-match against pretraining. Every persistent bug is a missing pattern. The fix is always: add examples. If that's not enough: change the input format to remove the decision from the model entirely.</p>
<h2>Try It Yourself</h2>
<p>Everything is open source:</p>
<ul>
<li><strong>Spec-trained model</strong>: <a href="https://huggingface.co/fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec">fchis/Laravel-13x-Qwen2.5-Coder-7B-Instruct-LoRA-Spec</a></li>
<li><strong>Training data</strong>: <a href="https://huggingface.co/datasets/fchis/laravel-buildspec-training">fchis/laravel-buildspec-training</a> (49 examples)</li>
<li><strong>Full pipeline</strong>: <a href="https://github.com/florinel-chis/laravel-ai-gen">github.com/florinel-chis/laravel-ai-gen</a></li>
</ul>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>mlx-lm
<span class="c1"># Full pipeline: NL → specs → compile → PHP files</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span><span class="s2">&quot;Create a REST API for managing blog posts with tags&quot;</span>
<span class="c1"># Or use a spec directly</span>
python3<span class="w"> </span>pipeline_spec.py<span class="w"> </span>--spec<span class="w"> </span>my_specs.json<span class="w"> </span>--output<span class="w"> </span>./generated
</code></pre></div>
<p>Runs entirely on Apple Silicon. M1/M2/M3/M4 with 16GB+ RAM.</p>
<hr />
<p><em>This post is an abbreviated version of: "From Hallucination to Ontology: 34 Steps Building a Domain-Specific Code Generator on Consumer Hardware" (Chis, 2026). The full paper with detailed results, bug taxonomy, and infrastructure lessons is available as a preprint.</em></p>
</body>
</html>