---
language:
- en
- he
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model: unsloth/gemma-4-E4B-it
datasets:
- BrainboxAI/code-training-il
- nvidia/OpenCodeInstruct
- bleugreen/typescript-instruct
tags:
- code
- python
- typescript
- coding-assistant
- gguf
- llama.cpp
- ollama
- unsloth
- gemma4
- qlora
- text-generation
- on-device
- private-first
pretty_name: Code-IL E4B (Local Coding Assistant)
model-index:
- name: code-il-E4B
  results: []
---

# Code-IL E4B

**A 4B-parameter coding assistant for Python and TypeScript — runs entirely on-device, no code ever leaves your machine.**

[GGUF model](https://huggingface.co/BrainboxAI/code-il-E4B)
[Training data](https://huggingface.co/datasets/BrainboxAI/code-training-il)
[Safetensors](https://huggingface.co/BrainboxAI/code-il-E4B-safetensors)
[License: Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

---

## Model overview

`code-il-E4B` is a 4-billion-parameter coding assistant fine-tuned from Google's Gemma-4 E4B. It is trained on a curated set of Python and TypeScript instruction pairs — filtered by test-pass rate — plus a small hand-written bilingual (Hebrew / English) identity set.

The entire model is 4 GB in GGUF Q4_K_M form. It runs on:

- A modern laptop CPU (slower but functional)
- Any consumer GPU with 6 GB+ VRAM
- Apple Silicon via llama.cpp Metal

No API. No telemetry. No data leaving the developer's machine.

## Why this exists

Every keystroke sent to a cloud coding assistant is a potential data-leak event. For companies building proprietary systems — especially in regulated industries like finance, healthcare, and defense — this is not acceptable.

`code-il-E4B` is the private alternative: a model small enough to run locally, tuned specifically for the two languages most companies actually write in.

It is not competing with Claude Sonnet or GPT-4o on raw capability. It offers something different: useful AI assistance without a network connection.

## Intended use

**Primary use cases:**

- Local code completion and review in regulated environments
- On-prem deployment for companies with strict data-residency rules
- Pair-programming for developers with unreliable internet
- Integration into internal developer tooling that cannot call external APIs
- Hebrew-speaking developer onboarding (the model responds in Hebrew on request)

**Out-of-scope uses:**

- Replacement for frontier models on complex architecture tasks
- Production code generation without human review
- Languages other than Python / TypeScript (coverage is minimal)
- Fine-tuning tasks requiring >4B parameters of capacity

## How to use

### Ollama

```bash
ollama pull hf.co/BrainboxAI/code-il-E4B:Q4_K_M
ollama run hf.co/BrainboxAI/code-il-E4B:Q4_K_M
```
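
To pin sampling settings in Ollama, you can wrap the model in a Modelfile — a sketch; the values match the recommended generation parameters later in this card:

```text
# Modelfile sketch: bake the recommended sampling settings into a local model tag
FROM hf.co/BrainboxAI/code-il-E4B:Q4_K_M
PARAMETER temperature 0.2
PARAMETER top_p 0.95
```

Build and run it with `ollama create code-il -f Modelfile && ollama run code-il`.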

### llama.cpp

```bash
./llama-cli -m code-il-E4B.Q4_K_M.gguf \
  -p "Write a Python function that parses ISO-8601 dates with timezones." \
  --temp 0.2 --top-p 0.95 -n 1024
```

### Python (transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BrainboxAI/code-il-E4B-safetensors")
model = AutoModelForCausalLM.from_pretrained(
    "BrainboxAI/code-il-E4B-safetensors",
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Implement binary search in TypeScript with full edge-case handling."},
]
# Move inputs to the model's device; do_sample=True is required for
# temperature/top_p to take effect (they are ignored under greedy decoding).
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    inputs, max_new_tokens=1024, do_sample=True, temperature=0.2, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Recommended generation parameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `temperature` | 0.2 | Low creativity for deterministic code |
| `top_p` | 0.95 | Slightly higher than the legal model (`law-il-E2B`) to allow idiom variety |
| `max_new_tokens` | 1024 | Enough for most function-level completions |
| `repetition_penalty` | 1.0 | Penalizing repetition hurts code structure |
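
These settings can be collected once and reused across calls; a minimal sketch as plain keyword arguments for `model.generate` (the `do_sample=True` flag is an addition needed for the sampling parameters to take effect):

```python
# Recommended defaults from the table above, as kwargs for model.generate(...).
GEN_KWARGS = dict(
    do_sample=True,          # temperature/top_p are ignored under greedy decoding
    temperature=0.2,         # low creativity for deterministic code
    top_p=0.95,              # slight headroom for idiom variety
    max_new_tokens=1024,     # enough for most function-level completions
    repetition_penalty=1.0,  # leave repetition unpenalized; code legitimately repeats tokens
)
```

Then call `model.generate(inputs, **GEN_KWARGS)` wherever the examples above generate text.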

### Recommended System Prompt: Semi-Formal Reasoning

This 4B model produces dramatically better code when it is forced to think through 5 explicit steps before writing. Free-form prompts often yield code that compiles but fails on edge cases, ships without tests, or hides subtle bugs.

**Why this matters:** Small coding models tend to skip the "thinking" phase and jump straight to code. The semi-formal reasoning template forces the model to do what a senior engineer does: understand the problem, enumerate edge cases, write the code, define tests, then honestly disclose what could break.

#### The 5 Reasoning Steps

1. **Problem Understanding** - restate the requirement, identify ambiguities
2. **Edge Cases and Constraints** - enumerate what could go wrong before coding
3. **Implementation** - the actual code, with inline comments only where needed
4. **Tests** - concrete test cases covering happy path + edge cases
5. **Known Limitations** - what this code does NOT handle, dependencies, assumptions

#### The System Prompt (copy as-is)

```text
DEFINITIONS:
success: Working code that handles the stated requirement plus enumerated edge cases, includes tests proving correctness, and honestly discloses what is out of scope. No invented APIs, no hallucinated library functions.
scope: in-scope - Python and TypeScript code (functions, classes, modules), code review, refactoring, debugging, test writing, algorithm implementation. out-of-scope - Languages other than Python/TypeScript (model is weak there), full-application architecture, infrastructure design, code that requires runtime testing the model cannot perform.
hallucination risk: This model was trained on public code with a cutoff in early 2026. Library APIs change. The model may invent function signatures that do not exist. Every API call must either be from a stable, well-known library OR explicitly marked as "verify in docs."
edge case: A specific input value or condition that breaks naive implementations - empty inputs, null/None, single-element collections, duplicates, boundary values (0, MAX_INT, negative numbers), Unicode/encoding issues, concurrent access, etc.

PREMISES:
- The user is a developer, not a beginner. Skip basic explanations of what a function or loop is.
- The model is 4B parameters - capable for function-level work but not for full systems.
- Code that "looks right" but fails silently is worse than code with a clear error. Prefer fail-fast.
- Tests are not optional. Code without tests is a draft, not a deliverable.
- User can speak Hebrew or English. Code stays in English. Comments match the user input language.

REQUIREMENTS:
1. Every code response must include all 5 sections: Problem Understanding, Edge Cases, Implementation, Tests, Known Limitations. No exceptions.
2. Implementation must compile/parse cleanly. No pseudo-code unless explicitly requested.
3. Use only standard library or widely-known third-party libraries. If using a non-standard library, mark it: "# Requires: pip install <package>".
4. Never invent function signatures. If unsure whether a function exists, write: "# Verify signature in docs: <library>.<function>".
5. Tests must be runnable as-is. Use unittest/pytest for Python, jest/vitest for TypeScript.
6. Edge cases section must list at minimum 3 concrete cases the code handles, plus 1 case it does NOT handle (with rationale).
7. Known Limitations must be honest. Do not write "this is production-ready" unless every edge case is handled and tested.
8. Forbidden: silent error handling. No bare `except:` in Python. No empty catch blocks in TypeScript.
9. Forbidden: code that mutates global state without explicit declaration.
10. If the user asks a question that requires runtime testing (performance, integration with their specific environment), respond with the code + clear instructions on how to test it locally.

EDGE_CASES:
- User asks for code in a language other than Python/TypeScript -> "I am specialized for Python and TypeScript. For <language>, the logic is similar but I cannot guarantee idiomatic syntax. Here is the equivalent in Python:" + provide Python version.
- User provides incomplete requirements -> Ask 1-2 clarifying questions before writing code. Do not assume.
- User asks for code that depends on a library released after training cutoff -> "I am unsure about <library> v<X>. Here is the implementation pattern; verify the exact API in current docs."
- User asks "is this code correct?" -> Walk through the 5-step analysis on their code, not yours. Apply the same rigor.
- User asks for "the fastest" or "the best" implementation -> Provide the most readable correct version first, then a note: "For higher performance, consider <approach>" with rationale.
- User asks for code that handles secrets, auth, or crypto -> Add a "Security Note" subsection in Known Limitations. Recommend audited libraries (passlib, cryptography, etc.). Never invent crypto.
- Hebrew question with technical term in English -> Respond in Hebrew, keep variable names and library names in English.
- User asks for "quick and dirty" code -> Still include the 5 sections, but mark Edge Cases and Tests as minimal: "# Quick prototype - not production. Edge cases: <list>. Test manually with: <example>."

OUTPUT_FORMAT:
format: Structured markdown with the 5 numbered sections, code in fenced blocks
structure: |
  ## 1. Problem Understanding
  [Restate the requirement in 1-2 sentences. Note any ambiguities.]

  ## 2. Edge Cases and Constraints
  Handles:
  - [edge case 1]
  - [edge case 2]
  - [edge case 3]

  Does NOT handle:
  - [out-of-scope case + rationale]

  ## 3. Implementation
  ~~~<language>
  // Clean code. Comments only where the WHY is non-obvious.
  ~~~

  ## 4. Tests
  ~~~<language>
  // Runnable tests covering edge cases above
  ~~~

  ## 5. Known Limitations
  - [What this does not handle]
  - [Dependencies and version assumptions]
  - [When you would need to extend this]
language: Match user input language (Hebrew or English) for explanations. Code, variable names, and library names stay in English.
length: 200-800 lines depending on task complexity. Refuse to write monolithic 2000-line responses - break into modules.

VERIFICATION:
- Are all 5 sections present and labeled?
- Does the implementation parse cleanly (no obvious syntax errors)?
- Are tests runnable (correct imports, proper structure)?
- Are at least 3 edge cases enumerated?
- Is at least 1 limitation honestly disclosed?
- regression check: No "production-ready" claims unless edge cases match limitations.
```

#### Usage Example with the System Prompt

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("BrainboxAI/code-il-E4B-safetensors")
model = AutoModelForCausalLM.from_pretrained(
    "BrainboxAI/code-il-E4B-safetensors",
    torch_dtype="auto",
    device_map="auto",
)

# Paste the full DEFINITIONS/PREMISES/REQUIREMENTS prompt above
SYSTEM_PROMPT = """[paste the full prompt from the code block above]"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Implement binary search in Python with full edge case handling."},
]

# do_sample=True is required for temperature/top_p to take effect.
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
outputs = model.generate(
    inputs, max_new_tokens=1500, do_sample=True, temperature=0.2, top_p=0.95
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

#### Customization

- Want code-only output (no explanation)? Replace `OUTPUT_FORMAT` with: "Code blocks only. Comments inside code for any analysis. No prose sections."
- Building a code review tool? Add to `REQUIREMENTS`: "When reviewing user code, output in diff format showing exact changes."
- Need TypeScript-only output? Add to `REQUIREMENTS`: "Always respond in TypeScript. If the user asks for Python, translate to TypeScript with type annotations."
- Working on a security-sensitive codebase? Add a section #6 to `OUTPUT_FORMAT`: "Security Review" listing OWASP-relevant risks in the implementation.
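
One lightweight way to apply such tweaks is to append extra rules to the base prompt at runtime instead of maintaining several prompt copies. A sketch — the `ADDITIONAL_REQUIREMENTS` header and the helper are my own convention, not part of the trained prompt format:

```python
# Sketch: derive customized prompts from one base prompt at runtime.
BASE_SYSTEM_PROMPT = """[paste the full prompt from the code block above]"""

def with_extra_rules(base: str, *rules: str) -> str:
    """Append extra rules under an ADDITIONAL_REQUIREMENTS header (own convention)."""
    lines = [base, "", "ADDITIONAL_REQUIREMENTS:"]
    lines += [f"- {rule}" for rule in rules]
    return "\n".join(lines)

# Example: the code-review variant from the bullet list above.
review_prompt = with_extra_rules(
    BASE_SYSTEM_PROMPT,
    "When reviewing user code, output in diff format showing exact changes.",
)
```

Pass the resulting string as the `system` message exactly as in the usage example above.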

## Training details

| Attribute | Value |
|-----------|-------|
| **Base model** | [unsloth/gemma-4-E4B-it](https://huggingface.co/unsloth/gemma-4-E4B-it) |
| **Method** | QLoRA (4-bit quantization during training) |
| **LoRA rank (r)** | 64 |
| **LoRA alpha** | 128 |
| **Training data size** | 40,330 curated examples |
| **Train / validation split** | 95% / 5%, seed 3407 |
| **Hardware** | NVIDIA RTX 5090 (RunPod) |
| **Framework** | Unsloth Studio |
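
The LoRA hyperparameters above translate into a PEFT-style config; a sketch — the `target_modules` list is an assumption (typical attention and MLP projections), since the card only states rank and alpha:

```python
# QLoRA hyperparameters from the table above, in PEFT LoraConfig form.
# target_modules is an assumption; the card only specifies r=64 and alpha=128.
LORA_HPARAMS = dict(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
# Usage (requires peft): lora_config = peft.LoraConfig(**LORA_HPARAMS)
```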

### Dataset composition (40,330 examples)

| Source | Count | Content |
|--------|-------|---------|
| [OpenCodeInstruct (NVIDIA)](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) | 20,000 | Python — filtered to examples with test-pass rate > 50% |
| [typescript-instruct (bleugreen)](https://huggingface.co/datasets/bleugreen/typescript-instruct) | 20,000 | TypeScript instruction pairs |
| Hand-written identity set | 330 | Hebrew + English, BrainboxAI persona |

The filtering pass on OpenCodeInstruct was the single biggest quality lever. Dropping low-test-pass examples improved downstream evaluation significantly compared to training on the full corpus.
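
That filter can be sketched in a few lines — the `test_pass_rate` field name is hypothetical; check the actual OpenCodeInstruct schema before reusing this:

```python
# Sketch of the quality filter: keep examples whose unit-test pass rate > 50%.
# The "test_pass_rate" field name is hypothetical; verify against the dataset schema.
def passes_filter(example: dict, threshold: float = 0.5) -> bool:
    return example.get("test_pass_rate", 0.0) > threshold

corpus = [
    {"instruction": "reverse a string", "test_pass_rate": 0.9},
    {"instruction": "parse a date", "test_pass_rate": 0.3},
]
curated = [ex for ex in corpus if passes_filter(ex)]  # keeps only the first example
```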

See the [dataset card](https://huggingface.co/datasets/BrainboxAI/code-training-il) for full details.

## Evaluation

Internal evaluation on structured coding tasks:

| Task | Examples | Passed | Notes |
|------|----------|--------|-------|
| **FizzBuzz** (via agentic loop) | 5 | 5/5 | Solved in 6 steps, zero correction rounds |
| **Binary search with 11 edge cases** | 11 | 11/11 | Including leftmost-duplicate handling |

Formal HumanEval / MBPP benchmarks have not yet been run publicly. Evaluation work is ongoing.
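
The leftmost-duplicate case in the table is the kind of behavior the internal suite checks. For reference, a minimal leftmost-occurrence binary search with a few such edge-case assertions (this reference implementation is illustrative, not the model's output):

```python
def leftmost_binary_search(xs: list, target) -> int:
    """Return the index of the leftmost occurrence of target in sorted xs, or -1."""
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < target:
            lo = mid + 1
        else:
            hi = mid  # keep searching left of any match
    return lo if lo < len(xs) and xs[lo] == target else -1

# Edge cases of the kind listed in the evaluation table:
assert leftmost_binary_search([], 1) == -1               # empty input
assert leftmost_binary_search([5], 5) == 0               # single element
assert leftmost_binary_search([1, 2, 2, 2, 3], 2) == 1   # leftmost duplicate
assert leftmost_binary_search([1, 3], 2) == -1           # absent value
```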

## Limitations

- **Small model.** 4B parameters is not frontier capability. Expect mistakes on complex architectural questions and long-context reasoning.
- **Two languages.** Strong on Python and TypeScript; weak on other languages.
- **No tool use out of the box.** The base model supports chat-style interaction; agentic tool use requires integration work.
- **Training cutoff.** Libraries and frameworks introduced after the training data was collected (early 2026) are unknown to the model.
- **Hallucination risk.** Like all LLMs, `code-il-E4B` can produce plausible-looking code that does not compile or does not work. Always test.

## Formats available

- [**GGUF Q4_K_M** (~4 GB)](https://huggingface.co/BrainboxAI/code-il-E4B) — for Ollama, llama.cpp, LM Studio
- [**Safetensors 16-bit**](https://huggingface.co/BrainboxAI/code-il-E4B-safetensors) — for further fine-tuning, HF transformers
## License

Apache 2.0. Use commercially, modify, and redistribute with attribution.

## Citation

```bibtex
@misc{elyasi2026codeil,
  title        = {Code-IL E4B: A Small, On-Device Coding Assistant for Private Environments},
  author       = {Elyasi, Netanel},
  year         = {2026},
  publisher    = {BrainboxAI},
  howpublished = {\url{https://huggingface.co/BrainboxAI/code-il-E4B}},
  note         = {Fine-tuned from unsloth/gemma-4-E4B-it}
}
```

## Author

Built by [**Netanel Elyasi**](https://huggingface.co/BrainboxAI), founder of [BrainboxAI](https://brainboxai.io) — an applied-AI studio focused on small, private, domain-specialized models.

For custom coding-model fine-tuning on private company codebases, contact: **netanele@brainboxai.io**.

---

*Part of the BrainboxAI family of on-device models — see also [`law-il-E2B`](https://huggingface.co/BrainboxAI/law-il-E2B) (legal) and [`cyber-analyst-4B`](https://huggingface.co/BrainboxAI/cyber-analyst-4B) (security).*