---
language:
- ja
- en
license: apache-2.0
base_model: google/gemma-4-31b-it
tags:
- gemma4
- code
- agent
- japanese
- qlora
- react
- mcp
- claude-code
datasets:
- custom
pipeline_tag: text-generation
---

| # gemma4-31b-ja-agent-coder |
|
|
**Japanese-enhanced agentic coding model**: gemma4-31b-it fine-tuned for autonomous coding agents with Japanese language support.
|
|
| ## Highlights |
|
|
| - **Agentic behavior**: ReAct reasoning, multi-step tool calling, self-correction |
| - **Japanese coding**: Code generation, review, debugging in Japanese |
| - **Claude Code compatible**: Designed as a local subagent for Claude Code via MCP |
| - **Function calling**: Native Ollama/OpenAI tool use format |
- **Zero API cost**: Runs locally on a GPU with 20 GB+ of VRAM
|
|
| ## Benchmark Results |
|
|
Evaluated on 12 task categories covering agentic coding capabilities. Each criterion is scored on a 0-1 scale; per-category scores are the average across criteria, rescaled to 0-10.
|
|
| | Category | Base (gemma4-31b-it) | Fine-tuned (v2) | Delta | |
| |----------|:---:|:---:|:---:| |
| | ReAct Tool Call | 10.0 | **10.0** | — | |
| | Function Calling | 8.0 | **10.0** | +2.0 | |
| | Multi-step ReAct | 8.0 | **10.0** | +2.0 | |
| | JP Code Gen (API) | 10.0 | **10.0** | — | |
| | JP Code Gen (Algorithm) | 10.0 | **10.0** | — | |
| | JP Code Gen (Database) | 9.0 | **10.0** | +1.0 | |
| | JP Debug (TypeError) | 10.0 | **10.0** | — | |
| | JP Debug (KeyError) | 10.0 | **10.0** | — | |
| | JP Code Review | 8.0 | **10.0** | +2.0 | |
| | JP Git Strategy | 10.0 | **10.0** | — | |
| | JP Self-correction | 10.0 | **10.0** | — | |
| | JP Documentation | 10.0 | **10.0** | — | |
| | **Overall** | **9.4** | **10.0** | **+0.6** | |
|
|
| ### Key Improvements |
|
|
| - **Function Calling**: Clean `<tool_call>` JSON format output (base model adds extra explanation) |
| - **Multi-step ReAct**: Structured JSON reasoning with proper Thought/Action/Observation flow |
- **Code Review**: Parameterized query suggestions for SQL injection fixes (see the sketch after this list)
| - **Database CRUD**: Complete Create/Read/Update/Delete coverage |
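
As a concrete example of that code-review behavior, the parameterized-query fix looks like this (an illustrative snippet, not a sample from the training set):

```python
# Illustrative snippet (not from the training set): the class of fix the
# model suggests for SQL injection findings during code review.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

user_id = "1 OR 1=1"  # hostile input

# Vulnerable: attacker-controlled input is spliced into the SQL string.
#   conn.execute(f"SELECT name FROM users WHERE id = {user_id}")

# Fixed: the driver binds the value, so it cannot alter the query shape.
rows = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()
print(rows)  # []: the malicious string no longer matches every row
```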
|
|
| ### Inference Test Results (v2 adapter) |
|
|
| | Test | Input | Result | |
| |------|-------|--------| |
| | ReAct | "Read src/main.py using read_file tool" | Correct JSON with thought + action | |
| JP Code Gen | "FastAPIでヘルスチェックエンドポイントを作成" (create a health-check endpoint with FastAPI) | Clean Python with `/healthz` endpoint |
| JP Debug | "TypeError: 'NoneType' is not subscriptable の原因と修正" (cause and fix for the TypeError) | Japanese explanation + fix code |
| | Function Calling | "Use read_file to read README.md" | Clean `<tool_call>` JSON format | |
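
For illustration, output in the `<tool_call>` style referenced above can be extracted like this (the tag name and JSON schema in this sketch are assumptions for demonstration, not the model's guaranteed format):

```python
# Illustrative parser for a <tool_call> style response; the exact tag and
# JSON schema are assumptions, not a documented output contract.
import json
import re

sample = '<tool_call>{"name": "read_file", "arguments": {"path": "README.md"}}</tool_call>'

match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", sample, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    print(call["name"], call["arguments"])  # read_file {'path': 'README.md'}
```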
|
|
| ## Training Details |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Base model | google/gemma-4-31b-it | |
| | Method | QLoRA (4-bit NF4) | |
| | LoRA rank | 16 | |
| | LoRA alpha | 32 | |
| | Target modules | q/k/v/o_proj, gate/up/down_proj | |
| | Trainable params | 133M / 31B (0.43%) | |
| | Training data | 1,546 custom samples (v2) | |
| | Epochs | 2 (3rd epoch interrupted, checkpoint-388 used) | |
| | Learning rate | 1.5e-4 (cosine) | |
| | Final loss | 0.98 | |
| | Token accuracy | 96.8% | |
| | Training time | ~1.5 hours | |
| | Hardware | NVIDIA RTX PRO 6000 (96GB VRAM) | |
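
For reference, the adapter settings in the table map onto a PEFT `LoraConfig` roughly as follows (values the table does not list, such as dropout and bias handling, are assumptions, not taken from the actual run):

```python
# PEFT configuration matching the training table above.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,            # LoRA rank (from the table)
    lora_alpha=32,   # LoRA alpha (from the table)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,  # assumption: not stated in the table
    bias="none",        # assumption
    task_type="CAUSAL_LM",
)
```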
|
|
| ## Training Data Categories |
|
|
| | Category | Samples | Description | |
| |----------|---------|-------------| |
| | ReAct Tool Calling | ~120 | Single/chained tool calls | |
| | Multi-step Agentic Trajectory | ~100 | Plan→Tool→Observe→Correct→Answer loops | |
| | Self-correction | ~40 | Error recovery patterns | |
| | Function Calling | ~50 | Ollama native tool format | |
| | Japanese Code Generation | ~200 | JP instruction → Python/TS code | |
| | Japanese Code Review | ~100 | Security, refactoring, best practices | |
| | Japanese Error Explanation | ~80 | Error → JP diagnosis + fix | |
| | Japanese Comprehension | ~50 | Reading, reasoning, summarization | |
| | Debugging & Troubleshooting | ~100 | Error analysis → root cause → fix | |
| | Git & CI/CD | ~80 | Branch strategy, PR, GitHub Actions | |
| | Project Planning | ~80 | Requirements → task decomposition | |
| | Technical Documentation | ~80 | README, API docs, specs | |
| | Algorithms & Data Structures | ~200 | Binary search, DP, graph, sorting | |
| | Web Frameworks | ~200 | FastAPI, Django, React, Next.js | |
| | Database Operations | ~150 | SQLAlchemy, PostgreSQL, Redis | |
| | Testing & DevOps | ~150 | pytest, Docker, K8s, Terraform | |
|
|
| ## Use with Ollama |
|
|
| ```bash |
| # After GGUF conversion |
| ollama create gemma4-ja-agent-coder -f Modelfile |
| ollama run gemma4-ja-agent-coder |
| ``` |
|
|
| ## Use with helix-agents (Claude Code MCP) |
|
|
| Reduce Claude Code API token consumption by delegating routine tasks to this local model. |
|
|
| ```json |
| { |
| "mcpServers": { |
| "helix-agents": { |
| "command": "uv", |
| "args": ["run", "--directory", "/path/to/helix-agent", "python", "server.py"] |
| } |
| } |
| } |
| ``` |
|
|
| ## Use with transformers |
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization, matching the QLoRA training setup
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
base = AutoModelForCausalLM.from_pretrained("google/gemma-4-31b-it",
                                            quantization_config=bnb, device_map="auto")
# Attach the fine-tuned LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base, "Tsunamayo7/gemma4-31b-ja-agent-coder")
tokenizer = AutoTokenizer.from_pretrained("Tsunamayo7/gemma4-31b-ja-agent-coder")
```
|
|
> **Note**: Gemma4 uses `Gemma4ClippableLinear`, which requires a PEFT monkey-patch. See [this gist](https://gist.github.com/) for the workaround.
|
|
| ## License |
|
|
Apache 2.0 (same as the base model)
|
|
| ## Author |
|
|
| [tsunamayo7](https://github.com/tsunamayo7) — Builder of [helix-agents](https://github.com/tsunamayo7/helix-agents), a local LLM delegation framework for Claude Code. |
|
|