---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---
# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios
<div align="center">

[🤗 Model](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) | [GitHub](https://github.com/CSJianYang/Industrial-Coder) | [Paper](https://huggingface.co/papers/2603.16790) | [License](LICENSE)

</div>
## Model Summary
**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.
For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).
---
## Key Results
### General Code Benchmarks
| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.14 | **52.3** |
### Industrial Code Benchmarks
| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |
> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.
---
## Model Architecture
Same architecture as InCoder-32B, with thinking-aware post-training:
| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
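As a back-of-envelope use of these numbers, the per-token KV-cache footprint follows directly from the GQA configuration. This is a rough sketch; the head dimension is inferred as hidden size divided by attention heads, which the table does not state explicitly:

```python
# Rough KV-cache estimate from the architecture table above.
layers = 64
hidden_size = 5120
attn_heads = 40
kv_heads = 8                           # GQA: only KV heads are cached
head_dim = hidden_size // attn_heads   # inferred: 128
bytes_per_elem = 2                     # BFloat16

# K and V caches per token, summed across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token // 1024, "KiB per token")  # 256 KiB

# At the full 128K context the cache alone is substantial
full_ctx_gib = kv_bytes_per_token * 131072 / 1024**3
print(f"{full_ctx_gib:.0f} GiB at 128K context")    # 32 GiB
```

This is why the vLLM example further down serves with `--max-model-len 32768` rather than the full context by default.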
---
## How Thinking Mode Works
InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:
1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code
Example output:
```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>
module uart_tx (
input wire clk,
...
```
You can **disable** thinking mode to get direct answers (behaves like the instruct variant):
```python
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
enable_thinking=False
)
```
---
## Usage
### Installation
```bash
pip install transformers accelerate
```
### Thinking Mode (default)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
messages = [
{"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n int i = threadIdx.x;\n if (i < N) c[i] = a[i] + b[i];\n}"}
]
# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=4096,
        do_sample=True, temperature=0.6, top_p=0.85, top_k=20,
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
# Parse the reasoning trace and the final response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>\n", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
### Non-Thinking Mode
```python
# Disable thinking: direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
enable_thinking=False
)
```
### With Tool Calls
```python
tools = [{
"type": "function",
"function": {
"name": "run_verilog_sim",
"description": "Run Verilog simulation with Icarus Verilog",
"parameters": {
"type": "object",
"properties": {
"code": {"type": "string", "description": "Verilog source code"},
"testbench": {"type": "string", "description": "Testbench code"}
}
}
}
}]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
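When the model decides to call a tool, the call has to be extracted from the generated text and the result fed back as a `tool` role message. A minimal sketch of that round trip, assuming the chat template emits tool calls as JSON inside `<tool_call>...</tool_call>` tags (a common convention, but verify against this model's actual template before relying on it):

```python
import json
import re

# Hypothetical model output -- the exact wrapper format is an assumption;
# check the tokenizer's chat template for the real one.
raw = (
    '<tool_call>\n'
    '{"name": "run_verilog_sim", "arguments": '
    '{"code": "module t; endmodule", "testbench": "module tb; endmodule"}}\n'
    '</tool_call>'
)

# Extract the JSON payload between the tags
match = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", raw, re.DOTALL)
call = json.loads(match.group(1)) if match else None
print(call["name"])  # run_verilog_sim

# After executing the tool, its output goes back to the model as a tool message
tool_msg = {"role": "tool", "name": call["name"], "content": "simulation passed"}
```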
### Deployment with vLLM
```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
--tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
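`vllm serve` exposes an OpenAI-compatible API (on port 8000 by default), so the served model can be queried with a plain HTTP request. A stdlib-only sketch; the prompt and the `query` helper are illustrative, and the call should only be made with the server actually running:

```python
import json
from urllib import request as urlrequest

# vLLM's OpenAI-compatible chat endpoint (default host/port)
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking",
    "messages": [
        {"role": "user", "content": "Write a 4-bit ripple-carry adder in Verilog."}
    ],
    "temperature": 0.6,
    "top_p": 0.85,
    "max_tokens": 4096,
}

def query(url: str = URL) -> dict:
    """Send the request; call only with the vLLM server running."""
    req = urlrequest.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())
```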
### Recommended Sampling Parameters
| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | – | 4096 |
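The table above maps directly onto `model.generate` keyword arguments. A convenience sketch; the preset names are my own, not part of the model's API:

```python
# Sampling presets mirroring the recommended-parameters table; names are illustrative.
SAMPLING_PRESETS = {
    "thinking": dict(do_sample=True, temperature=0.6, top_p=0.85,
                     top_k=20, max_new_tokens=8192),
    "precise":  dict(do_sample=True, temperature=0.2, top_p=0.95,
                     max_new_tokens=4096),  # no top_k restriction
}

# Usage: out = model.generate(**inputs, **SAMPLING_PRESETS["thinking"])
print(SAMPLING_PRESETS["thinking"]["temperature"])  # 0.6
```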
---
## Model Family
| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
---
## Limitations & Disclaimers
- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
- **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
- **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
- **Optimization**: Correct but sub-optimal GPU kernel performance.
Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.
---
## Citation
```bibtex
@article{yang2026incoder,
title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
and others},
journal={arXiv preprint arXiv:2603.16790},
year={2026}
}
```