---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---

# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios

<div align="center">

[![HuggingFace](https://img.shields.io/badge/πŸ€—-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking)
[![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
[![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

</div>

## Model Summary

**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning β€” debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).

---

## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.1 | **52.3** |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |

> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.

---

## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (ΞΈ = 500,000) |
| Precision | BFloat16 |
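
The GQA row has a concrete memory payoff at the 128K context length. As a rough illustration (our own back-of-envelope arithmetic, not an official figure), the per-sequence KV-cache size implied by the table can be sketched as:

```python
# KV-cache size implied by the architecture table above. With GQA, only the
# 8 KV heads are cached per layer, not all 40 query heads.
layers, kv_heads, hidden, heads = 64, 8, 5120, 40
head_dim = hidden // heads        # 128
seq_len = 131_072                 # full 128K context
bytes_per_elem = 2                # BFloat16

# K and V caches: 2 tensors per layer, each of shape [kv_heads, seq_len, head_dim]
kv_bytes = 2 * layers * kv_heads * seq_len * head_dim * bytes_per_elem
print(f"KV cache per sequence at 128K: {kv_bytes / 2**30:.0f} GiB")   # 32 GiB

# With full multi-head attention (40 KV heads) it would be 5x larger:
full_bytes = 2 * layers * heads * seq_len * head_dim * bytes_per_elem
print(f"Without GQA: {full_bytes / 2**30:.0f} GiB")                   # 160 GiB
```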

---

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code

Example output:
```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>

module uart_tx (
    input wire clk,
    ...
```

You can **disable** thinking mode to get direct answers (behaves like the instruct variant):
```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

---

## Usage

### Installation

```bash
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n    int i = threadIdx.x;\n    if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default) β€” model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.85, top_k=20)

output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>\n", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```

### Non-Thinking Mode

```python
# Disable thinking β€” direct answer without reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
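
Once the model emits a tool call, its result is typically fed back as a `tool` message before the next generation turn. The exact wire format of the call (surrounding tags, field names) is chat-template-specific, so the sketch below assumes a JSON payload with `name` and `arguments` fields; verify against the tokenizer's actual template before relying on it.

```python
import json

# Hypothetical tool-call payload as it might appear in the model output.
# The real format depends on the chat template -- this is an assumption.
raw_call = '{"name": "run_verilog_sim", "arguments": {"code": "module t; endmodule", "testbench": ""}}'
call = json.loads(raw_call)

# Dispatch to a local implementation (stubbed here; a real one would
# invoke Icarus Verilog on the supplied sources).
def run_verilog_sim(code: str, testbench: str) -> str:
    return "simulation passed"

result = run_verilog_sim(**call["arguments"])

# Append the tool result so the next apply_chat_template call sees it.
messages = [{"role": "user", "content": "Simulate this module."}]
messages.append({"role": "tool", "name": call["name"], "content": result})
print(messages[-1]["content"])
```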

### Deployment with vLLM

```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
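
The served endpoint speaks the OpenAI chat-completions schema. A minimal client-side sketch, assuming the default host and port (`localhost:8000`) and that the template accepts `enable_thinking` via `chat_template_kwargs` (check your vLLM version's docs for this pass-through):

```python
import json

def build_chat_request(prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat-completions payload for the served model,
    using the sampling values from the recommendation table."""
    return {
        "model": "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6 if thinking else 0.2,
        "top_p": 0.85 if thinking else 0.95,
        "max_tokens": 8192 if thinking else 4096,
        # Pass-through to the chat template; support is template-dependent.
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

payload = build_chat_request("Write a Verilog 4-bit counter with synchronous reset.")
# POST this body to http://localhost:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```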

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | β€” | 4096 |

---

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [πŸ€— IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [πŸ€— IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [πŸ€— IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [πŸ€— IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [πŸ€— IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [πŸ€— IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |

---

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints β€” always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization**: Correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.

---

## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
          and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```