---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---
# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios
<div align="center">
[![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking)
[![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
[![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
</div>
## Model Summary
**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.
For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).
---
## Key Results
### General Code Benchmarks
| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.14 | **52.3** |
### Industrial Code Benchmarks
| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |
> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.
---
## Model Architecture
Same architecture as InCoder-32B, with thinking-aware post-training:
| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (ΞΈ = 500,000) |
| Precision | BFloat16 |
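To see how the large RoPE base relates to the 128K context window, here is a quick back-of-the-envelope sketch in plain Python. The values come from the table above; the derivation assumes the standard RoPE formulation (one rotation frequency per pair of head dimensions), not any model-specific variant:

```python
import math

# Values from the hyperparameter table above
hidden_size = 5120
num_heads = 40
head_dim = hidden_size // num_heads  # 128 dims per attention head
rope_theta = 500_000.0

# Standard RoPE assigns one rotation frequency per pair of head dimensions:
# inv_freq[i] = theta^(-2i / head_dim)
inv_freq = [rope_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# The slowest-rotating pair sets the longest position "wavelength" the
# encoding can resolve; theta = 500k (vs. the common 10k) stretches it
# well past the 131,072-token context window.
longest_wavelength = 2 * math.pi / inv_freq[-1]
print(f"{len(inv_freq)} frequency pairs, "
      f"longest wavelength ~{longest_wavelength:,.0f} positions")
```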
---
## How Thinking Mode Works
InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:
1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code
Example output:
```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>
module uart_tx (
    input wire clk,
    ...
```
You can **disable** thinking mode to get direct answers (behaves like the instruct variant):
```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```
---
## Usage
### Installation
```bash
pip install transformers accelerate
```
### Thinking Mode (default)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n int i = threadIdx.x;\n if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=True,
                         temperature=0.6, top_p=0.85, top_k=20)
output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking, response = output.split("</think>", 1)
    thinking = thinking.replace("<think>", "").strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response.strip()}")
else:
    print(output)
```
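If you would rather not inline the string handling, the trace/response split can be factored into a small reusable helper. This is a sketch; it assumes only the `<think>...</think>` tag format described above and falls back gracefully when no trace is present (e.g. with `enable_thinking=False`):

```python
import re

def split_thinking(output: str) -> tuple[str, str]:
    """Split raw model output into (thinking, response).

    Returns an empty thinking trace when no </think> tag is present,
    e.g. when generation ran with enable_thinking=False.
    """
    match = re.search(r"<think>\s*(.*?)\s*</think>\s*(.*)", output, re.DOTALL)
    if match:
        return match.group(1), match.group(2)
    return "", output.strip()

# Example on a synthetic output string
sample = "<think>\nPlan the FSM first.\n</think>\nmodule uart_tx (...);"
thinking, response = split_thinking(sample)
```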
### Non-Thinking Mode
```python
# Disable thinking: direct answer without reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```
### With Tool Calls
```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
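What the model actually emits for a tool call depends on the chat template baked into the tokenizer. As an illustration only, assuming the common convention of JSON wrapped in `<tool_call>...</tool_call>` tags (verify against this model's actual template before relying on it), parsing could look like:

```python
import json
import re

def extract_tool_calls(output: str) -> list[dict]:
    """Parse tool calls from raw model output.

    ASSUMPTION: the chat template serializes each call as JSON inside
    <tool_call>...</tool_call> tags, a common convention. Check this
    model's tokenizer chat template for the exact format it uses.
    """
    calls = []
    for blob in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", output, re.DOTALL):
        calls.append(json.loads(blob))
    return calls

# Example on a synthetic output string
sample = (
    '<tool_call>\n{"name": "run_verilog_sim", '
    '"arguments": {"code": "module m; endmodule", "testbench": "tb"}}\n</tool_call>'
)
calls = extract_tool_calls(sample)
```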
### Deployment with vLLM
```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
### Recommended Sampling Parameters
| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | – | 4096 |
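For convenience, the table above can be kept as ready-made keyword dictionaries for `model.generate`. This is a small sketch; `do_sample=True` is added because `transformers` ignores sampling parameters like `temperature` under greedy decoding:

```python
# Sampling presets mirroring the table above. Usage:
#   model.generate(**inputs, **SAMPLING_PRESETS["thinking"])
SAMPLING_PRESETS = {
    "thinking": dict(do_sample=True, temperature=0.6, top_p=0.85,
                     top_k=20, max_new_tokens=8192),
    "precise": dict(do_sample=True, temperature=0.2, top_p=0.95,
                    max_new_tokens=4096),
}
```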
---
## Model Family
| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
---
## Limitations & Disclaimers
- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
- **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
- **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
- **Optimization**: Correct but sub-optimal GPU kernel performance.
Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.
---
## Citation
```bibtex
@article{yang2026incoder,
title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
and others},
journal={arXiv preprint arXiv:2603.16790},
year={2026}
}
```