---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---
# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios
<div align="center">

[🤗 Model](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) | [GitHub](https://github.com/CSJianYang/Industrial-Coder) | [Paper](https://huggingface.co/papers/2603.16790) | [License](LICENSE)

</div>
## Model Summary
**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.
For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).
---
## Key Results
### General Code Benchmarks
| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.14 | **52.3** |
### Industrial Code Benchmarks
| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |
> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.
---
## Model Architecture
Same architecture as InCoder-32B, with thinking-aware post-training:
| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
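As a back-of-envelope use of these numbers, the per-token KV-cache footprint follows directly from the GQA configuration. This is a rough sketch; the head dimension is inferred as hidden size divided by attention heads, which the table does not state explicitly:

```python
# Rough KV-cache estimate from the architecture table above.
layers = 64
hidden_size = 5120
attn_heads = 40
kv_heads = 8                           # GQA: only KV heads are cached
head_dim = hidden_size // attn_heads   # inferred: 128
bytes_per_elem = 2                     # BFloat16

# K and V caches per token, summed across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token // 1024, "KiB per token")  # 256 KiB

# At the full 128K context the cache alone is substantial
full_ctx_gib = kv_bytes_per_token * 131072 / 1024**3
print(f"{full_ctx_gib:.0f} GiB at 128K context")    # 32 GiB
```

This is why the vLLM example further down serves with `--max-model-len 32768` rather than the full context by default.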
---
## How Thinking Mode Works
InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:
1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code
Example output:
```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>
module uart_tx (
input wire clk,
...
```
You can **disable** thinking mode to get direct answers (behaves like the instruct variant):
```python
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
enable_thinking=False
)
```
---
## Usage
### Installation
```bash
pip install transformers accelerate
```
### Thinking Mode (default)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
messages = [
{"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n int i = threadIdx.x;\n if (i < N) c[i] = a[i] + b[i];\n}"}
]
# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs, max_new_tokens=4096,
        do_sample=True, temperature=0.6, top_p=0.85, top_k=20,
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)
# Parse the reasoning trace and the final response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>\n", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
### Non-Thinking Mode
```python
# Disable thinking: direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True,
enable_thinking=False
)
```
### With Tool Calls
```python
tools = [{
"type": "function",
"function": {
"name": "run_verilog_sim",
"description": "Run Verilog simulation with Icarus Verilog",
"parameters": {
"type": "object",
"properties": {
"code": {"type": "string", "description": "Verilog source code"},
"testbench": {"type": "string", "description": "Testbench code"}
}
}
}
}]
text = tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
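When the model decides to call a tool, the call has to be extracted from the generated text and the result fed back as a `tool` role message. A minimal sketch of that round trip, assuming the chat template emits tool calls as JSON inside `<tool_call>...</tool_call>` tags (a common convention, but verify against this model's actual template before relying on it):

```python
import json
import re

# Hypothetical model output -- the exact wrapper format is an assumption;
# check the tokenizer's chat template for the real one.
raw = (
    '<tool_call>\n'
    '{"name": "run_verilog_sim", "arguments": '
    '{"code": "module t; endmodule", "testbench": "module tb; endmodule"}}\n'
    '</tool_call>'
)

# Extract the JSON payload between the tags
match = re.search(r"<tool_call>\s*(\{.*\})\s*</tool_call>", raw, re.DOTALL)
call = json.loads(match.group(1)) if match else None
print(call["name"])  # run_verilog_sim

# After executing the tool, its output goes back to the model as a tool message
tool_msg = {"role": "tool", "name": call["name"], "content": "simulation passed"}
```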
### Deployment with vLLM
```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
--tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```
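`vllm serve` exposes an OpenAI-compatible API (on port 8000 by default), so the served model can be queried with a plain HTTP request. A stdlib-only sketch; the prompt and the `query` helper are illustrative, and the call should only be made with the server actually running:

```python
import json
from urllib import request as urlrequest

# vLLM's OpenAI-compatible chat endpoint (default host/port)
URL = "http://localhost:8000/v1/chat/completions"

payload = {
    "model": "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking",
    "messages": [
        {"role": "user", "content": "Write a 4-bit ripple-carry adder in Verilog."}
    ],
    "temperature": 0.6,
    "top_p": 0.85,
    "max_tokens": 4096,
}

def query(url: str = URL) -> dict:
    """Send the request; call only with the vLLM server running."""
    req = urlrequest.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlrequest.urlopen(req) as resp:
        return json.loads(resp.read())
```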
### Recommended Sampling Parameters
| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | – | 4096 |
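The table above maps directly onto `model.generate` keyword arguments. A convenience sketch; the preset names are my own, not part of the model's API:

```python
# Sampling presets mirroring the recommended-parameters table; names are illustrative.
SAMPLING_PRESETS = {
    "thinking": dict(do_sample=True, temperature=0.6, top_p=0.85,
                     top_k=20, max_new_tokens=8192),
    "precise":  dict(do_sample=True, temperature=0.2, top_p=0.95,
                     max_new_tokens=4096),  # no top_k restriction
}

# Usage: out = model.generate(**inputs, **SAMPLING_PRESETS["thinking"])
print(SAMPLING_PRESETS["thinking"]["temperature"])  # 0.6
```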
---
## Model Family
| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
---
## Limitations & Disclaimers
- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
- **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
- **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
- **Optimization**: Correct but sub-optimal GPU kernel performance.
Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.
---
## Citation
```bibtex
@article{yang2026incoder,
title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
and others},
journal={arXiv preprint arXiv:2603.16790},
year={2026}
}
```