---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---

# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios

<div align="center">

[Model](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking)
[Code](https://github.com/CSJianYang/Industrial-Coder)
[Paper](https://huggingface.co/papers/2603.16790)
[License](LICENSE)

</div>
## Model Summary

**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning, such as debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).

---

## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.1 | **52.3** |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |

> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.

---

## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |
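
As a back-of-envelope check on serving cost, the table implies the KV-cache footprint below. This is a sketch, not an official figure: it assumes `head_dim = hidden_size / attention_heads` (not stated in the card) and a standard bfloat16 cache with no quantization or paging overhead.

```python
# Estimate KV-cache memory per token from the architecture table.
layers = 64
hidden_size = 5120
attn_heads = 40
kv_heads = 8            # GQA: only the 8 KV heads are cached
bytes_per_elem = 2      # bfloat16

# Assumption: head_dim = hidden_size / attention_heads.
head_dim = hidden_size // attn_heads                            # 128
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # K and V tensors
print(per_token)                                                # 262144 bytes = 256 KiB/token

max_ctx = 131072
print(per_token * max_ctx / 2**30)                              # 32.0 GiB for one full-context sequence
```

This is why long-context serving of the 32B model typically needs multiple GPUs (or a reduced `--max-model-len`) even when the weights themselves fit.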

---

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code

Example output:
```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>

module uart_tx (
    input wire clk,
    ...
```

You can **disable** thinking mode to get direct answers (behaves like the instruct variant):
```python
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

---

## Usage

### Installation

```bash
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n int i = threadIdx.x;\n if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, temperature=0.6, top_p=0.85, top_k=20)

output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```
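
If you post-process many generations, the split-based parsing above can be wrapped in a small helper. This is a convenience sketch introduced here (`parse_thinking` is not part of the model's API); it handles both modes uniformly by returning an empty trace when no `</think>` tag is present.

```python
import re

def parse_thinking(output: str) -> tuple[str, str]:
    """Split a generation into (thinking, response).

    Returns an empty thinking trace when no </think> tag is present,
    so downstream code can treat thinking and non-thinking output alike.
    """
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if m is None:
        return "", output.strip()
    return m.group(1).strip(), output[m.end():].strip()

thinking, response = parse_thinking("<think>\nPlan the FSM first.\n</think>\nmodule uart_tx();")
print(thinking)   # Plan the FSM first.
print(response)   # module uart_tx();
```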

### Non-Thinking Mode

```python
# Disable thinking: direct answer without a reasoning trace
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, tools=tools
)
```
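
On the application side, each tool call emitted by the model must be routed to a local function. The exact emission format depends on the chat template; the sketch below assumes a JSON object with `name` and `arguments` fields (an assumption, not confirmed by this card), and `run_verilog_sim` here is a stub rather than a real Icarus Verilog wrapper.

```python
import json

# Hypothetical stub: a real implementation would write the sources to
# temp files and invoke iverilog/vvp, returning the simulation log.
def run_verilog_sim(code: str, testbench: str = "") -> dict:
    return {"status": "ok", "lines_of_code": len(code.splitlines())}

# Registry mapping tool names (as declared in `tools`) to local callables.
TOOLS = {"run_verilog_sim": run_verilog_sim}

def dispatch(tool_call_json: str) -> dict:
    """Execute one tool call, assuming a {"name": ..., "arguments": {...}} payload."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])

result = dispatch('{"name": "run_verilog_sim", "arguments": {"code": "module m(); endmodule"}}')
print(result["status"])  # ok
```

The result would then be appended to `messages` as a tool-role turn and the conversation re-templated for the next generation step.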

### Deployment with vLLM

```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
  --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | – | 4096 |
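
The table can be encoded as `generate()` kwargs so both modes stay in sync with one source of truth. A convenience sketch (`sampling_params` is a helper name introduced here, not part of the model's API):

```python
def sampling_params(thinking: bool = True) -> dict:
    """Recommended generation kwargs from the table above."""
    if thinking:
        return {"temperature": 0.6, "top_p": 0.85, "top_k": 20, "max_new_tokens": 8192}
    # Non-thinking / precise: no top_k restriction, shorter budget
    return {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 4096}

print(sampling_params()["temperature"])  # 0.6
# e.g. model.generate(**inputs, **sampling_params(thinking=False))
```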

---

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |

---

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization**: Correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.

---

## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
          and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```