---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- code
- industrial-code
- reasoning
- thinking
- verilog
- cuda
- triton
- chip-design
- cad
---

# InCoder-32B-Thinking: Reasoning Code Model for Industrial Scenarios
[![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder) [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790) [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
## Model Summary

**InCoder-32B-Thinking** is the reasoning variant of the InCoder family. It extends [InCoder-32B](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) with chain-of-thought reasoning via `<think>...</think>` tags, enabling step-by-step problem decomposition before generating code. This is particularly effective for complex industrial tasks that require multi-step reasoning: debugging RTL modules, optimizing GPU kernels, or diagnosing embedded firmware issues.

For the instruction-tuned variant (without thinking), see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the pre-trained base model, see [IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base).

---

## Key Results

### General Code Benchmarks

| Benchmark | InCoder-32B | InCoder-32B-Thinking |
|---|:---:|:---:|
| HumanEval+ | 89.6 | **91.5** |
| MBPP+ | 78.3 | **80.1** |
| BigCodeBench (Full) | 49.8 | **51.2** |
| LiveCodeBench (Pass@1) | 49.1 | **52.3** |

### Industrial Code Benchmarks

| Benchmark | Domain | InCoder-32B | InCoder-32B-Thinking |
|---|---|:---:|:---:|
| VeriScope Score | Chip Design | 80.7 | **82.3** |
| CAD-Coder Compile (%) | 3D Modeling | 82.0 | **84.0** |
| KernelBench L1 (%) | GPU Optimization | 22.2 | **24.0** |

> The thinking variant shows consistent improvements across both general and industrial benchmarks, with the largest gains on tasks requiring multi-step reasoning.

---

## Model Architecture

Same architecture as InCoder-32B, with thinking-aware post-training:

| Hyperparameter | Value |
|---|---|
| Parameters | ~32B |
| Layers | 64 |
| Hidden Size | 5,120 |
| Attention Heads | 40 (8 KV heads, GQA) |
| Max Context Length | 131,072 (128K) |
| Positional Encoding | RoPE (θ = 500,000) |
| Precision | BFloat16 |

---

## How Thinking Mode Works

InCoder-32B-Thinking generates a reasoning trace inside `<think>...</think>` tags before producing the final answer. This allows the model to:

1. **Decompose** complex problems into sub-tasks
2. **Reason** about constraints, edge cases, and hardware semantics
3. **Plan** the solution structure before writing code

Example output:

```
<think>
The user wants a UART transmitter module. Let me think through the design:
1. Need a state machine: IDLE -> START_BIT -> DATA_BITS -> STOP_BIT
2. 8N1 means: 8 data bits, no parity, 1 stop bit
3. Need a baud rate counter derived from the clock frequency
4. Shift register to serialize the 8-bit data LSB first
</think>

module uart_tx (
    input wire clk,
    ...
```

You can **disable** thinking mode to get direct answers (behaves like the instruct variant):

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```

---

## Usage

### Installation

```bash
pip install transformers accelerate
```

### Thinking Mode (default)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "user", "content": "Optimize this CUDA kernel for better memory coalescing:\n__global__ void add(float *a, float *b, float *c, int N) {\n    int i = threadIdx.x;\n    if (i < N) c[i] = a[i] + b[i];\n}"}
]

# Thinking mode (default): the model reasons before answering
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=4096,
        temperature=0.6,
        top_p=0.85,
        top_k=20,
    )
output = tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False)

# Parse thinking and response
if "</think>" in output:
    thinking = output.split("</think>")[0].replace("<think>", "").strip()
    response = output.split("</think>")[1].strip()
    print(f"Thinking:\n{thinking}\n\nResponse:\n{response}")
else:
    print(output)
```

### Non-Thinking Mode

```python
# Disable thinking: direct answer without reasoning trace
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
```

### With Tool Calls

```python
tools = [{
    "type": "function",
    "function": {
        "name": "run_verilog_sim",
        "description": "Run Verilog simulation with Icarus Verilog",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Verilog source code"},
                "testbench": {"type": "string", "description": "Testbench code"}
            }
        }
    }
}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    tools=tools
)
```

### Deployment with vLLM

```bash
vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Thinking \
    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
```

### Recommended Sampling Parameters

| Use case | temperature | top_p | top_k | max_new_tokens |
|---|:---:|:---:|:---:|:---:|
| Thinking (default) | 0.6 | 0.85 | 20 | 8192 |
| Non-thinking / precise | 0.2 | 0.95 | — | 4096 |

---

## Model Family

| Model | Type | HuggingFace |
|---|---|---|
| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
| **InCoder-32B-Thinking** | **Reasoning** | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |

---

## Limitations & Disclaimers

- The thinking trace may occasionally contain reasoning errors or hallucinated constraints; always verify the final code output.
- For simple tasks, thinking mode adds latency; use `enable_thinking=False` for straightforward generation.
- Based on failure analysis, the model may struggle with:
  - **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
  - **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
  - **Optimization**: Correct but sub-optimal GPU kernel performance.

Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.

---

## Citation

```bibtex
@article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
}
```