Update readme.md

#1
by wuyuverse - opened
Files changed (1)
  1. README.md +65 -62
README.md CHANGED
@@ -5,6 +5,8 @@ pipeline_tag: text-generation
  tags:
  - code
  - industrial-code
  - verilog
  - cuda
  - triton
@@ -12,11 +14,11 @@ tags:
  - cad
  ---

- # InCoder-32B: Code Foundation Model for Industrial Scenarios

  <div align="center">

- [![HuggingFace](https://img.shields.io/badge/πŸ€—-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder)
  [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
  [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
  [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
@@ -25,7 +27,9 @@ tags:

  ## Model Summary

- **InCoder-32B** (Industrial-Coder-32B) is the first 32B-parameter code foundation model purpose-built for industrial code intelligence. While general-purpose code LLMs excel at mainstream software tasks, they often struggle with the unique demands of industrial programming β€” hardware semantics, specialized language constructs, strict resource constraints, and domain-specific correctness verification.

  Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:

@@ -37,68 +41,39 @@ Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenar
  | πŸ”¨ **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
  | πŸ“ **3D Modeling / CAD** | CadQuery, OpenCascade, Python |

- InCoder-32B achieves highly competitive performance on general tasks while establishing the strongest open-source baselines across all evaluated industrial domains.
-
- ---
-
- ## Key Results
-
- ### General Code Benchmarks
-
- | Benchmark | InCoder-32B |
- |---|---|
- | SWE-bench Verified | **74.8%** |
- | LiveCodeBench (Pass@1) | **49.14%** |
- | BFCL v3 | **60.99%** |
- | HumanEval+ | **89.6%** |
- | MBPP+ | **78.3%** |
- | BigCodeBench (Full) | **49.8%** |
-
- ### Industrial Code Benchmarks
-
- | Benchmark | Domain | InCoder-32B | Best Competing Open-Weight |
- |---|---|---|---|
- | VeriScope Score | Chip Design | **80.7** | 83.2 (GLM-5) |
- | CAD-Coder Compile | 3D Modeling | **82.0%** | 48.0% (Kimi-K2-Thinking) |
- | KernelBench L1 | GPU Optimization | **22.2%** | 16.2% (GLM-5) |
- | KernelBench L2 | GPU Optimization | **36.0%** | 28.0% (KernelBench L2) |
-
- > InCoder-32B leads all open-weight baselines on CAD-Coder and KernelBench (all three levels), and even surpasses proprietary models like Claude-Sonnet-4.6 on CAD-Coder IoU and KernelBench L1/L2/L3.
-
  ---

  ## Model Architecture

- InCoder-32B adopts a standard decoder-only Transformer architecture with the following configuration:

  | Hyperparameter | Value |
  |---|---|
  | Parameters | ~32B |
  | Layers | 64 |
  | Hidden Size | 5,120 |
  | Max Context Length | 131,072 (128K) |
  | Positional Encoding | RoPE (ΞΈ = 500,000) |
  | Precision | BFloat16 |

  ---

  ## Training Pipeline: Code-Flow

- InCoder-32B is trained through a three-stage **Code-Flow** pipeline:

  ### Stage 1 β€” Pre-training & Annealing
  - **Industrial Recall**: Data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
  - **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
- - **Training**: 15T total tokens using Autoregressive LM + Fill-in-the-Middle (FIM) objectives.

  ### Stage 2 β€” Mid-Training (Context Extension)
  Context window extended progressively from 8K to 128K tokens:
  - **8K β†’ 32K**: Targets file-level tasks like completing RTL modules or kernel functions.
  - **32K β†’ 128K**: Unlocks long-context capabilities for extended debugging and cross-module projects.

- ### Stage 3 β€” Post-Training
- 2.5M supervised fine-tuning (SFT) samples constructed from real industrial tasks with execution-grounded verification using toolchains like Icarus Verilog, `nvcc`, and Renode (STM32 simulator).
-
  ---

  ## Usage
@@ -109,48 +84,50 @@ Context window extended progressively from 8K to 128K tokens:
  pip install transformers accelerate
  ```

- ### Basic Inference

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch

- model_id = "Multilingual-Multimodal-NLP/IndustrialCoder"

- tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
- device_map="auto"
  )

- prompt = """Write a synthesizable Verilog module for a UART transmitter (8N1 protocol).
- The module should accept 8-bit parallel data and serialize it onto a TX line."""

  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  outputs = model.generate(
  **inputs,
- max_new_tokens=1024,
  temperature=0.2,
  do_sample=True,
  )
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- ### Deployment with vLLM
- For production deployment, you can use vLLM to create an OpenAI-compatible API endpoint.
-
- ```
- vllm serve Multilingual-Multimodal-NLP/IndustrialCoder --tensor-parallel-size 8
- ```
-
  ### Fill-in-the-Middle (FIM)

- InCoder-32B supports FIM completion for code infilling tasks:

  ```python
  prefix = """// CUDA kernel for RMS Normalization
- __global__ void rms_norm_kernel(float* output, const float* input,
  const float* weight, int N, float eps) {
  int idx = blockIdx.x;
  """
@@ -158,22 +135,48 @@ suffix = """
  output[idx * N + tid] = normalized * weight[tid];
  }"""

- fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
  inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=256)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ---

  ## Limitations & Disclaimers

- Based on failure analysis, the model may struggle with:
- - **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
- - **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
- - **Optimization**: Correct but sub-optimal GPU kernel performance.

- Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware) requires expert review before deployment.

  ---

@@ -182,10 +185,10 @@ Always review and test generated code in a sandboxed environment. Industrial cod
  ```bibtex
  @article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
- author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
- and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
  and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
  }
- ```
 
  tags:
  - code
  - industrial-code
+ - pretrained
+ - base-model
  - verilog
  - cuda
  - triton

  - cad
  ---

+ # InCoder-32B-Base: Code Foundation Model for Industrial Scenarios

  <div align="center">

+ [![HuggingFace](https://img.shields.io/badge/πŸ€—-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base)
  [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
  [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
  [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)

  ## Model Summary

+ **InCoder-32B-Base** is the pre-trained base model of the InCoder family β€” the first 32B-parameter code foundation model purpose-built for industrial code intelligence. This is the base (non-instruction-tuned) checkpoint, suitable for code completion, fill-in-the-middle (FIM), and further fine-tuning.
+
+ For the instruction-tuned variant, see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the reasoning variant, see [IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking).

  Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:

  | πŸ”¨ **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
  | πŸ“ **3D Modeling / CAD** | CadQuery, OpenCascade, Python |

  ---

  ## Model Architecture

+ InCoder-32B-Base adopts a standard decoder-only Transformer architecture:

  | Hyperparameter | Value |
  |---|---|
  | Parameters | ~32B |
  | Layers | 64 |
  | Hidden Size | 5,120 |
+ | Attention Heads | 40 (8 KV heads, GQA) |
  | Max Context Length | 131,072 (128K) |
  | Positional Encoding | RoPE (ΞΈ = 500,000) |
  | Precision | BFloat16 |
+ | Vocabulary Size | 76,800 |

  ---

  ## Training Pipeline: Code-Flow

+ InCoder-32B-Base is trained through a two-stage **Code-Flow** pipeline:

  ### Stage 1 β€” Pre-training & Annealing
  - **Industrial Recall**: Data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
  - **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
+ - **Training**: 15T total tokens using Autoregressive LM + Fill-in-the-Middle (FIM) objectives on 4,096 GPUs.

  ### Stage 2 β€” Mid-Training (Context Extension)
  Context window extended progressively from 8K to 128K tokens:
  - **8K β†’ 32K**: Targets file-level tasks like completing RTL modules or kernel functions.
  - **32K β†’ 128K**: Unlocks long-context capabilities for extended debugging and cross-module projects.

  ---

  ## Usage
 
  pip install transformers accelerate
  ```

+ ### Code Completion

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch

+ model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Base"

+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
  model = AutoModelForCausalLM.from_pretrained(
  model_id,
  torch_dtype=torch.bfloat16,
+ device_map="auto",
+ trust_remote_code=True,
  )

+ prompt = """// Synthesizable Verilog: UART transmitter (8N1 protocol)
+ module uart_tx (
+ input wire clk,
+ input wire rst_n,
+ input wire [7:0] data_in,
+ input wire tx_start,
+ output reg tx,
+ output reg tx_busy
+ );
+ """

  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
  outputs = model.generate(
  **inputs,
+ max_new_tokens=512,
  temperature=0.2,
  do_sample=True,
  )
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ### Fill-in-the-Middle (FIM)

+ InCoder-32B-Base supports FIM completion for code infilling tasks:

  ```python
  prefix = """// CUDA kernel for RMS Normalization
+ __global__ void rms_norm_kernel(float* output, const float* input,
  const float* weight, int N, float eps) {
  int idx = blockIdx.x;
  """

  output[idx * N + tid] = normalized * weight[tid];
  }"""

+ fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
  inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
  outputs = model.generate(**inputs, max_new_tokens=256)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
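After generation, the sentinel tokens need to be stripped and the completion spliced back between prefix and suffix. A hypothetical helper along those lines (the sentinel strings mirror the snippet above; verify them against the model's tokenizer config before relying on them):

```python
# Hypothetical FIM post-processing helper. The sentinel token names are taken
# from the README snippet above and are an assumption, not a verified API.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble the prompt in prefix-suffix-middle order."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

def stitch(prefix: str, suffix: str, generated: str) -> str:
    """Extract the generated middle (the model may echo the prompt) and
    rebuild the complete source file."""
    middle = generated.split(FIM_MIDDLE)[-1]
    for tok in (FIM_PREFIX, FIM_SUFFIX):
        middle = middle.replace(tok, "")
    return prefix + middle + suffix

# Toy round-trip: the "model" echoes the prompt and appends "blockIdx.x".
pre, suf = "int idx = ", ";\n"
completed = stitch(pre, suf, build_fim_prompt(pre, suf) + "blockIdx.x")
print(completed)  # int idx = blockIdx.x;
```

Trimming at a stop token such as `<|endoftext|>` may also be needed, depending on the tokenizer's special-token set.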

+ ### Deployment with vLLM
+
+ ```bash
+ vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Base \
+ --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
+ ```
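Once the server is up, any OpenAI-compatible client can query it. A minimal standard-library sketch; the `/v1/completions` path and port 8000 are vLLM's defaults, and the payload fields are the usual completion parameters (adjust host, port, and sampling settings to your deployment):

```python
# Minimal client sketch for the vLLM OpenAI-compatible endpoint started above.
# Assumes the server's default host/port (localhost:8000).
import json
from urllib import request

def build_completion_request(prompt: str, max_tokens: int = 512,
                             temperature: float = 0.2) -> dict:
    """Build a /v1/completions payload for the served model."""
    return {
        "model": "Multilingual-Multimodal-NLP/IndustrialCoder-Base",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_completion_request("// Synthesizable Verilog: 8-bit counter\nmodule counter (")
print(json.dumps(payload, indent=2))

# To send (requires the server to be running):
# req = request.Request(
#     "http://localhost:8000/v1/completions",
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```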
+
+ ---
+
+ ## Fine-tuning
+
+ We provide an SFT framework in the [GitHub repository](https://github.com/CSJianYang/Industrial-Coder/tree/main/sft). See the README for data preparation and training instructions.
+
+ ---
+
+ ## Model Family
+
+ | Model | Type | HuggingFace |
+ |---|---|---|
+ | InCoder-32B-Base | Pre-trained | [πŸ€— IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
+ | InCoder-32B | Instruct | [πŸ€— IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
+ | InCoder-32B-Thinking | Reasoning | [πŸ€— IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
+ | InCoder-32B-FP8 | FP8 Quantized | [πŸ€— IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
+ | InCoder-32B-AWQ-INT4 | AWQ INT4 | [πŸ€— IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
+ | InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [πŸ€— IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
+
  ---

  ## Limitations & Disclaimers

+ This is a **base model** β€” it has not been instruction-tuned and does not follow conversational instructions. It is best suited for:
+ - Code completion and generation
+ - Fill-in-the-middle (FIM) tasks
+ - Further fine-tuning for downstream applications

+ Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.

  ---

  ```bibtex
  @article{yang2026incoder,
  title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
+ author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
+ and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
  and others},
  journal={arXiv preprint arXiv:2603.16790},
  year={2026}
  }
+ ```