Multilingual-Multimodal-NLP/IndustrialCoder-Base

Update readme.md

by wuyuverse - opened 5 days ago

base: refs/heads/main ← from: refs/pr/1

README.md +65 -62


@@ -5,6 +5,8 @@ pipeline_tag: text-generation
 tags:
 - code
 - industrial-code
 - verilog
 - cuda
 - triton
@@ -12,11 +14,11 @@ tags:
 - cad
 ---
-# InCoder-32B: Code Foundation Model for Industrial Scenarios
 <div align="center">
-[![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder)
 [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
 [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
 [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
@@ -25,7 +27,9 @@ tags:
 ## Model Summary
-**InCoder-32B** (Industrial-Coder-32B) is the first 32B-parameter code foundation model purpose-built for industrial code intelligence. While general-purpose code LLMs excel at mainstream software tasks, they often struggle with the unique demands of industrial programming — hardware semantics, specialized language constructs, strict resource constraints, and domain-specific correctness verification.
 Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:
@@ -37,68 +41,39 @@ Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenar
 | 🔨 **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
 | 📐 **3D Modeling / CAD** | CadQuery, OpenCascade, Python |
-InCoder-32B achieves highly competitive performance on general tasks while establishing the strongest open-source baselines across all evaluated industrial domains.
----
-## Key Results
-### General Code Benchmarks
-| Benchmark | InCoder-32B |
-|---|---|
-| SWE-bench Verified | **74.8%** |
-| LiveCodeBench (Pass@1) | **49.14%** |
-| BFCL v3 | **60.99%** |
-| HumanEval+ | **89.6%** |
-| MBPP+ | **78.3%** |
-| BigCodeBench (Full) | **49.8%** |
-### Industrial Code Benchmarks
-| Benchmark | Domain | InCoder-32B | Best Competing Open-Weight |
-|---|---|---|---|
-| VeriScope Score | Chip Design | **80.7** | 83.2 (GLM-5) |
-| CAD-Coder Compile | 3D Modeling | **82.0%** | 48.0% (Kimi-K2-Thinking) |
-| KernelBench L1 | GPU Optimization | **22.2%** | 16.2% (GLM-5) |
-| KernelBench L2 | GPU Optimization | **36.0%** | 28.0% (KernelBench L2) |
-> InCoder-32B leads all open-weight baselines on CAD-Coder and KernelBench (all three levels), and even surpasses proprietary models like Claude-Sonnet-4.6 on CAD-Coder IoU and KernelBench L1/L2/L3.
 ---
 ## Model Architecture
-InCoder-32B adopts a standard decoder-only Transformer architecture with the following configuration:
 | Hyperparameter | Value |
 |---|---|
 | Parameters | ~32B |
 | Layers | 64 |
 | Hidden Size | 5,120 |
 | Max Context Length | 131,072 (128K) |
 | Positional Encoding | RoPE (θ = 500,000) |
 | Precision | BFloat16 |
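
The table above, together with the attention row this PR adds (40 attention heads, 8 KV heads, GQA), is enough for a rough KV-cache estimate. The sketch below is back-of-the-envelope arithmetic only. It assumes the per-head dimension equals hidden size divided by attention heads, which the README does not state, and it ignores weights, activations, and any serving-framework overhead.

```python
# Rough KV-cache sizing from the hyperparameters listed in the README.
LAYERS = 64
HIDDEN_SIZE = 5120
ATTENTION_HEADS = 40      # from the "Attention Heads" row added in this PR
KV_HEADS = 8              # GQA: 8 key/value heads shared by 40 query heads
BYTES_PER_VALUE = 2       # BFloat16
MAX_CONTEXT = 131072      # 128K tokens

# ASSUMPTION: head_dim = hidden_size / attention_heads (not stated in the README).
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS


def kv_cache_bytes(tokens: int, kv_heads: int = KV_HEADS) -> int:
    """Bytes needed to cache keys and values for `tokens` positions of one sequence."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE * tokens


per_token_kib = kv_cache_bytes(1) / 1024
full_context_gib = kv_cache_bytes(MAX_CONTEXT) / 1024**3
# Hypothetical comparison: the same model with one KV head per query head.
no_gqa_gib = kv_cache_bytes(MAX_CONTEXT, kv_heads=ATTENTION_HEADS) / 1024**3

print(f"head_dim (assumed): {HEAD_DIM}")
print(f"KV cache per token: {per_token_kib:.0f} KiB")
print(f"KV cache at 128K context: {full_context_gib:.0f} GiB")
print(f"Same context without GQA: {no_gqa_gib:.0f} GiB")
```

Under these assumptions a single full-length sequence needs about 32 GiB of cache on top of the weights, which is one reason a serving setup might cap the context length below the 128K maximum.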
 ---
 ## Training Pipeline: Code-Flow
-InCoder-32B is trained through a three-stage **Code-Flow** pipeline:
 ### Stage 1 — Pre-training & Annealing
 - **Industrial Recall**: Data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
 - **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
-- **Training**: 15T total tokens using Autoregressive LM + Fill-in-the-Middle (FIM) objectives.
 ### Stage 2 — Mid-Training (Context Extension)
 Context window extended progressively from 8K to 128K tokens:
 - **8K → 32K**: Targets file-level tasks like completing RTL modules or kernel functions.
 - **32K → 128K**: Unlocks long-context capabilities for extended debugging and cross-module projects.
-### Stage 3 — Post-Training
-2.5M supervised fine-tuning (SFT) samples constructed from real industrial tasks with execution-grounded verification using toolchains like Icarus Verilog, `nvcc`, and Renode (STM32 simulator).
 ---
 ## Usage
@@ -109,48 +84,50 @@ Context window extended progressively from 8K to 128K tokens:
 pip install transformers accelerate
 ```
-### Basic Inference
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
-model_id = "Multilingual-Multimodal-NLP/IndustrialCoder"
-tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     torch_dtype=torch.bfloat16,
-    device_map="auto"
 )
-prompt = """Write a synthesizable Verilog module for a UART transmitter (8N1 protocol).
-The module should accept 8-bit parallel data and serialize it onto a TX line."""
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(
     **inputs,
-    max_new_tokens=1024,
     temperature=0.2,
     do_sample=True,
 )
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-### Deployment with vLLM
-For production deployment, you can use vLLM to create an OpenAI-compatible API endpoint.
-```
-vllm serve Multilingual-Multimodal-NLP/IndustrialCoder --tensor-parallel-size 8
-```
 ### Fill-in-the-Middle (FIM)
-InCoder-32B supports FIM completion for code infilling tasks:
 ```python
 prefix = """// CUDA kernel for RMS Normalization
-__global__ void rms_norm_kernel(float* output, const float* input,
                                  const float* weight, int N, float eps) {
     int idx = blockIdx.x;
 """
@@ -158,22 +135,48 @@ suffix = """
     output[idx * N + tid] = normalized * weight[tid];
 }"""
-fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
 inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ---
 ## Limitations & Disclaimers
-Based on failure analysis, the model may struggle with:
-- **API Knowledge**: Linker errors from undefined HAL/CMSIS functions in embedded C.
-- **Functional Semantics**: Producing compilable but functionally incorrect RTL under complex logic scenarios.
-- **Optimization**: Correct but sub-optimal GPU kernel performance.
-Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware) requires expert review before deployment.
 ---
@@ -182,10 +185,10 @@ Always review and test generated code in a sandboxed environment. Industrial cod
 ```bibtex
 @article{yang2026incoder,
   title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
-  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
-          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
           and others},
   journal={arXiv preprint arXiv:2603.16790},
   year={2026}
 }
-```

 tags:
 - code
 - industrial-code
+- pretrained
+- base-model
 - verilog
 - cuda
 - triton
 - cad
 ---
+# InCoder-32B-Base: Code Foundation Model for Industrial Scenarios
 <div align="center">
+[![HuggingFace](https://img.shields.io/badge/🤗-Model%20Hub-yellow)](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base)
 [![GitHub](https://img.shields.io/badge/GitHub-Industrial--Coder-blue)](https://github.com/CSJianYang/Industrial-Coder)
 [![arXiv](https://img.shields.io/badge/arXiv-2603.16790-red)](https://huggingface.co/papers/2603.16790)
 [![License](https://img.shields.io/badge/License-Apache%202.0-green)](LICENSE)
 ## Model Summary
+**InCoder-32B-Base** is the pre-trained base checkpoint of the InCoder-32B family, described as the first 32B-parameter code foundation model purpose-built for industrial code intelligence. Because this checkpoint is not instruction-tuned, it is suited to code completion, fill-in-the-middle (FIM), and further fine-tuning rather than chat-style prompting.
+For the instruction-tuned variant, see [IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder). For the reasoning variant, see [IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking).
 Presented in the paper [InCoder-32B: Code Foundation Model for Industrial Scenarios](https://huggingface.co/papers/2603.16790), InCoder-32B unifies code intelligence across five industrial domains:
 | 🔨 **Compiler Optimization** | x86-64 ASM, C/C++, LLVM-IR |
 | 📐 **3D Modeling / CAD** | CadQuery, OpenCascade, Python |
 ---
 ## Model Architecture
+InCoder-32B-Base adopts a standard decoder-only Transformer architecture:
 | Hyperparameter | Value |
 |---|---|
 | Parameters | ~32B |
 | Layers | 64 |
 | Hidden Size | 5,120 |
+| Attention Heads | 40 (8 KV heads, GQA) |
 | Max Context Length | 131,072 (128K) |
 | Positional Encoding | RoPE (θ = 500,000) |
 | Precision | BFloat16 |
+| Vocabulary Size | 76,800 |
 ---
 ## Training Pipeline: Code-Flow
+InCoder-32B-Base is trained through a two-stage **Code-Flow** pipeline:
 ### Stage 1 — Pre-training & Annealing
 - **Industrial Recall**: Data pipeline using rule-based filtering, FastText classifiers, and semantic retrieval for Verilog, CUDA, firmware C, and CadQuery.
 - **Refinement**: OCR extraction from technical manuals, multi-level deduplication, and repository-level fork consolidation.
+- **Training**: 15T total tokens using Autoregressive LM + Fill-in-the-Middle (FIM) objectives on 4,096 GPUs.
 ### Stage 2 — Mid-Training (Context Extension)
 Context window extended progressively from 8K to 128K tokens:
 - **8K → 32K**: Targets file-level tasks like completing RTL modules or kernel functions.
 - **32K → 128K**: Unlocks long-context capabilities for extended debugging and cross-module projects.
 ---
 ## Usage
 pip install transformers accelerate
 ```
+### Code Completion
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
+model_id = "Multilingual-Multimodal-NLP/IndustrialCoder-Base"
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     torch_dtype=torch.bfloat16,
+    device_map="auto",
+    trust_remote_code=True,
 )
+prompt = """// Synthesizable Verilog: UART transmitter (8N1 protocol)
+module uart_tx (
+    input wire clk,
+    input wire rst_n,
+    input wire [7:0] data_in,
+    input wire tx_start,
+    output reg tx,
+    output reg tx_busy
+);
+"""
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(
     **inputs,
+    max_new_tokens=512,
     temperature=0.2,
     do_sample=True,
 )
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 ### Fill-in-the-Middle (FIM)
+InCoder-32B-Base supports FIM completion for code infilling tasks:
 ```python
 prefix = """// CUDA kernel for RMS Normalization
+__global__ void rms_norm_kernel(float* output, const float* input,
                                  const float* weight, int N, float eps) {
     int idx = blockIdx.x;
 """
     output[idx * N + tid] = normalized * weight[tid];
 }"""
+fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
 inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
 outputs = model.generate(**inputs, max_new_tokens=256)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
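
The FIM example above prints the whole decoded sequence, prompt included. A minimal sketch of helpers that keep the pieces separate is shown here. The sentinel strings are copied from the example. The function names are hypothetical and not part of the model's or the tokenizer's API. The slicing helper relies on the usual behavior of decoder-only `generate`, where the returned ids start with the prompt ids.

```python
# Sentinel strings copied from the FIM prompt in the example above.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix in the prefix-suffix-middle order used above."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def new_token_ids(output_ids, prompt_length: int):
    """Drop the echoed prompt so only the generated middle is decoded.

    Works on a plain list or on a 1-D tensor such as outputs[0].
    """
    return output_ids[prompt_length:]


def splice_completion(prefix: str, middle: str, suffix: str) -> str:
    """Rebuild the full source text once the middle has been decoded."""
    return prefix + middle + suffix


# Toy demonstration with plain strings and integer ids (no model required).
prompt = build_fim_prompt("int idx = blockIdx.x;\n", "\n}")
generated = new_token_ids([11, 12, 13, 21, 22], prompt_length=3)
print(prompt)
print(generated)
```

With the real model, `prompt_length` would be `inputs["input_ids"].shape[1]`, and the sliced ids would be passed to `tokenizer.decode` in place of `outputs[0]`.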
+### Deployment with vLLM
+```bash
+vllm serve Multilingual-Multimodal-NLP/IndustrialCoder-Base \
+    --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code
+```
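
Once the server is running, it exposes an OpenAI-compatible HTTP API. The sketch below builds a plain completion request with the standard library. It assumes the server is reachable at `http://localhost:8000` (the vLLM default when no host or port is given) and uses the `/v1/completions` route, since a base model is prompted with raw text rather than chat messages. The helper names and the sampling values are illustrative, not recommendations from the authors.

```python
import json
import urllib.request

MODEL_ID = "Multilingual-Multimodal-NLP/IndustrialCoder-Base"


def build_completion_request(prompt: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {
        "model": MODEL_ID,        # must match the name passed to `vllm serve`
        "prompt": prompt,         # raw text prompt; no chat template for a base model
        "max_tokens": max_tokens,
        "temperature": 0.2,       # same value as the Transformers example
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


request = build_completion_request("// Verilog: 4-bit synchronous counter\nmodule counter (\n")
print(request.full_url)
```

Calling `send(request)` requires the server from the command above to be running. It is left out of the sketch so the block can be run without a deployment.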
+---
+## Fine-tuning
+We provide an SFT framework in the [GitHub repository](https://github.com/CSJianYang/Industrial-Coder/tree/main/sft). See the README for data preparation and training instructions.
+---
+## Model Family
+| Model | Type | HuggingFace |
+|---|---|---|
+| InCoder-32B-Base | Pre-trained | [🤗 IndustrialCoder-Base](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Base) |
+| InCoder-32B | Instruct | [🤗 IndustrialCoder](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder) |
+| InCoder-32B-Thinking | Reasoning | [🤗 IndustrialCoder-Thinking](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-Thinking) |
+| InCoder-32B-FP8 | FP8 Quantized | [🤗 IndustrialCoder-32B-FP8](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-FP8) |
+| InCoder-32B-AWQ-INT4 | AWQ INT4 | [🤗 IndustrialCoder-32B-AWQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-AWQ-INT4) |
+| InCoder-32B-GPTQ-INT4 | GPTQ INT4 | [🤗 IndustrialCoder-32B-GPTQ-INT4](https://huggingface.co/Multilingual-Multimodal-NLP/IndustrialCoder-32B-GPTQ-INT4) |
 ---
 ## Limitations & Disclaimers
+This is a **base model** — it has not been instruction-tuned and does not follow conversational instructions. It is best suited for:
+- Code completion and generation
+- Fill-in-the-middle (FIM) tasks
+- Further fine-tuning for downstream applications
+Always review and test generated code in a sandboxed environment. Industrial code (RTL, embedded firmware, GPU kernels) requires expert review before deployment.
 ---
 ```bibtex
 @article{yang2026incoder,
   title={InCoder-32B: Code Foundation Model for Industrial Scenarios},
+  author={Yang, Jian and Zhang, Wei and Wu, Jiajun and Cheng, Junhang and Guo, Shawn
+          and Wang, Haowen and Gu, Weicheng and Du, Yaxin and Li, Joseph and Xu, Fanglin
           and others},
   journal={arXiv preprint arXiv:2603.16790},
   year={2026}
 }
+```