| --- |
| |
| --- |
| language: |
| - code |
| tags: |
| - code |
| - programming-language |
| - classification |
| - bert |
| - text-classification |
| license: apache-2.0 |
| datasets: |
| - kaushik-harsh-99/Code-Language-Classification |
| metrics: |
| - accuracy |
| - f1 |
| - precision |
| - recall |
| model-index: |
| - name: code-lang-bert-small |
| results: |
| - task: |
| type: text-classification |
| name: Programming Language Identification |
| dataset: |
| type: kaushik-harsh-99/Code-Language-Classification |
| name: Code Language Classification |
| split: test |
| metrics: |
| - type: accuracy |
| value: 0.9663 |
| - type: f1 (macro) |
| value: 0.9662 |
| - type: f1 (weighted) |
| value: 0.9662 |
| - type: precision (macro) |
| value: 0.9663 |
| - type: recall (macro) |
| value: 0.9663 |
| --- |
| |
| # Model Card for code-lang-bert-small |
|
|
| A fine-tuned BERT-small model for identifying programming languages from code snippets. The model classifies raw source code into one of 16 supported languages, reaching 96.63% accuracy on the held-out test set. |
|
|
| ## Model Details |
|
|
| ### Model Description |
|
|
| This model is a fine-tuned version of `prajjwal1/bert-small` (29M parameters) designed for the task of programming language identification. By analyzing the syntax, keywords, and structural patterns of source code, it accurately predicts the programming language of a given snippet. |
|
|
| - **Developed by:** Pankaj8922 |
| - **Model type:** Encoder-only Transformer (BERT-small) for sequence classification |
| - **Language(s):** 16 programming and markup languages (see below) |
| - **License:** Apache 2.0 |
| - **Finetuned from model:** [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small) |
|
|
| ### Supported Languages |
|
|
| Rust, Java, Dart, Python, Go, HTML, JavaScript, TypeScript, C, CSS, C#, Markdown, Assembly, Lua, C++, Kotlin |
|
|
| ## Uses |
|
|
| ### Direct Use |
|
|
| The model is intended for classifying code snippets. It can be used directly with the Hugging Face `pipeline` API or integrated into applications for code tagging, automated documentation, or content filtering. |
|
|
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="Pankaj8922/code-lang-bert-small",
)

# The snippet is passed to the model as raw text; it is never executed.
code_snippet = """
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    mid = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quicksort(left) + mid + quicksort(right)
"""

result = classifier(code_snippet)
print(result)
# Example output format: [{'label': 'Python', 'score': 0.99}]
```
|
|
| ### Out-of-Scope Use |
|
|
| The model is trained to classify full files or substantial code snippets. It may not perform well on: |
| - Very short, ambiguous one-liners. |
| - Heavily obfuscated or minified code. |
| - Code containing multiple languages (e.g., a Python file with extensive embedded SQL). |
| - Languages not present in the 16 supported classes. |
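
One way to guard against these cases is to reject low-confidence or near-tied predictions instead of trusting the top label. The sketch below is a hedged illustration, not part of the model's repository. It assumes predictions in the list-of-dicts shape that the `pipeline` API returns when several labels are requested. The helper name `pick_language` and the `min_score` and `min_margin` thresholds are hypothetical and would need tuning on validation data.

```python
from typing import List


def pick_language(
    predictions: List[dict],
    min_score: float = 0.80,   # hypothetical confidence floor
    min_margin: float = 0.20,  # hypothetical gap required over the runner-up
) -> str:
    """Return the top label, or "unknown" when the prediction looks unreliable."""
    if not predictions:
        return "unknown"

    # Sort by score so the function does not depend on the input order.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    top = ranked[0]
    runner_up_score = ranked[1]["score"] if len(ranked) > 1 else 0.0

    # Reject predictions that are weak in absolute terms.
    if top["score"] < min_score:
        return "unknown"
    # Reject near-ties, e.g. between JavaScript and TypeScript.
    if top["score"] - runner_up_score < min_margin:
        return "unknown"
    return top["label"]


# Hand-written scores for illustration only; they are not model outputs.
confident = [{"label": "Python", "score": 0.99}, {"label": "Lua", "score": 0.004}]
ambiguous = [{"label": "JavaScript", "score": 0.52}, {"label": "TypeScript", "score": 0.46}]

print(pick_language(confident))  # Python
print(pick_language(ambiguous))  # unknown
```

Returning an explicit `"unknown"` lets downstream code route short, mixed-language, or unsupported inputs to a fallback instead of silently mislabeling them.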
|
|
| ## Bias, Risks, and Limitations |
|
|
| The model may exhibit biases present in the training data distribution. Languages with syntactically similar constructs (e.g., C and C++, JavaScript and TypeScript) are the most common sources of confusion, as reflected in the confusion matrix. Performance on code from very niche or domain-specific libraries may be lower. |
|
|
| ## Training Details |
|
|
| ### Training Data |
|
|
| The model was trained on the [Code-Language-Classification](https://huggingface.co/datasets/kaushik-harsh-99/Code-Language-Classification) dataset. The official `train`, `validation`, and `test` splits were used. |
| - **Train samples:** 1,600,000 |
| - **Validation samples:** 32,000 |
| - **Test samples:** 32,000 |
| - **Classes:** 16 (perfectly balanced; 2,000 samples per class in the test set) |
|
|
| ### Training Procedure |
|
|
| The BERT-small model was fine-tuned on 2 x T4 GPUs with dynamic padding for efficiency. Training was configured for 5 epochs with early stopping, but it was stopped manually after 4 epochs because the model had already converged. |
|
|
| - **Batch size:** 256 (128 per device x 2 GPUs) |
| - **Learning rate:** 3e-5 |
| - **Optimizer:** AdamW (weight decay: 0.01) |
| - **Max sequence length:** 512 tokens |
| - **Early stopping patience:** 2 epochs |
| - **Checkpointing:** Best model based on validation accuracy saved to the Hub. |
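
The settings above imply a rough optimizer step count. The following back-of-the-envelope sketch assumes one optimizer step per global batch (no gradient accumulation, which this card does not mention) and that each epoch covers all 1,600,000 training samples.

```python
import math

train_samples = 1_600_000
per_device_batch = 128
num_gpus = 2
epochs_run = 4  # training was stopped after 4 of the 5 configured epochs

# Global batch size across both GPUs.
global_batch = per_device_batch * num_gpus

# Assumption: one optimizer step per global batch, no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / global_batch)
total_steps = steps_per_epoch * epochs_run

print(global_batch)     # 256
print(steps_per_epoch)  # 6250
print(total_steps)      # 25000
```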
|
|
| ## Evaluation |
|
|
| The evaluation was performed on the held-out test set of 32,000 samples using the official script provided in the repository. |
|
|
| ### Testing Metrics |
|
|
| | Metric | Value | |
| |------------------|----------| |
| | Accuracy | 96.63% | |
| | Macro F1 | 96.62% | |
| | Weighted F1 | 96.62% | |
| | Macro Precision | 96.63% | |
| | Macro Recall | 96.63% | |
| | Eval Loss | 0.1147 | |
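
Macro F1 and weighted F1 coincide in this table, and accuracy equals macro recall, because the test set is perfectly balanced: weighting every class by the same support is equivalent to taking a plain mean. A minimal sketch with a made-up 3-class confusion matrix (toy numbers, not this model's results) illustrates the relationship:

```python
# Toy confusion matrix: rows are true classes, columns are predicted classes.
# Every row sums to 10, so the classes are balanced. Values are illustrative only.
confusion = [
    [9, 1, 0],
    [2, 7, 1],
    [0, 1, 9],
]

n = len(confusion)
support = [sum(row) for row in confusion]  # true samples per class
total = sum(support)

precision, recall, f1 = [], [], []
for k in range(n):
    tp = confusion[k][k]
    predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
    p = tp / predicted_k
    r = tp / support[k]
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

accuracy = sum(confusion[k][k] for k in range(n)) / total
macro_recall = sum(recall) / n
macro_f1 = sum(f1) / n
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

# With equal support per class, both pairs agree up to floating-point error.
print(abs(macro_f1 - weighted_f1) < 1e-12)   # True
print(abs(accuracy - macro_recall) < 1e-12)  # True
```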
|
|
| ### Per-Class Performance |
|
|
| | Language | Precision | Recall | F1-Score | |
| |------------|-----------|--------|----------| |
| | Rust | 0.9885 | 0.9925 | 0.9905 | |
| | Java | 0.9731 | 0.9785 | 0.9758 | |
| | Dart | 0.9772 | 0.9850 | 0.9811 | |
| | Python | 0.9890 | 0.9880 | 0.9885 | |
| | Go | 0.9859 | 0.9800 | 0.9829 | |
| | HTML | 0.9279 | 0.8885 | 0.9078 | |
| | JavaScript | 0.8859 | 0.8930 | 0.8894 | |
| | TypeScript | 0.9466 | 0.9580 | 0.9523 | |
| | C | 0.9566 | 0.9375 | 0.9470 | |
| | CSS | 0.9728 | 0.9845 | 0.9786 | |
| | C# | 0.9895 | 0.9870 | 0.9882 | |
| | Markdown | 0.9671 | 0.9695 | 0.9683 | |
| | Assembly | 0.9935 | 0.9945 | 0.9940 | |
| | Lua | 0.9885 | 0.9915 | 0.9900 | |
| | C++ | 0.9770 | 0.9760 | 0.9765 | |
| | Kotlin | 0.9840 | 0.9870 | 0.9855 | |
|
|
| ### Key Observations |
| - The model performs exceptionally well on most languages, with 11 of 16 classes achieving an F1-score of 97% or higher. |
| - **JavaScript** (F1: 0.89) and **HTML** (F1: 0.91) are the most challenging classes, commonly confused with each other and with TypeScript/CSS. |
| - The model most reliably identifies structurally distinctive languages such as **Assembly** (F1: 0.994) and **Python** (F1: 0.989). |
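
The counts in these observations can be re-derived from the per-class table. The short sketch below copies the F1 column into a dictionary (the variable names are illustrative) and checks how many classes reach 0.97 and which classes score lowest and highest:

```python
# F1 scores copied from the Per-Class Performance table.
f1_by_language = {
    "Rust": 0.9905, "Java": 0.9758, "Dart": 0.9811, "Python": 0.9885,
    "Go": 0.9829, "HTML": 0.9078, "JavaScript": 0.8894, "TypeScript": 0.9523,
    "C": 0.9470, "CSS": 0.9786, "C#": 0.9882, "Markdown": 0.9683,
    "Assembly": 0.9940, "Lua": 0.9900, "C++": 0.9765, "Kotlin": 0.9855,
}

# Classes at or above the 97% threshold.
at_least_97 = [lang for lang, f1 in f1_by_language.items() if f1 >= 0.97]

# Two lowest-scoring classes and the single highest-scoring class.
weakest = sorted(f1_by_language, key=f1_by_language.get)[:2]
strongest = max(f1_by_language, key=f1_by_language.get)

print(len(at_least_97))  # 11
print(weakest)           # ['JavaScript', 'HTML']
print(strongest)         # Assembly
```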
|
|
| ## Environmental Impact |
|
|
| - **Hardware Type:** 2 x NVIDIA T4 GPUs |
| - **Hours used:** Not specified (training ran for approximately 4 epochs) |
| - **Cloud Provider:** Not specified |
| - **Compute Region:** Not specified |
|
|
| *Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).* |