+---
+language:
+- code
+tags:
+- code
+- programming-language
+- classification
+- bert
+- text-classification
+license: apache-2.0
+datasets:
+- kaushik-harsh-99/Code-Language-Classification
+metrics:
+- accuracy
+- f1
+- precision
+- recall
+model-index:
+- name: code-lang-bert-small
+  results:
+  - task:
+      type: text-classification
+      name: Programming Language Identification
+    dataset:
+      type: kaushik-harsh-99/Code-Language-Classification
+      name: Code Language Classification
+      split: test
+    metrics:
+    - type: accuracy
+      name: Accuracy
+      value: 0.9663
+    - type: f1
+      name: Macro F1
+      value: 0.9662
+    - type: f1
+      name: Weighted F1
+      value: 0.9662
+    - type: precision
+      name: Macro Precision
+      value: 0.9663
+    - type: recall
+      name: Macro Recall
+      value: 0.9663
+---
+# Model Card for code-lang-bert-small
+A fine-tuned BERT-small model that identifies the programming language of a code snippet. It classifies raw source code into one of 16 supported languages and reaches 96.63% accuracy on the held-out test set.
+## Model Details
+### Model Description
+This model is a fine-tuned version of `prajjwal1/bert-small` (29M parameters) for programming language identification. Given a snippet, it predicts the language from cues such as syntax, keywords, and structural patterns in the source code.
+- **Developed by:** Pankaj8922
+- **Model type:** Encoder-only Transformer (BERT-small) for sequence classification
+- **Language(s):** 16 programming and markup languages (see below)
+- **License:** Apache 2.0
+- **Finetuned from model:** [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small)
+### Supported Languages
+Rust, Java, Dart, Python, Go, HTML, JavaScript, TypeScript, C, CSS, C#, Markdown, Assembly, Lua, C++, Kotlin
+## Uses
+### Direct Use
+The model is intended for classifying code snippets. It can be used directly with the Hugging Face `pipeline` API or integrated into applications for code tagging, automated documentation, or content filtering.
+```python
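+from transformers import pipeline
+# Load the fine-tuned classifier from the Hub
+classifier = pipeline(
+    "text-classification",
+    model="Pankaj8922/code-lang-bert-small"
+)
+# The snippet is passed to the model as plain text; it is never executed
+code_snippet = """
+def quicksort(arr):
+    if len(arr) <= 1:
+        return arr
+    pivot = arr[len(arr) // 2]
+    left = [x for x in arr if x < pivot]
+    mid = [x for x in arr if x == pivot]
+    right = [x for x in arr if x > pivot]
+    return quicksort(left) + mid + quicksort(right)
+"""
+result = classifier(code_snippet)
+print(result)
+# Example output (the exact score will vary):
+# [{'label': 'Python', 'score': 0.99}]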
+```
+### Out-of-Scope Use
+The model is trained to classify full files or substantial code snippets. It may not perform well on:
+- Very short, ambiguous one-liners.
+- Heavily obfuscated or minified code.
+- Code containing multiple languages (e.g., a Python file with extensive embedded SQL).
+- Languages not present in the 16 supported classes.
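One way to guard against these cases is to discard predictions for very short inputs or with low scores before using the label. The sketch below is a minimal illustration and not part of this repository: `guard_prediction`, `min_chars`, and `min_score` are hypothetical names with arbitrary thresholds, and the input is assumed to have the `[{'label': ..., 'score': ...}]` shape returned by the `pipeline` call above.

```python
from typing import Dict, List, Optional


def guard_prediction(
    snippet: str,
    predictions: List[Dict],
    min_chars: int = 20,      # hypothetical cutoff for "too short to classify"
    min_score: float = 0.80,  # hypothetical confidence cutoff
) -> Optional[str]:
    """Return the top label, or None when the prediction should not be trusted."""
    # Very short one-liners are listed as out of scope, so skip them entirely.
    if len(snippet.strip()) < min_chars:
        return None
    if not predictions:
        return None
    # Take the highest-scoring candidate from the pipeline-style output.
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return None
    return best["label"]


example = "def f(a, b):\n    return a + b\n"
print(guard_prediction(example, [{"label": "Python", "score": 0.99}]))
print(guard_prediction("x = 1", [{"label": "Python", "score": 0.99}]))
```

A score threshold is only a heuristic: a classifier restricted to 16 classes can still assign a high score to code in an unsupported language.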
+## Bias, Risks, and Limitations
+The model may reflect biases in the training data distribution. Syntactically similar languages (e.g., C and C++, JavaScript and TypeScript) are likely sources of confusion; in the per-class results, JavaScript, HTML, C, and TypeScript have the lowest F1-scores. Performance on code that relies on niche or domain-specific libraries may be lower.
+## Training Details
+### Training Data
+The model was trained on the [Code-Language-Classification](https://huggingface.co/datasets/kaushik-harsh-99/Code-Language-Classification) dataset. The official `train`, `validation`, and `test` splits were used.
+- **Train samples:** 1,600,000
+- **Validation samples:** 32,000
+- **Test samples:** 32,000
+- **Classes:** 16 (perfectly balanced; 2,000 samples per class in the test set)
+### Training Procedure
+The BERT-small model was fine-tuned on 2 x T4 GPUs with dynamic padding for efficiency. Training was configured for 5 epochs with early stopping but was stopped manually after 4 epochs because the model had already converged.
+- **Batch size:** 256 (128 per device x 2 GPUs)
+- **Learning rate:** 3e-5
+- **Optimizer:** AdamW (weight decay: 0.01)
+- **Max sequence length:** 512 tokens
+- **Early stopping patience:** 2 epochs
+- **Checkpointing:** Best model based on validation accuracy saved to the Hub.
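As a rough check on the scale of this run, the figures above imply the following number of optimizer steps. This is a back-of-the-envelope sketch that assumes no gradient accumulation and that the final partial batch is kept; neither is stated in this card.

```python
import math

TRAIN_SAMPLES = 1_600_000  # from the Training Data section
PER_DEVICE_BATCH = 128
NUM_GPUS = 2
EPOCHS_RUN = 4             # training was stopped manually after 4 epochs

# Effective batch size across both GPUs (assumes no gradient accumulation).
effective_batch = PER_DEVICE_BATCH * NUM_GPUS

# Optimizer steps per epoch, rounding up for a final partial batch.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS_RUN

print(effective_batch, steps_per_epoch, total_steps)
# 256 6250 25000
```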
+## Evaluation
+Evaluation was performed on the held-out test set of 32,000 samples using the evaluation script provided in the repository.
+### Testing Metrics
+| Metric           | Value    |
+|------------------|----------|
+| Accuracy         | 96.63%   |
+| Macro F1         | 96.62%   |
+| Weighted F1      | 96.62%   |
+| Macro Precision  | 96.63%   |
+| Macro Recall     | 96.63%   |
+| Eval Loss        | 0.1147   |
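Macro F1 and Weighted F1 coincide in the table above because the test set is balanced: when every class has the same number of samples, a support-weighted average reduces to a plain mean. The sketch below illustrates this with made-up per-class F1 values and supports; they are not taken from this model's results.

```python
import math


def macro_average(scores):
    """Unweighted mean over classes."""
    return sum(scores) / len(scores)


def weighted_average(scores, supports):
    """Mean over classes, weighted by the number of samples per class."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


f1_scores = [0.99, 0.89, 0.95]  # hypothetical per-class F1 values
balanced = [2000, 2000, 2000]   # equal support, as in this card's test set
imbalanced = [2000, 500, 500]   # hypothetical unequal support

macro = macro_average(f1_scores)
weighted_balanced = weighted_average(f1_scores, balanced)
weighted_imbalanced = weighted_average(f1_scores, imbalanced)

print(math.isclose(macro, weighted_balanced))    # True
print(math.isclose(macro, weighted_imbalanced))  # False
```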
+### Per-Class Performance
+| Language   | Precision | Recall | F1-Score |
+|------------|-----------|--------|----------|
+| Rust       | 0.9885    | 0.9925 | 0.9905   |
+| Java       | 0.9731    | 0.9785 | 0.9758   |
+| Dart       | 0.9772    | 0.9850 | 0.9811   |
+| Python     | 0.9890    | 0.9880 | 0.9885   |
+| Go         | 0.9859    | 0.9800 | 0.9829   |
+| HTML       | 0.9279    | 0.8885 | 0.9078   |
+| JavaScript | 0.8859    | 0.8930 | 0.8894   |
+| TypeScript | 0.9466    | 0.9580 | 0.9523   |
+| C          | 0.9566    | 0.9375 | 0.9470   |
+| CSS        | 0.9728    | 0.9845 | 0.9786   |
+| C#         | 0.9895    | 0.9870 | 0.9882   |
+| Markdown   | 0.9671    | 0.9695 | 0.9683   |
+| Assembly   | 0.9935    | 0.9945 | 0.9940   |
+| Lua        | 0.9885    | 0.9915 | 0.9900   |
+| C++        | 0.9770    | 0.9760 | 0.9765   |
+| Kotlin     | 0.9840    | 0.9870 | 0.9855   |
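Each F1-score in the table is the harmonic mean of that row's precision and recall. The short sketch below recomputes two rows as a sanity check; `f1_from_pr` is a local helper written for illustration, not a function from the repository or from scikit-learn.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above
rows = {
    "JavaScript": (0.8859, 0.8930),
    "HTML": (0.9279, 0.8885),
}

for language, (p, r) in rows.items():
    print(f"{language}: {f1_from_pr(p, r):.4f}")
# JavaScript: 0.8894
# HTML: 0.9078
```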
+### Key Observations
+- The model performs well on most languages: 11 of 16 classes reach an F1-score of 0.97 or higher.
+- **JavaScript** (F1: 0.89) and **HTML** (F1: 0.91) are the most challenging classes, commonly confused with each other and with TypeScript/CSS.
+- The model is most accurate on structurally distinctive languages such as **Assembly** (F1: 0.994) and **Python** (F1: 0.989).
+## Environmental Impact
+- **Hardware Type:** 2 x NVIDIA T4 GPUs
+- **Hours used:** Not specified (approx. 4 epochs of training)
+- **Cloud Provider:** Not specified
+- **Compute Region:** Not specified
+*Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).*