Pankaj8922 committed on
Commit ffada80 · verified · 1 Parent(s): b5098de

Create README.md

Files changed (1)
  1. README.md +171 -0
README.md ADDED
@@ -0,0 +1,171 @@
1
+ ---
4
+ language:
5
+ - code
6
+ tags:
7
+ - code
8
+ - programming-language
9
+ - classification
10
+ - bert
11
+ - text-classification
12
+ license: apache-2.0
13
+ datasets:
14
+ - kaushik-harsh-99/Code-Language-Classification
15
+ metrics:
16
+ - accuracy
17
+ - f1
18
+ - precision
19
+ - recall
20
+ model-index:
21
+ - name: code-lang-bert-small
22
+   results:
23
+   - task:
24
+       type: text-classification
25
+       name: Programming Language Identification
26
+     dataset:
27
+       type: kaushik-harsh-99/Code-Language-Classification
28
+       name: Code Language Classification
29
+       split: test
30
+     metrics:
31
+     - type: accuracy
32
+       value: 0.9663
33
+     - type: f1 (macro)
34
+       value: 0.9662
35
+     - type: f1 (weighted)
36
+       value: 0.9662
37
+     - type: precision (macro)
38
+       value: 0.9663
39
+     - type: recall (macro)
40
+       value: 0.9663
41
+ ---
42
+
43
+ # Model Card for code-lang-bert-small
44
+
45
+ A fine-tuned BERT-small model for identifying programming languages from code snippets. The model classifies raw source code into one of 16 supported languages, reaching 96.63% accuracy on the held-out test set.
46
+
47
+ ## Model Details
48
+
49
+ ### Model Description
50
+
51
+ This model is a fine-tuned version of `prajjwal1/bert-small` (29M parameters) designed for the task of programming language identification. By analyzing the syntax, keywords, and structural patterns of source code, it accurately predicts the programming language of a given snippet.
52
+
53
+ - **Developed by:** Pankaj8922
54
+ - **Model type:** Encoder-only Transformer (BERT-small) for sequence classification
55
+ - **Language(s):** 16 programming and markup languages (see below)
56
+ - **License:** Apache 2.0
57
+ - **Finetuned from model:** [prajjwal1/bert-small](https://huggingface.co/prajjwal1/bert-small)
58
+
59
+ ### Supported Languages
60
+
61
+ Rust, Java, Dart, Python, Go, HTML, JavaScript, TypeScript, C, CSS, C#, Markdown, Assembly, Lua, C++, Kotlin
62
+
63
+ ## Uses
64
+
65
+ ### Direct Use
66
+
67
+ The model is intended for classifying code snippets. It can be used directly with the Hugging Face `pipeline` API or integrated into applications for code tagging, automated documentation, or content filtering.
68
+
69
+ ```python
70
+ from transformers import pipeline
71
+
72
+ classifier = pipeline(
73
+     "text-classification",
74
+     model="Pankaj8922/code-lang-bert-small"
75
+ )
76
+
77
+ code_snippet = """
78
+ def quicksort(arr):
79
+     if len(arr) <= 1:
80
+         return arr
81
+     pivot = arr[len(arr) // 2]
82
+     return quicksort(left) + mid + quicksort(right)
83
+ """
84
+
85
+ result = classifier(code_snippet)
86
+ print(result)
87
+ # Example output (score is illustrative): [{'label': 'Python', 'score': 0.99}]
88
+ ```
89
+
90
+ ### Out-of-Scope Use
91
+
92
+ The model is trained to classify full files or substantial code snippets. It may not perform well on:
93
+ - Very short, ambiguous one-liners.
94
+ - Heavily obfuscated or minified code.
95
+ - Code containing multiple languages (e.g., a Python file with extensive embedded SQL).
96
+ - Languages not present in the 16 supported classes.
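
One way an application could guard against these cases is to reject very short or low-confidence inputs rather than trust the top label. The sketch below post-processes predictions in the list-of-dicts shape returned by the `pipeline` call above. The helper name `label_or_unknown`, the 0.80 score threshold, and the 20-character minimum are illustrative assumptions, not values derived from this model's evaluation.

```python
from typing import Dict, List, Union


def label_or_unknown(
    snippet: str,
    predictions: List[Dict[str, Union[str, float]]],
    min_score: float = 0.80,  # assumed threshold; tune on your own data
    min_chars: int = 20,      # assumed minimum snippet length
) -> str:
    """Return the top label, or "unknown" for short or low-confidence inputs."""
    if len(snippet.strip()) < min_chars or not predictions:
        return "unknown"
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] if top["score"] >= min_score else "unknown"


# Hand-written predictions in the pipeline's output format (not real model output).
print(label_or_unknown('fn main() { println!("hi"); }', [{"label": "Rust", "score": 0.97}]))
print(label_or_unknown("x = 1", [{"label": "Python", "score": 0.99}]))
```

The first call returns `Rust`; the second returns `unknown` because the snippet is shorter than the assumed minimum length, regardless of the score.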
97
+
98
+ ## Bias, Risks, and Limitations
99
+
100
+ The model may exhibit biases present in the training data distribution. Languages with syntactically similar constructs (e.g., C and C++, JavaScript and TypeScript) are the most common sources of confusion, as reflected in the confusion matrix. Performance on code from very niche or domain-specific libraries may be lower.
101
+
102
+ ## Training Details
103
+
104
+ ### Training Data
105
+
106
+ The model was trained on the [Code-Language-Classification](https://huggingface.co/datasets/kaushik-harsh-99/Code-Language-Classification) dataset. The official `train`, `validation`, and `test` splits were used.
107
+ - **Train samples:** 1,600,000
108
+ - **Validation samples:** 32,000
109
+ - **Test samples:** 32,000
110
+ - **Classes:** 16 (perfectly balanced; 2,000 samples per class in the test set)
111
+
112
+ ### Training Procedure
113
+
114
+ The BERT-small model was fine-tuned on 2 x T4 GPUs with dynamic padding for efficiency. Training was configured for 5 epochs with early stopping, but was manually stopped after 4 epochs as the model had already converged.
115
+
116
+ - **Batch size:** 256 (128 per device x 2 GPUs)
117
+ - **Learning rate:** 3e-5
118
+ - **Optimizer:** AdamW (weight decay: 0.01)
119
+ - **Max sequence length:** 512 tokens
120
+ - **Early stopping patience:** 2 epochs
121
+ - **Checkpointing:** Best model based on validation accuracy saved to the Hub.
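
For a rough sense of training length, the figures above imply the number of optimizer steps per epoch. The arithmetic below assumes no gradient accumulation and that any final partial batch is kept; neither detail is stated in this card.

```python
import math

train_samples = 1_600_000   # from the Training Data section
per_device_batch = 128
num_gpus = 2

# Effective batch size across both GPUs.
effective_batch = per_device_batch * num_gpus
# Optimizer steps per epoch, assuming no gradient accumulation.
steps_per_epoch = math.ceil(train_samples / effective_batch)
# Training was stopped after 4 epochs.
steps_for_4_epochs = 4 * steps_per_epoch

print(effective_batch, steps_per_epoch, steps_for_4_epochs)
# 256 6250 25000
```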
122
+
123
+ ## Evaluation
124
+
125
+ The evaluation was performed on the held-out test set of 32,000 samples using the official script provided in the repository.
126
+
127
+ ### Testing Metrics
128
+
129
+ | Metric | Value |
130
+ |------------------|----------|
131
+ | Accuracy | 96.63% |
132
+ | Macro F1 | 96.62% |
133
+ | Weighted F1 | 96.62% |
134
+ | Macro Precision | 96.63% |
135
+ | Macro Recall | 96.63% |
136
+ | Eval Loss | 0.1147 |
137
+
138
+ ### Per-Class Performance
139
+
140
+ | Language | Precision | Recall | F1-Score |
141
+ |------------|-----------|--------|----------|
142
+ | Rust | 0.9885 | 0.9925 | 0.9905 |
143
+ | Java | 0.9731 | 0.9785 | 0.9758 |
144
+ | Dart | 0.9772 | 0.9850 | 0.9811 |
145
+ | Python | 0.9890 | 0.9880 | 0.9885 |
146
+ | Go | 0.9859 | 0.9800 | 0.9829 |
147
+ | HTML | 0.9279 | 0.8885 | 0.9078 |
148
+ | JavaScript | 0.8859 | 0.8930 | 0.8894 |
149
+ | TypeScript | 0.9466 | 0.9580 | 0.9523 |
150
+ | C | 0.9566 | 0.9375 | 0.9470 |
151
+ | CSS | 0.9728 | 0.9845 | 0.9786 |
152
+ | C# | 0.9895 | 0.9870 | 0.9882 |
153
+ | Markdown | 0.9671 | 0.9695 | 0.9683 |
154
+ | Assembly | 0.9935 | 0.9945 | 0.9940 |
155
+ | Lua | 0.9885 | 0.9915 | 0.9900 |
156
+ | C++ | 0.9770 | 0.9760 | 0.9765 |
157
+ | Kotlin | 0.9840 | 0.9870 | 0.9855 |
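
Each F1-Score in the table is the harmonic mean of that row's precision and recall, F1 = 2 * P * R / (P + R). The short check below recomputes three rows from their precision and recall; the helper name `f1_from_precision_recall` is ours and is not part of the repository's evaluation script.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the per-class table.
rows = {
    "Rust": (0.9885, 0.9925),
    "HTML": (0.9279, 0.8885),
    "Assembly": (0.9935, 0.9945),
}

for language, (p, r) in rows.items():
    print(f"{language}: {f1_from_precision_recall(p, r):.4f}")
# Rust: 0.9905
# HTML: 0.9078
# Assembly: 0.9940
```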
158
+
159
+ ### Key Observations
160
+ - The model performs exceptionally well on most languages, with 11 of 16 classes achieving an F1-score of 97% or higher.
161
+ - **JavaScript** (F1: 0.89) and **HTML** (F1: 0.91) are the most challenging classes, commonly confused with each other and with TypeScript/CSS.
162
+ - The model is highly accurate on structurally distinctive languages like **Assembly** (F1: 0.994) and **Python** (F1: 0.989).
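
The first two observations can be checked directly against the F1-Score column of the per-class table. This is only a tally of the published numbers, not a re-evaluation of the model.

```python
# F1-scores copied from the per-class table.
f1_by_language = {
    "Rust": 0.9905, "Java": 0.9758, "Dart": 0.9811, "Python": 0.9885,
    "Go": 0.9829, "HTML": 0.9078, "JavaScript": 0.8894, "TypeScript": 0.9523,
    "C": 0.9470, "CSS": 0.9786, "C#": 0.9882, "Markdown": 0.9683,
    "Assembly": 0.9940, "Lua": 0.9900, "C++": 0.9765, "Kotlin": 0.9855,
}

# Classes at or above the 97% F1 mark.
high = [lang for lang, f1 in f1_by_language.items() if f1 >= 0.97]
# The two hardest classes, lowest F1 first.
lowest_two = sorted(f1_by_language, key=f1_by_language.get)[:2]

print(len(high), lowest_two)
# 11 ['JavaScript', 'HTML']
```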
163
+
164
+ ## Environmental Impact
165
+
166
+ - **Hardware Type:** 2 x NVIDIA T4 GPUs
167
+ - **Hours used:** Not recorded (training ran for approximately 4 epochs)
168
+ - **Cloud Provider:** Not specified
169
+ - **Compute Region:** Not specified
170
+
171
+ *Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).*