MatthewsO3 committed · Commit 4805afc · verified · 1 Parent(s): b8dc83b

Update README.md

Files changed (1)
  1. README.md +213 -2
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  datasets:
- - codeparrot/github-code
+ - codeparrot/github-code-clean
  language:
  - en
  metrics:
@@ -17,4 +17,215 @@ tags:
  - erlang
  - graphcodebert
  - mlm
- ---
+ ---

# GraphCodeBERT Fine-tuned on C++ and Erlang (MLM)

A fine-tuned version of [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) for Masked Language Modeling (MLM) on C++ and Erlang source code. The model is additionally evaluated on Python, Java, and JavaScript to assess cross-lingual transfer.

## Model Details

### Model Description

This model extends GraphCodeBERT's pre-training with continued MLM training on C++ and Erlang code. GraphCodeBERT incorporates data flow graphs (DFGs) alongside token sequences, enabling the model to capture semantic relationships between variables in addition to syntactic structure. The fine-tuning objective combines standard MLM loss with an edge-prediction loss over the DFG.

- **Developed by:** [GitHub link to be added]
- **Model type:** Transformer encoder (RoBERTa-based), Masked Language Model
- **Languages:** C++, Erlang (training); Python, Java, JavaScript (zero-shot evaluation)
- **License:** MIT
- **Finetuned from:** [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)

### Model Sources

- **Repository:** [To be added]

## Uses

### Direct Use

This model can be used as-is for masked token prediction in C++ and Erlang codebases — for example, code completion suggestions, token-level error detection, or identifier prediction. It can be loaded with the `fill-mask` pipeline:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="<your-model-id>")
fill("int <mask> = 0;")
```

### Downstream Use

The model can serve as a backbone for downstream code intelligence tasks such as the following (a fine-tuning sketch is shown after the list):
- Code search and clone detection
- Defect prediction
- Variable misuse detection
- Cross-language code understanding (given its zero-shot transfer capability to Python, Java, and JavaScript)

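As one illustration of backbone reuse, the sketch below attaches a standard `transformers` sequence-classification head on top of the fine-tuned encoder (for example, for defect prediction). This is a generic recipe rather than part of this repository; `<your-model-id>` is a placeholder and the classification head is newly initialized, so it must be fine-tuned on labeled data before use.

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Placeholder checkpoint id; replace with the actual model repo.
MODEL_ID = "<your-model-id>"

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")

# Reuse the fine-tuned encoder weights and attach a fresh binary classification
# head (the head weights are newly initialized and still need training).
model = RobertaForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

snippet = "int divide(int a, int b) { return a / b; }"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=256)
logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
```
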
### Out-of-Scope Use

- Natural language understanding tasks
- Languages not represented in the training or base model data (performance may degrade significantly)
- Code generation (autoregressive generation — this is an encoder-only model)

## Bias, Risks, and Limitations

- The Erlang training data was collected via a custom scraper and is not publicly available, which limits reproducibility for that language.
- The C++ training data and the Python, Java, and JavaScript evaluation data come from GitHub, which may reflect biases in open-source coding style, library choices, and developer demographics.
- The model may perform poorly on domain-specific dialects or coding conventions not represented in GitHub data.
- No safety filtering beyond what is present in the upstream `codeparrot/github-code-clean` dataset was applied.

### Recommendations

Users should validate the model's predictions on their specific codebase domain before deploying in production. For Erlang specifically, results may vary depending on how closely the target code resembles OTP-style patterns present in the scraping corpus.

## How to Get Started with the Model

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaForMaskedLM.from_pretrained("<your-model-id>")

code = "int <mask> = 42;"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

> **Note:** For full GraphCodeBERT-style inference with data flow graph inputs, refer to the evaluation script in the repository.

## Training Details

### Training Data

| Language | Samples | Source                                   |
|----------|--------:|------------------------------------------|
| C++      | 250,000 | `codeparrot/github-code-clean`           |
| Erlang   | 250,000 | Custom scraper (not publicly available)  |

- **Total training samples:** ~500k code snippets
- **Validation split:** 5% held out from the training data

Python, Java, and JavaScript data from `codeparrot/github-code-clean` was used **only for evaluation** (2,500 samples each).

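The exact data pipeline is not published. As a rough illustration, the C++ portion can be streamed from `codeparrot/github-code-clean` with the `datasets` library and capped at the sample budget above; the `languages` filter follows the `codeparrot/github-code` loading script, and the filtering and deduplication details here are assumptions rather than the authors' script.

```python
from datasets import load_dataset

# Stream the C++ subset and keep the first 250,000 files. The Erlang corpus was
# scraped separately and is not available through this dataset.
ds = load_dataset(
    "codeparrot/github-code-clean",
    split="train",
    streaming=True,
    languages=["C++"],
    trust_remote_code=True,
)

cpp_samples = []
for i, example in enumerate(ds):
    if i >= 250_000:
        break
    cpp_samples.append(example["code"])
```
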
### Training Procedure

The model was fine-tuned with the standard GraphCodeBERT MLM objective, which combines two loss terms (a schematic sketch follows the list):
1. **MLM loss** — predict randomly masked tokens in the code sequence
2. **Edge prediction loss** — predict edges in the data flow graph (DFG)

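Schematically, the two terms are summed into a single training objective. The following PyTorch-style function is an illustrative sketch, not the repository's training code; the equal weighting of the terms and the binary edge-scoring formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(token_logits, token_labels, edge_logits, edge_labels):
    """Illustrative GraphCodeBERT-style objective: MLM loss + DFG edge-prediction loss.

    token_labels uses -100 at unmasked positions (ignored by cross_entropy);
    edge_logits/edge_labels score candidate (node, node) data-flow edges.
    """
    mlm_loss = F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        token_labels.view(-1),
        ignore_index=-100,
    )
    edge_loss = F.binary_cross_entropy_with_logits(edge_logits, edge_labels.float())
    # Equal weighting assumed; the actual run may weight the terms differently.
    return mlm_loss + edge_loss
```
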
#### Training Hyperparameters

| Hyperparameter          | Value |
|-------------------------|-------|
| Batch size              | 32    |
| Epochs                  | 6     |
| Learning rate           | 2e-5  |
| Max sequence length     | 256   |
| Warmup steps            | 2,000 |
| MLM probability         | 0.15  |
| Validation split        | 0.05  |
| Weight decay            | 0.01  |
| Early stopping patience | 3     |
| Training precision      | fp32  |

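For readers who want to approximate this setup with the Hugging Face `Trainer`, the table above maps roughly to the configuration below. This is a hedged sketch of a plain-MLM run (it omits the DFG edge-prediction term); the tiny `snippets` list and the output directory name are placeholders, not the actual training corpus or artifacts.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    RobertaForMaskedLM,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/graphcodebert-base")

# Stand-in corpus; the real run used ~500k C++/Erlang snippets, tokenized to max length 256.
snippets = ["int x = 0;", "area(R) -> math:pi() * R * R."]
encodings = tokenizer(snippets, truncation=True, max_length=256)
train_ds = Dataset.from_dict(dict(encodings))
val_ds = train_ds  # stand-in for the 5% held-out validation split

# Mask 15% of tokens, matching the MLM probability in the table above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="graphcodebert-cpp-erlang-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=6,
    learning_rate=2e-5,
    warmup_steps=2000,
    weight_decay=0.01,
    eval_strategy="epoch",              # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```
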
#### Training Loss Progression

| Epoch | Train Total Loss | Val Total Loss | Train MLM Loss | Val MLM Loss |
|-------|------------------|----------------|----------------|--------------|
| 0     | 0.8930           | 0.5168         | 0.6517         | 0.4587       |
| 1     | 0.5258           | 0.4384         | 0.4806         | 0.4178       |
| 2     | 0.4645           | 0.3999         | 0.4418         | 0.3893       |
| 3     | 0.4346           | 0.3862         | 0.4191         | 0.3773       |
| 4     | 0.4181           | 0.3753         | 0.4062         | 0.3697       |
| 5     | 0.4100           | **0.3701**     | 0.3997         | 0.3650       |

Best checkpoint: **epoch 5** (val loss: 0.3701)

#### Hardware

- **GPU:** NVIDIA Quadro RTX 4000 (16 GB VRAM)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

2,500 samples per language were used for evaluation:
- **C++** — `codeparrot/github-code-clean`
- **Erlang** — custom scraped dataset
- **Python** — `codeparrot/github-code-clean` *(zero-shot)*
- **Java** — `codeparrot/github-code-clean` *(zero-shot)*
- **JavaScript** — `codeparrot/github-code-clean` *(zero-shot)*

#### Metrics

- **Top-1 Accuracy** — fraction of masked tokens where the model's top prediction matches the original token
- **Top-5 Accuracy** — fraction of masked tokens where the correct token appears in the top 5 predictions
- **Perplexity** — exponentiated mean negative log-likelihood over masked tokens (lower is better)

Evaluation uses a mask ratio of 0.15, consistent with training; a sketch of how these metrics can be computed from masked-LM logits follows.

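The sketch below is illustrative rather than the repository's evaluation script; it assumes labels use -100 at unmasked positions (the convention produced by `DataCollatorForLanguageModeling`).

```python
import torch

def masked_lm_metrics(logits, labels):
    """Top-1/top-5 accuracy and perplexity over masked positions only.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) with -100
    at positions that were not masked.
    """
    mask = labels != -100
    masked_logits = logits[mask]   # (n_masked, vocab_size)
    masked_labels = labels[mask]   # (n_masked,)

    top5 = masked_logits.topk(5, dim=-1).indices
    top1_acc = (top5[:, 0] == masked_labels).float().mean().item()
    top5_acc = (top5 == masked_labels.unsqueeze(-1)).any(dim=-1).float().mean().item()

    # Perplexity = exp(mean negative log-likelihood of the true masked tokens).
    nll = torch.nn.functional.cross_entropy(masked_logits, masked_labels)
    perplexity = torch.exp(nll).item()
    return top1_acc, top5_acc, perplexity
```
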
### Results

Final evaluation metrics (after 6 epochs of training):

| Language   | Top-1 Acc | Top-5 Acc | Perplexity |
|------------|----------:|----------:|-----------:|
| C++        | **88.5%** | **94.2%** | **~1.95**  |
| Erlang     | 86.5%     | 93.1%     | ~2.05      |
| Java       | 83.5%     | 91.5%     | ~2.55      |
| Python     | 77.8%     | 88.6%     | ~3.30      |
| JavaScript | 76.2%     | 88.6%     | ~3.35      |

C++ and Erlang (the two training languages) achieve the strongest results. The model shows solid zero-shot transfer to Java, and reasonable transfer to Python and JavaScript despite not being trained on those languages.

#### Summary

The model converges steadily across all 6 epochs. C++ and Erlang show the sharpest perplexity improvements in the first two epochs (from ~5.1 → ~2.1 and ~10.5 → ~2.1 respectively), then plateau. Java, Python, and JavaScript perplexity curves are flatter throughout, consistent with zero-shot generalization rather than direct training signal.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** NVIDIA Quadro RTX 4000 (16 GB VRAM)
- **Hours used:** 24 hours

## Technical Specifications

### Model Architecture and Objective

GraphCodeBERT uses a 12-layer RoBERTa-style transformer encoder. The model takes as input a concatenation of (1) code token sequences and (2) data flow graph node representations, with a 2D attention mask encoding which tokens and DFG nodes can attend to each other. The training objective is a sum of the MLM loss and the DFG edge-prediction loss.

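To make that input layout concrete, the toy sketch below builds such a combined attention mask: code tokens attend to each other, each DFG node attends to the code token it was extracted from, and DFG nodes attend to one another along data-flow edges. The token/node alignment and the edge here are made up for illustration; this is not the repository's preprocessing code.

```python
import torch

# Toy example: 4 code tokens followed by 2 DFG variable nodes.
num_tokens, num_nodes = 4, 2
size = num_tokens + num_nodes
attn = torch.zeros(size, size, dtype=torch.bool)

# 1) Code tokens attend to all code tokens (standard self-attention).
attn[:num_tokens, :num_tokens] = True

# 2) Each DFG node attends to (and is attended by) the code token it came from.
#    Here node 0 maps to token 1 and node 1 to token 3 (made-up alignment).
node_to_token = {0: 1, 1: 3}
for node, tok in node_to_token.items():
    attn[num_tokens + node, tok] = True
    attn[tok, num_tokens + node] = True

# 3) DFG nodes attend along data-flow edges; edge (0 -> 1) means the value of
#    node 1 comes from node 0 (made-up edge).
dfg_edges = [(0, 1)]
for src, dst in dfg_edges:
    attn[num_tokens + dst, num_tokens + src] = True

print(attn.int())
```
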
### Compute Infrastructure

#### Hardware

NVIDIA Quadro RTX 4000, 16 GB VRAM

#### Software

- `transformers` (Hugging Face)
- `torch`
- `tree-sitter` (for DFG extraction during evaluation)

## Citation

If you use this model, please also cite the original GraphCodeBERT paper:

**BibTeX:**
```bibtex
@inproceedings{guo2021graphcodebert,
  title     = {GraphCodeBERT: Pre-training Code Representations with Data Flow},
  author    = {Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey and Fu, Shengyu and Tufano, Michele and Deng, Shao Kun and Clement, Colin and Drain, Dawn and Sundaresan, Neel and Yin, Jian and Jiang, Daxin and Zhou, Ming},
  booktitle = {International Conference on Learning Representations},
  year      = {2021}
}
```

## Model Card Contact

[To be added — link your GitHub or Hugging Face profile here]