MatthewsO3 committed · Commit 4805afc · verified · 1 Parent(s): b8dc83b

Update README.md

Files changed (1)
  1. README.md +213 -2
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  datasets:
- - codeparrot/github-code
+ - codeparrot/github-code-clean
  language:
  - en
  metrics:
@@ -17,4 +17,215 @@ tags:
  - erlang
  - graphcodebert
  - mlm
- ---
+ ---

# GraphCodeBERT Fine-tuned on C++ and Erlang (MLM)

A fine-tuned version of [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base) for Masked Language Modeling (MLM) on C++ and Erlang source code. The model is additionally evaluated on Python, Java, and JavaScript to assess cross-lingual transfer.

## Model Details

### Model Description

This model extends GraphCodeBERT's pre-training with continued MLM training on C++ and Erlang code. GraphCodeBERT incorporates data flow graphs (DFGs) alongside token sequences, enabling the model to capture semantic relationships between variables in addition to syntactic structure. The fine-tuning objective combines standard MLM loss with an edge-prediction loss over the DFG.

- **Developed by:** [GitHub link to be added]
- **Model type:** Transformer encoder (RoBERTa-based), Masked Language Model
- **Languages:** C++, Erlang (training); Python, Java, JavaScript (zero-shot evaluation)
- **License:** MIT
- **Finetuned from:** [microsoft/graphcodebert-base](https://huggingface.co/microsoft/graphcodebert-base)

### Model Sources

- **Repository:** [To be added]

## Uses

### Direct Use

This model can be used as-is for masked token prediction in C++ and Erlang codebases — for example, code completion suggestions, token-level error detection, or identifier prediction. It can be loaded with the `fill-mask` pipeline:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="<your-model-id>")
fill("int <mask> = 0;")
```

### Downstream Use

The model can serve as a backbone for downstream code intelligence tasks such as the following (a fine-tuning sketch is shown after the list):
- Code search and clone detection
- Defect prediction
- Variable misuse detection
- Cross-language code understanding (given its zero-shot transfer capability to Python, Java, and JavaScript)

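As one illustration of backbone reuse, the sketch below attaches a standard `transformers` sequence-classification head on top of the fine-tuned encoder (for example, for defect prediction). This is a generic recipe rather than part of this repository; `<your-model-id>` is a placeholder and the classification head is newly initialized, so it must be fine-tuned on labeled data before use.

```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Placeholder checkpoint id; replace with the actual model repo.
MODEL_ID = "<your-model-id>"

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")

# Reuse the fine-tuned encoder weights and attach a fresh binary classification
# head (the head weights are newly initialized and still need training).
model = RobertaForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

snippet = "int divide(int a, int b) { return a / b; }"
inputs = tokenizer(snippet, return_tensors="pt", truncation=True, max_length=256)
logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
```
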
### Out-of-Scope Use

- Natural language understanding tasks
- Languages not represented in the training or base model data (performance may degrade significantly)
- Code generation (autoregressive generation — this is an encoder-only model)

## Bias, Risks, and Limitations

- The Erlang training data was collected via a custom scraper and is not publicly available, which limits reproducibility for that language.
- The C++ training data and the Python, Java, and JavaScript evaluation data come from GitHub, which may reflect biases in open-source coding style, library choices, and developer demographics.
- The model may perform poorly on domain-specific dialects or coding conventions not represented in GitHub data.
- No safety filtering beyond what is present in the upstream `codeparrot/github-code-clean` dataset was applied.

### Recommendations

Users should validate the model's predictions on their specific codebase domain before deploying in production. For Erlang specifically, results may vary depending on how closely the target code resembles OTP-style patterns present in the scraping corpus.

## How to Get Started with the Model

```python
from transformers import RobertaTokenizer, RobertaForMaskedLM
import torch

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaForMaskedLM.from_pretrained("<your-model-id>")

code = "int <mask> = 42;"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_idx].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

> **Note:** For full GraphCodeBERT-style inference with data flow graph inputs, refer to the evaluation script in the repository.

## Training Details

### Training Data

| Language | Samples | Source                                   |
|----------|--------:|------------------------------------------|
| C++      | 250,000 | `codeparrot/github-code-clean`           |
| Erlang   | 250,000 | Custom scraper (not publicly available)  |

- **Total training samples:** ~500k code snippets
- **Validation split:** 5% held out from the training data

Python, Java, and JavaScript data from `codeparrot/github-code-clean` was used **only for evaluation** (2,500 samples each).

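The exact data pipeline is not published. As a rough illustration, the C++ portion can be streamed from `codeparrot/github-code-clean` with the `datasets` library and capped at the sample budget above; the `languages` filter follows the `codeparrot/github-code` loading script, and the filtering and deduplication details here are assumptions rather than the authors' script.

```python
from datasets import load_dataset

# Stream the C++ subset and keep the first 250,000 files. The Erlang corpus was
# scraped separately and is not available through this dataset.
ds = load_dataset(
    "codeparrot/github-code-clean",
    split="train",
    streaming=True,
    languages=["C++"],
    trust_remote_code=True,
)

cpp_samples = []
for i, example in enumerate(ds):
    if i >= 250_000:
        break
    cpp_samples.append(example["code"])
```
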
### Training Procedure

The model was fine-tuned with the standard GraphCodeBERT MLM objective, which combines two loss terms (a schematic sketch follows the list):
1. **MLM loss** — predict randomly masked tokens in the code sequence
2. **Edge prediction loss** — predict edges in the data flow graph (DFG)

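Schematically, the two terms are summed into a single training objective. The following PyTorch-style function is an illustrative sketch, not the repository's training code; the equal weighting of the terms and the binary edge-scoring formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_loss(token_logits, token_labels, edge_logits, edge_labels):
    """Illustrative GraphCodeBERT-style objective: MLM loss + DFG edge-prediction loss.

    token_labels uses -100 at unmasked positions (ignored by cross_entropy);
    edge_logits/edge_labels score candidate (node, node) data-flow edges.
    """
    mlm_loss = F.cross_entropy(
        token_logits.view(-1, token_logits.size(-1)),
        token_labels.view(-1),
        ignore_index=-100,
    )
    edge_loss = F.binary_cross_entropy_with_logits(edge_logits, edge_labels.float())
    # Equal weighting assumed; the actual run may weight the terms differently.
    return mlm_loss + edge_loss
```
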
#### Training Hyperparameters

| Hyperparameter          | Value |
|-------------------------|-------|
| Batch size              | 32    |
| Epochs                  | 6     |
| Learning rate           | 2e-5  |
| Max sequence length     | 256   |
| Warmup steps            | 2,000 |
| MLM probability         | 0.15  |
| Validation split        | 0.05  |
| Weight decay            | 0.01  |
| Early stopping patience | 3     |
| Training precision      | fp32  |

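For readers who want to approximate this setup with the Hugging Face `Trainer`, the table above maps roughly to the configuration below. This is a hedged sketch of a plain-MLM run (it omits the DFG edge-prediction term); the tiny `snippets` list and the output directory name are placeholders, not the actual training corpus or artifacts.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    RobertaForMaskedLM,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizer.from_pretrained("microsoft/graphcodebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/graphcodebert-base")

# Stand-in corpus; the real run used ~500k C++/Erlang snippets, tokenized to max length 256.
snippets = ["int x = 0;", "area(R) -> math:pi() * R * R."]
encodings = tokenizer(snippets, truncation=True, max_length=256)
train_ds = Dataset.from_dict(dict(encodings))
val_ds = train_ds  # stand-in for the 5% held-out validation split

# Mask 15% of tokens, matching the MLM probability in the table above.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="graphcodebert-cpp-erlang-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=6,
    learning_rate=2e-5,
    warmup_steps=2000,
    weight_decay=0.01,
    eval_strategy="epoch",              # `evaluation_strategy` on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```
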
#### Training Loss Progression

| Epoch | Train Total Loss | Val Total Loss | Train MLM Loss | Val MLM Loss |
|-------|------------------|----------------|----------------|--------------|
| 0     | 0.8930           | 0.5168         | 0.6517         | 0.4587       |
| 1     | 0.5258           | 0.4384         | 0.4806         | 0.4178       |
| 2     | 0.4645           | 0.3999         | 0.4418         | 0.3893       |
| 3     | 0.4346           | 0.3862         | 0.4191         | 0.3773       |
| 4     | 0.4181           | 0.3753         | 0.4062         | 0.3697       |
| 5     | 0.4100           | **0.3701**     | 0.3997         | 0.3650       |

Best checkpoint: **epoch 5** (val loss: 0.3701)

#### Hardware

- **GPU:** NVIDIA Quadro RTX 4000 (16 GB VRAM)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

2,500 samples per language were used for evaluation:
- **C++** — `codeparrot/github-code-clean`
- **Erlang** — custom scraped dataset
- **Python** — `codeparrot/github-code-clean` *(zero-shot)*
- **Java** — `codeparrot/github-code-clean` *(zero-shot)*
- **JavaScript** — `codeparrot/github-code-clean` *(zero-shot)*

#### Metrics

- **Top-1 Accuracy** — fraction of masked tokens where the model's top prediction matches the original token
- **Top-5 Accuracy** — fraction of masked tokens where the correct token appears in the top 5 predictions
- **Perplexity** — exponentiated mean negative log-likelihood over masked tokens (lower is better)

Evaluation uses a mask ratio of 0.15, consistent with training; a sketch of how these metrics can be computed from masked-LM logits follows.

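The sketch below is illustrative rather than the repository's evaluation script; it assumes labels use -100 at unmasked positions (the convention produced by `DataCollatorForLanguageModeling`).

```python
import torch

def masked_lm_metrics(logits, labels):
    """Top-1/top-5 accuracy and perplexity over masked positions only.

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len) with -100
    at positions that were not masked.
    """
    mask = labels != -100
    masked_logits = logits[mask]   # (n_masked, vocab_size)
    masked_labels = labels[mask]   # (n_masked,)

    top5 = masked_logits.topk(5, dim=-1).indices
    top1_acc = (top5[:, 0] == masked_labels).float().mean().item()
    top5_acc = (top5 == masked_labels.unsqueeze(-1)).any(dim=-1).float().mean().item()

    # Perplexity = exp(mean negative log-likelihood of the true masked tokens).
    nll = torch.nn.functional.cross_entropy(masked_logits, masked_labels)
    perplexity = torch.exp(nll).item()
    return top1_acc, top5_acc, perplexity
```
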
### Results

Final evaluation metrics (after 6 epochs of training):

| Language   | Top-1 Acc | Top-5 Acc | Perplexity |
|------------|----------:|----------:|-----------:|
| C++        | **88.5%** | **94.2%** | **~1.95**  |
| Erlang     | 86.5%     | 93.1%     | ~2.05      |
| Java       | 83.5%     | 91.5%     | ~2.55      |
| Python     | 77.8%     | 88.6%     | ~3.30      |
| JavaScript | 76.2%     | 88.6%     | ~3.35      |

C++ and Erlang (the two training languages) achieve the strongest results. The model shows solid zero-shot transfer to Java, and reasonable transfer to Python and JavaScript despite not being trained on those languages.

#### Summary

The model converges steadily across all 6 epochs. C++ and Erlang show the sharpest perplexity improvements in the first two epochs (from ~5.1 → ~2.1 and ~10.5 → ~2.1 respectively), then plateau. Java, Python, and JavaScript perplexity curves are flatter throughout, consistent with zero-shot generalization rather than direct training signal.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** NVIDIA Quadro RTX 4000 (16 GB VRAM)
- **Hours used:** 24 hours

## Technical Specifications

### Model Architecture and Objective

GraphCodeBERT uses a 12-layer RoBERTa-style transformer encoder. The model takes as input a concatenation of (1) code token sequences and (2) data flow graph node representations, with a 2D attention mask encoding which tokens and DFG nodes can attend to each other. The training objective is a sum of the MLM loss and the DFG edge-prediction loss.

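To make that input layout concrete, the toy sketch below builds such a combined attention mask: code tokens attend to each other, each DFG node attends to the code token it was extracted from, and DFG nodes attend to one another along data-flow edges. The token/node alignment and the edge here are made up for illustration; this is not the repository's preprocessing code.

```python
import torch

# Toy example: 4 code tokens followed by 2 DFG variable nodes.
num_tokens, num_nodes = 4, 2
size = num_tokens + num_nodes
attn = torch.zeros(size, size, dtype=torch.bool)

# 1) Code tokens attend to all code tokens (standard self-attention).
attn[:num_tokens, :num_tokens] = True

# 2) Each DFG node attends to (and is attended by) the code token it came from.
#    Here node 0 maps to token 1 and node 1 to token 3 (made-up alignment).
node_to_token = {0: 1, 1: 3}
for node, tok in node_to_token.items():
    attn[num_tokens + node, tok] = True
    attn[tok, num_tokens + node] = True

# 3) DFG nodes attend along data-flow edges; edge (0 -> 1) means the value of
#    node 1 comes from node 0 (made-up edge).
dfg_edges = [(0, 1)]
for src, dst in dfg_edges:
    attn[num_tokens + dst, num_tokens + src] = True

print(attn.int())
```
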
### Compute Infrastructure

#### Hardware

NVIDIA Quadro RTX 4000, 16 GB VRAM

#### Software

- `transformers` (Hugging Face)
- `torch`
- `tree-sitter` (for DFG extraction during evaluation)

## Citation

If you use this model, please also cite the original GraphCodeBERT paper:

**BibTeX:**
```bibtex
@inproceedings{guo2021graphcodebert,
  title     = {GraphCodeBERT: Pre-training Code Representations with Data Flow},
  author    = {Guo, Daya and Ren, Shuo and Lu, Shuai and Feng, Zhangyin and Tang, Duyu and Liu, Shujie and Zhou, Long and Duan, Nan and Svyatkovskiy, Alexey and Fu, Shengyu and Tufano, Michele and Deng, Shao Kun and Clement, Colin and Drain, Dawn and Sundaresan, Neel and Yin, Jian and Jiang, Daxin and Zhou, Ming},
  booktitle = {International Conference on Learning Representations},
  year      = {2021}
}
```

## Model Card Contact

[To be added — link your GitHub or Hugging Face profile here]