based-CodeBERTa-language-id-llm-module_uniVienna

This model is a fine-tuned version of malteklaes/based-CodeBERTa-language-id-llm-module.

Model description and Framework version

This model is based on malteklaes/based-CodeBERTa-language-id-llm-module (7 programming languages), which in turn is based on huggingface/CodeBERTa-language-id (6 programming languages).
Tokenizer details:

RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

complete model-config:

RobertaConfig {
  "_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
  "_num_labels": 7,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby",
    "6": "cpp"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "cpp": 6,
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
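
The id2label table in the config is what turns the seven raw output scores of the classification head into a language name. The following is a minimal sketch of that decoding step in plain Python, assuming a softmax over the logits followed by an argmax. The logits are made-up illustration values, not real model output, and the helper names softmax and decode are hypothetical.

```python
import math

# Label mapping copied from the "id2label" entry of the config above.
ID2LABEL = {
    0: "go", 1: "java", 2: "javascript", 3: "php",
    4: "python", 5: "ruby", 6: "cpp",
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map one logit vector (one entry per label) to a label and score."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}


# Made-up logits for illustration only: index 4 ("python") is largest.
example_logits = [-2.0, -1.5, 0.3, -3.0, 6.0, -0.5, -1.0]
result = decode(example_logits)
print(result)
```

The returned dictionary has the same shape as one entry of the pipeline output shown in the Usage section.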

Intended uses & limitations

Given a code snippet, the model predicts which of the following programming languages it is written in:

Go
Java
JavaScript
PHP
Python
Ruby
C++

Usage

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TextClassificationPipeline,
)

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"

# Load the tokenizer and the 7-label classification model once and reuse them.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True)

myPipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)

CODE_TO_IDENTIFY_py = """
def is_prime(n):
    if n <= 1:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n ** 0.5)
    for i in range(3, max_divisor + 1, 2):
        if n % i == 0:
            return False
    return True

number = 17
if is_prime(number):
    print(f"{number} is a prime number.")
else:
    print(f"{number} is not a prime number.")

"""

myPipeline(CODE_TO_IDENTIFY_py) # output: [{'label': 'python', 'score': 0.9999967813491821}]
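
Because the classifier only knows the seven labels listed above, every input is assigned one of them, including code written in other languages. One possible way to handle this is to treat low-scoring predictions as "unknown". The sketch below works on the list-of-dicts format shown in the output comment above. The helper name pick_language and the 0.9 threshold are assumptions for illustration and are not part of the model or of transformers.

```python
def pick_language(predictions, min_score=0.9):
    """Return the top label, or None when the best score is below min_score.

    `predictions` is a list of {"label": str, "score": float} dicts,
    the format shown in the pipeline output above.
    """
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] if top["score"] >= min_score else None


# The prediction from the example above passes the threshold.
confident = [{"label": "python", "score": 0.9999967813491821}]
print(pick_language(confident))

# A hypothetical low-confidence prediction is reported as None.
uncertain = [{"label": "ruby", "score": 0.55}]
print(pick_language(uncertain))
```

A suitable threshold depends on the data and would need to be tuned on held-out examples.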

Training and evaluation data

Training datasets used:

for Go, Java, JavaScript, PHP, Python, Ruby: code_search_net
for C++: malteklaes/cpp-code-code_search_net-style

Training procedure

machine: GPU T4 (Google Colab)
- system-RAM: 4.7/12.7 GB (during training)
- GPU-RAM: 2.8/15.0 GB
- Drive: 69.5/78.5 GB (during training due to complete )
trainer.train(): [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
- total 24136 iterations

Training note

Although this model builds on the predecessors mentioned above, it had to be trained from scratch, because the config.json and the label set of the original model were changed from 6 to 7 programming languages.

Training hyperparameters

The following hyperparameters were used during training (training args):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
    overwrite_output_dir=True,
    num_train_epochs=0.1,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
)

Training results

output:

TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
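
The reported throughput figures can be cross-checked with a little arithmetic. The sketch below uses only numbers from the TrainOutput and the training arguments above. It assumes a single GPU and no gradient accumulation, so that each optimizer step consumes one batch of 8 samples. The dataset-size estimate at the end is a rough extrapolation, not a reported figure.

```python
# Figures copied from the TrainOutput and TrainingArguments above.
global_step = 24136
train_runtime_s = 1936.0586
per_device_train_batch_size = 8
num_train_epochs = 0.1

# Assumption: one GPU and no gradient accumulation, so one step = one batch.
samples_seen = global_step * per_device_train_batch_size
steps_per_second = global_step / train_runtime_s
samples_per_second = samples_seen / train_runtime_s

# Rough extrapolation: the run covered 0.1 epoch, so a full epoch is about 10x.
approx_full_dataset = samples_seen / num_train_epochs

print(f"samples seen:        {samples_seen}")
print(f"steps per second:    {steps_per_second:.3f}")
print(f"samples per second:  {samples_per_second:.3f}")
print(f"approx. dataset size: {approx_full_dataset:.0f}")
```

Under these assumptions the computed rates land close to the reported train_steps_per_second and train_samples_per_second, and the run saw roughly 193,000 samples.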

Safetensors: model size 83.5M params, tensor type F32
