---
language:
- code
- en
task_categories:
- text-classification
tags:
- arxiv:2305.06156
license: mit
metrics:
- accuracy
widget:
- text: |-
    Sum two integers</s></s>def sum(a, b):
        return a + b
  example_title: Simple toy
- text: |-
    Look for methods that might be dynamically defined and define them for lookup.</s></s>def respond_to_missing?(name, include_private = false)
      if name == :to_ary || name == :empty?
        false
      else
        return true if mapping(name).present?
        mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
        return false if mounting.nil?
      end
    end
  example_title: Ruby example
- text: |-
    Method that adds a candidate to the party @param c the candidate that will be added to the party</s></s>public void addCandidate(Candidate c)
    {
        this.votes += c.getVotes();
        candidates.add(c);
    }
  example_title: Java example
- text: |-
    we do not need Buffer pollyfill for now</s></s>function(str){
        var ret = new Array(str.length), len = str.length;
        while(len--) ret[len] = str.charCodeAt(len);
        return Uint8Array.from(ret);
    }
  example_title: JavaScript example
pipeline_tag: text-classification
---
## Table of Contents
- [Model Description](#model-description)
- [Model Details](#model-details)
- [Usage](#usage)
- [Limitations](#limitations)
- [Additional Information](#additional-information)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
## Model Description

This model is built on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M-example subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/the-vault-function) to detect inconsistencies between a docstring/comment and the function it describes. It is used to filter noisy examples out of The Vault dataset.

More information:
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
- **Contact:** support.ailab@fpt.com
## Model Details
* Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/)
* License: MIT
* Model type: Transformer encoder-based language model
* Architecture: BERT-base
* Dataset: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/the-vault-function)
* Tokenizer: Byte-Pair Encoding
* Vocabulary size: 50,265
* Sequence length: 512
* Languages: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
* Training details:
    * Self-supervised learning, binary classification
    * Positive class: an original code-docstring pair
    * Negative class: a docstring randomly paired with code from another function (see the sketch below)
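For illustration, the random-pairing scheme could look like the following minimal sketch. The `docstring`/`code` field names and the 1 = positive label convention are assumptions for illustration, not the authors' published pipeline.

```python
import random

def make_training_pairs(examples, seed=0):
    """Build binary-classification pairs from (docstring, code) examples.

    Positive class: a docstring paired with its own code.
    Negative class: the same docstring paired with code drawn at random
    from a different function.
    """
    rng = random.Random(seed)
    pairs = []
    for i, ex in enumerate(examples):
        # Positive: the original pairing (labeling positives as 1 is assumed)
        pairs.append((f"<s>{ex['docstring']}</s></s>{ex['code']}</s>", 1))
        # Negative: swap in the code of a randomly chosen other example
        j = rng.choice([k for k in range(len(examples)) if k != i])
        pairs.append((f"<s>{ex['docstring']}</s></s>{examples[j]['code']}</s>", 0))
    return pairs
```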
## Usage
The input to the model follows this template:

```
<s>{docstring}</s></s>{code}</s>
```

For example:

```python
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# The special tokens are already spelled out in the template, so don't add them again
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
```
Using the model with JAX or PyTorch:

```python
from transformers import AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

# Load the model with JAX
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# Load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```
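Putting the tokenizer and the PyTorch model together, here is a minimal end-to-end inference sketch. Which output index corresponds to a consistent pair is not documented on this card, so treat the label mapping as an assumption and verify it via `model.config.id2label`.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Fsoft-AIC/Codebert-docstring-inconsistency"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Build the input with the <s>{docstring}</s></s>{code}</s> template; the
# special tokens are written out explicitly, so skip adding them again
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
inputs = tokenizer(text, add_special_tokens=False, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)

# The index-to-label mapping below is not documented here; check it first
print(probs, model.config.id2label)
```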
## Limitations
This model is trained on a 5M-example subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially, the model's ability to identify instances that require a deep semantic understanding of the relation between code and docstring may be limited.

The model is also hard to evaluate because no labeled dataset is available. GPT-3.5-turbo was adopted as a reference, measuring the correlation between its scores and the model's. However, that result could be influenced by GPT-3.5-turbo's potential biases and ambiguous conditions. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.
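Such fine-tuning can reuse the standard Hugging Face `Trainer`. The sketch below makes that concrete; the two labeled pairs and the 1 = consistent convention are hypothetical placeholders, to be replaced by a real human-labeled set aligned with `model.config.label2id`.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "Fsoft-AIC/Codebert-docstring-inconsistency"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Hypothetical human-labeled pairs (1 = consistent is an assumed convention)
labeled = [
    ("<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>", 1),
    ("<s>Sum two integers</s></s>def mul(a, b):\n    return a * b</s>", 0),
]

class PairDataset(torch.utils.data.Dataset):
    """Wraps templated (text, label) pairs as tokenized model inputs."""
    def __init__(self, pairs):
        self.pairs = pairs
    def __len__(self):
        return len(self.pairs)
    def __getitem__(self, i):
        text, label = self.pairs[i]
        enc = tokenizer(text, add_special_tokens=False, truncation=True,
                        max_length=512, padding="max_length")
        item = {k: torch.tensor(v) for k, v in enc.items()}
        item["labels"] = torch.tensor(label)
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="codebert-inconsistency-ft", num_train_epochs=1),
    train_dataset=PairDataset(labeled),
)
trainer.train()
```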
## Additional Information
### Licensing Information

MIT License
### Citation Information

```
@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
```