Upload 4 files

Files changed (5)

.gitattributes +3 -0
the_vault_dataset/README.md +326 -0
the_vault_dataset/test.json +3 -0
the_vault_dataset/train_small.json +3 -0
the_vault_dataset/validate.json +3 -0

.gitattributes CHANGED

@@ -21,3 +21,6 @@ logs/model_glen_vault/GLEN_P2_full/checkpoint-6/model.safetensors filter=lfs dif
 logs/model_glen_vault/GLEN_P2_full/checkpoint-6/optimizer.pt filter=lfs diff=lfs merge=lfs -text
 logs/model_glen_vault/GLEN_P2_full/checkpoint-7/model.safetensors filter=lfs diff=lfs merge=lfs -text
 logs/model_glen_vault/GLEN_P2_full/checkpoint-7/optimizer.pt filter=lfs diff=lfs merge=lfs -text

+the_vault_dataset/test.json filter=lfs diff=lfs merge=lfs -text
+the_vault_dataset/train_small.json filter=lfs diff=lfs merge=lfs -text
+the_vault_dataset/validate.json filter=lfs diff=lfs merge=lfs -text

the_vault_dataset/README.md ADDED

	@@ -0,0 +1,326 @@

+---
+language:
+- code
+- en
+multilinguality:
+- multiprogramming languages
+task_categories:
+- text-generation
+license: mit
+dataset_info:
+  features:
+  - name: identifier
+    dtype: string
+  - name: return_type
+    dtype: string
+  - name: repo
+    dtype: string
+  - name: path
+    dtype: string
+  - name: language
+    dtype: string
+  - name: code
+    dtype: string
+  - name: code_tokens
+    dtype: string
+  - name: original_docstring
+    dtype: string
+  - name: comment
+    dtype: string
+  - name: docstring_tokens
+    dtype: string
+  - name: docstring
+    dtype: string
+  - name: original_string
+    dtype: string
+pretty_name: The Vault Function
+viewer: true
+---
+## Table of Contents
+- [Dataset Description](#dataset-description)
+- [Dataset Summary](#dataset-summary)
+- [Supported Tasks](#supported-tasks)
+- [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Statistics](#dataset-statistics)
+- [Usage](#usage)
+- [Additional Information](#additional-information)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+## Dataset Description
+- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
+- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
+- **Contact:** support.ailab@fpt.com
+- **Website:** https://www.fpt-aicenter.com/ai-residency/
+<p align="center">
+  <img src="https://raw.githubusercontent.com/FSoft-AI4Code/TheVault/main/assets/the-vault-4-logo-png.png" width="300px" alt="logo">
+</p>
+<div align="center">
+# The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
+</div>
+## Dataset Summary
+The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.
+The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.
+## Supported Tasks
+The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. Several tasks related to code understanding and generation can be constructed from The Vault, such as *code summarization*, *text-to-code generation*, and *code search*.
+## Languages
+The natural language text (docstring) is in English.
+The Vault supports 10 programming languages: `Python`, `Java`, `JavaScript`, `PHP`, `C`, `C#`, `C++`, `Go`, `Ruby`, and `Rust`.
+## Dataset Structure
+### Data Instances
+```
+{
+    "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
+    "repo": "neumanna94/beepboop",
+    "path": "js/scripts.js",
+    "license": [
+        "MIT"
+    ],
+    "language": "JavaScript",
+    "identifier": "beepBoopSelector",
+    "return_type": "<not_specific>",
+    "original_string": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
+    "original_docstring": "//Determines what beepBoop function to use",
+    "docstring": "Determines what beepBoop function to use",
+    "docstring_tokens": [
+        "Determines",
+        "what",
+        "beepBoop",
+        "function",
+        "to",
+        "use"
+    ],
+    "code": "function beepBoopSelector(inputString, bbFunction){\n  if(bbFunction==1){\n    return beepBoop(inputString);\n  } else if(bbFunction==2){\n    return beepBoop2(inputString);\n  } else if(bbFunction==3){\n    return beepBoop3(inputString);\n  } else {\n  }\n}",
+    "code_tokens": [
+        "function",
+        "beepBoopSelector",
+        "(",
+        "inputString",
+        ",",
+        "bbFunction",
+        ")",
+        "{",
+        "if",
+        "(",
+        "bbFunction",
+        "==",
+        "1",
+        ")",
+        "{",
+        "return",
+        "beepBoop",
+        "(",
+        "inputString",
+        ")",
+        ";",
+        "}",
+        "else",
+        "if",
+        "(",
+        "bbFunction",
+        "==",
+        "2",
+        ")",
+        "{",
+        "return",
+        "beepBoop2",
+        "(",
+        "inputString",
+        ")",
+        ";",
+        "}",
+        "else",
+        "if",
+        "(",
+        "bbFunction",
+        "==",
+        "3",
+        ")",
+        "{",
+        "return",
+        "beepBoop3",
+        "(",
+        "inputString",
+        ")",
+        ";",
+        "}",
+        "else",
+        "{",
+        "}",
+        "}"
+    ],
+    "short_docstring": "Determines what beepBoop function to use",
+    "short_docstring_tokens": [
+        "Determines",
+        "what",
+        "beepBoop",
+        "function",
+        "to",
+        "use"
+    ],
+    "comment": [],
+    "parameters": [
+        {
+            "param": "inputString",
+            "type": null
+        },
+        {
+            "param": "bbFunction",
+            "type": null
+        }
+    ],
+    "docstring_params": {
+        "returns": [],
+        "raises": [],
+        "params": [
+            {
+                "identifier": "inputString",
+                "type": null,
+                "docstring": null,
+                "docstring_tokens": [],
+                "default": null,
+                "is_optional": null
+            },
+            {
+                "identifier": "bbFunction",
+                "type": null,
+                "docstring": null,
+                "docstring_tokens": [],
+                "default": null,
+                "is_optional": null
+            }
+        ],
+        "outlier_params": [],
+        "others": []
+    }
+}
+```
+### Data Fields
+Data fields at the function level:
+- **hexsha** (string): the unique git hash of the file
+- **repo** (string): the owner/repo
+- **path** (string): the full path to the original file
+- **license** (list): licenses in the repo
+- **language** (string): the programming language
+- **identifier** (string): the function or method name
+- **return_type** (string): the type returned by the function
+- **original_string** (string): original version of the function/class node
+- **original_docstring** (string): the raw docstring before tokenization or parsing
+- **code** (string): the part of the original that is code
+- **code_tokens** (list): tokenized version of `code`
+- **short_docstring** (string): short summary (first line of the docstring)
+- **short_docstring_tokens** (list): tokenized version of `short_docstring`
+- **docstring** (string): the top-level comment or docstring (without the parameter, return, and exception fields, etc.)
+- **docstring_tokens** (list): tokenized version of `docstring`
+- **comment** (list): list of comment lines inside the function/class
+- **parameters** (list): list of parameters and their types (a type can be None)
+- **docstring_params** (dict): dictionary of the information parsed from the docstring
+See [here](https://github.com/FSoft-AI4Code/TheVault/blob/main/data/README.md) for more details and examples.
+### Data Splits
+In this repo, The Vault is divided into 5 subsets: three training sets of different sizes drawn from the full training set, plus a validation set and a test set (approximately 20,000 samples per language in each). The per-language statistics for each split are given in the following section.
+The dataset is deduplicated before splitting. The three training sets are small (5%), medium (20%), and large (100%, labeled `full` in the statistics table and the loading code).
+## Dataset Statistics
+- Comparison with other benchmarks
+| Dataset                   | #Languages | #Code-text pairs |
+|:--------------------------|----------:|-----------------:|
+| PyMT5                     | 1         | ≈ 7,700,000      |
+| CoDesc                    | 1         | 4,211,516        |
+| CodeSearchNet             | 6         | 2,326,976        |
+| CodeSearchNet (CodeXGLUE) | 6         | 1,005,474        |
+| Deepcom                   | 1         | 424,028          |
+| CONCODE                   | 1         | 2,184,310        |
+| Funcom                    | 1         | 2,149,121        |
+| CodeT5                    | 8         | 3,158,313        |
+| **The Vault**             | **10**    | **34,098,775**   |
+- Statistics for the split sets
+|            | train/small | train/medium | train/full | validation | test   | total         |
+|:-----------|------------:|-------------:|-----------:|-----------:|-------:|--------------:|
+|Python      |   370,657   |  1,952,110   | 7,772,647  | 30,992     | 21,652 | 7,825,291     |
+|Java        |   351,213   |  1,612,366   | 6,629,193  | 22,677     | 15,552 | 6,667,422     |
+|JavaScript  |    82,931   |    404,729   | 1,640,416  | 22,044     | 21,108 | 1,683,568     |
+|PHP         |   236,638   |  1,155,476   | 4,656,371  | 21,375     | 19,010 | 4,696,756     |
+|C           |   105,978   |    381,207   | 1,639,319  | 27,525     | 19,122 | 1,685,966     |
+|C#          |   141,090   |    783,166   | 3,305,891  | 24,787     | 19,638 | 3,350,316     |
+|C++         |    87,420   |    410,907   | 1,671,268  | 20,011     | 18,169 | 1,709,448     |
+|Go          |   267,535   |  1,319,547   | 5,109,020  | 19,102     | 25,314 | 5,153,436     |
+|Ruby        |    23,921   |    112,574   |   424,339  | 17,338     | 19,908 |   461,585     |
+|Rust        |    35,367   |    224,015   |   825,130  | 16,716     | 23,141 |   864,987     |
+|TOTAL       | 1,702,750   |  8,356,097   |33,673,594  |222,567     |202,614 |**34,098,775** |
+## Usage
+You can load The Vault dataset using the `datasets` library (`pip install datasets`):
+```python
+from datasets import load_dataset
+# Load full function level dataset (34M samples)
+dataset = load_dataset("Fsoft-AIC/the-vault-function")
+# Load function level train/validation/test set
+dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])
+# Load "small" (or "medium", "full") version of function level training set
+dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])
+# Load a specific language (e.g. Python)
+dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=["python"])
+# Stream the dataset instead of downloading it in full
+data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
+for sample in data["train"]:
+    print(sample)
+```
+A backup copy of the dataset can be downloaded from Azure Blob Storage. See [Download The Vault from Azure blob storage](https://github.com/FSoft-AI4Code/TheVault#download-via-link).
+## Additional Information
+### Licensing Information
+MIT License
+### Citation Information
+```
+@article{manh2023vault,
+  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
+  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
+  journal={arXiv preprint arXiv:2305.06156},
+  year={2023}
+}
+```
+### Contributions
+This dataset is developed by [FSOFT AI4Code team](https://github.com/FSoft-AI4Code).
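The field descriptions in the README suggest how a record maps onto the code summarization task it mentions: `code` as the input and `docstring` as the target. The following is a minimal sketch under that assumption. The helper name `to_summarization_pairs` and the trimmed `sample` record are illustrative only and are not part of The Vault's tooling.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Trimmed copy of the README's example record; only documented fields are kept.
sample = {
    "repo": "neumanna94/beepboop",
    "path": "js/scripts.js",
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "docstring": "Determines what beepBoop function to use",
    "code": (
        "function beepBoopSelector(inputString, bbFunction){\n"
        "  if(bbFunction==1){\n"
        "    return beepBoop(inputString);\n"
        "  } else if(bbFunction==2){\n"
        "    return beepBoop2(inputString);\n"
        "  } else if(bbFunction==3){\n"
        "    return beepBoop3(inputString);\n"
        "  } else {\n"
        "  }\n"
        "}"
    ),
}


def to_summarization_pairs(
    records: Iterable[dict],
    languages: Optional[Iterable[str]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield (code, docstring) pairs, optionally restricted to some languages.

    Hypothetical helper: records without a docstring are skipped because
    they cannot serve as a summarization target.
    """
    wanted = {lang.lower() for lang in languages} if languages else None
    for record in records:
        if wanted is not None and record["language"].lower() not in wanted:
            continue
        if not record["docstring"]:
            continue
        yield record["code"], record["docstring"]


pairs = list(to_summarization_pairs([sample], languages=["javascript"]))
for code, summary in pairs:
    print(summary)
```

The language filter is case-insensitive because the README's example record stores `"JavaScript"` while its loading snippet passes lowercase names such as `'python'`.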

the_vault_dataset/test.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:913432b92cd1662030e6da8336f1d89a5bd1671ccea98a94207021d33bd6d780
+size 824321169

the_vault_dataset/train_small.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:efcf6ab58bc3a9a721db07410ee2190baf9f2f3960a80d00978b7f2856e4c5e7
+size 6981341785

the_vault_dataset/validate.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:43153718bfbf720eab66761e5cd635659c62154d8461071e88a16cf5fe462741
+size 893956149
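Each of the three JSON files is committed as a Git LFS pointer rather than as the data itself: a `version` line, an `oid` line carrying the hash algorithm and digest, and a `size` line in bytes. As a rough sketch of how such a pointer could be read, the hypothetical helper `parse_lfs_pointer` below (not part of Git LFS or of this repository) parses the `train_small.json` pointer with the leading diff `+` markers stripped.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file.

    Hypothetical helper for illustration; it assumes the three keys shown
    in this commit (version, oid, size) are present.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents for the_vault_dataset/train_small.json, copied from the diff.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:efcf6ab58bc3a9a721db07410ee2190baf9f2f3960a80d00978b7f2856e4c5e7
size 6981341785
"""

info = parse_lfs_pointer(pointer)
print(f"{info['size_bytes'] / 1e9:.2f} GB")  # prints "6.98 GB"
```

Reading the `size` field this way shows how large the real file is before fetching it with Git LFS.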