QuanTH02 committed on
Commit
a7cf59e
·
verified ·
1 Parent(s): bf5b0ff

Upload 4 files

.gitattributes CHANGED
@@ -21,3 +21,6 @@ logs/model_glen_vault/GLEN_P2_full/checkpoint-6/model.safetensors filter=lfs dif
21
  logs/model_glen_vault/GLEN_P2_full/checkpoint-6/optimizer.pt filter=lfs diff=lfs merge=lfs -text
22
  logs/model_glen_vault/GLEN_P2_full/checkpoint-7/model.safetensors filter=lfs diff=lfs merge=lfs -text
23
  logs/model_glen_vault/GLEN_P2_full/checkpoint-7/optimizer.pt filter=lfs diff=lfs merge=lfs -text
24
+ the_vault_dataset/test.json filter=lfs diff=lfs merge=lfs -text
25
+ the_vault_dataset/train_small.json filter=lfs diff=lfs merge=lfs -text
26
+ the_vault_dataset/validate.json filter=lfs diff=lfs merge=lfs -text
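Each line added to `.gitattributes` above routes one dataset file through Git LFS by way of the `filter=lfs diff=lfs merge=lfs -text` attributes. As a minimal sketch, the tracked paths can be listed as follows. The helper name `lfs_tracked_patterns` is hypothetical, and the parser assumes whitespace-separated fields with no quoted patterns, which holds for the three lines in this hunk but not for every `.gitattributes` file.

```python
def lfs_tracked_patterns(gitattributes_text: str) -> list[str]:
    """Return the path patterns whose attributes include filter=lfs.

    Simplified parser (an assumption of this sketch): fields are separated
    by whitespace and patterns are never quoted.
    """
    patterns = []
    for line in gitattributes_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        pattern, attributes = fields[0], fields[1:]
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


# The three lines added by this commit, without the diff "+" prefix.
ADDED_LINES = """\
the_vault_dataset/test.json filter=lfs diff=lfs merge=lfs -text
the_vault_dataset/train_small.json filter=lfs diff=lfs merge=lfs -text
the_vault_dataset/validate.json filter=lfs diff=lfs merge=lfs -text
"""

tracked = lfs_tracked_patterns(ADDED_LINES)
print(tracked)
```

With these attributes in place, the repository stores only a small pointer for each file, as the pointer hunks at the end of this commit show.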
the_vault_dataset/README.md ADDED
@@ -0,0 +1,326 @@
1
+ ---
2
+ language:
3
+ - code
4
+ - en
5
+ multilinguality:
6
+ - multiprogramming languages
7
+ task_categories:
8
+ - text-generation
9
+ license: mit
10
+ dataset_info:
11
+   features:
12
+   - name: identifier
13
+     dtype: string
14
+   - name: return_type
15
+     dtype: string
16
+   - name: repo
17
+     dtype: string
18
+   - name: path
19
+     dtype: string
20
+   - name: language
21
+     dtype: string
22
+   - name: code
23
+     dtype: string
24
+   - name: code_tokens
25
+     dtype: string
26
+   - name: original_docstring
27
+     dtype: string
28
+   - name: comment
29
+     dtype: string
30
+   - name: docstring_tokens
31
+     dtype: string
32
+   - name: docstring
33
+     dtype: string
34
+   - name: original_string
35
+     dtype: string
36
+ pretty_name: The Vault Function
37
+ viewer: true
38
+ ---
39
+
40
+
41
+
42
+ ## Table of Contents
43
+ - [Dataset Description](#dataset-description)
44
+ - [Dataset Summary](#dataset-summary)
45
+ - [Supported Tasks](#supported-tasks)
46
+ - [Languages](#languages)
47
+ - [Dataset Structure](#dataset-structure)
48
+ - [Data Instances](#data-instances)
49
+ - [Data Fields](#data-fields)
50
+ - [Data Splits](#data-splits)
51
+ - [Dataset Statistics](#dataset-statistics)
52
+ - [Usage](#usage)
53
+ - [Additional Information](#additional-information)
54
+ - [Licensing Information](#licensing-information)
55
+ - [Citation Information](#citation-information)
56
+ - [Contributions](#contributions)
57
+
58
+
59
+ ## Dataset Description
60
+
61
+ - **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
62
+ - **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
63
+ - **Contact:** support.ailab@fpt.com
64
+ - **Website:** https://www.fpt-aicenter.com/ai-residency/
65
+
66
+ <p align="center">
67
+ <img src="https://raw.githubusercontent.com/FSoft-AI4Code/TheVault/main/assets/the-vault-4-logo-png.png" width="300px" alt="logo">
68
+ </p>
69
+
70
+ <div align="center">
71
+
72
+ # The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
73
+ </div>
74
+
75
+
76
+ ## Dataset Summary
77
+ The Vault dataset is a comprehensive, large-scale, multilingual parallel dataset that features high-quality code-text pairs derived from The Stack, the largest permissively-licensed source code dataset.
78
+
79
+ The Vault contains code snippets from 10 popular programming languages: Java, JavaScript, Python, Ruby, Rust, Golang, C#, C++, C, and PHP. The dataset provides multiple code-snippet levels, metadata, and 11 docstring styles for enhanced usability and versatility.
80
+
81
+ ## Supported Tasks
82
+ The Vault can be used for pretraining LLMs or for downstream code-text interaction tasks. Several tasks related to code understanding and generation can be constructed from The Vault, such as *code summarization*, *text-to-code generation*, and *code search*.
83
+
84
+ ## Languages
85
+ The natural language text (docstring) is in English.
86
+
87
+ 10 programming languages are supported in The Vault: `Python`, `Java`, `JavaScript`, `PHP`, `C`, `C#`, `C++`, `Go`, `Ruby`, `Rust`
88
+
89
+ ## Dataset Structure
90
+ ### Data Instances
91
+ ```
92
+ {
93
+
94
+ "hexsha": "5c47f0b4c173a8fd03e4e633d9b3dd8211e67ad0",
95
+ "repo": "neumanna94/beepboop",
96
+ "path": "js/scripts.js",
97
+ "license": [
98
+ "MIT"
99
+ ],
100
+ "language": "JavaScript",
101
+ "identifier": "beepBoopSelector",
102
+ "return_type": "<not_specific>",
103
+ "original_string": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
104
+ "original_docstring": "//Determines what beepBoop function to use",
105
+ "docstring": "Determines what beepBoop function to use",
106
+ "docstring_tokens": [
107
+ "Determines",
108
+ "what",
109
+ "beepBoop",
110
+ "function",
111
+ "to",
112
+ "use"
113
+ ],
114
+ "code": "function beepBoopSelector(inputString, bbFunction){\n if(bbFunction==1){\n return beepBoop(inputString);\n } else if(bbFunction==2){\n return beepBoop2(inputString);\n } else if(bbFunction==3){\n return beepBoop3(inputString);\n } else {\n }\n}",
115
+ "code_tokens": [
116
+ "function",
117
+ "beepBoopSelector",
118
+ "(",
119
+ "inputString",
120
+ ",",
121
+ "bbFunction",
122
+ ")",
123
+ "{",
124
+ "if",
125
+ "(",
126
+ "bbFunction",
127
+ "==",
128
+ "1",
129
+ ")",
130
+ "{",
131
+ "return",
132
+ "beepBoop",
133
+ "(",
134
+ "inputString",
135
+ ")",
136
+ ";",
137
+ "}",
138
+ "else",
139
+ "if",
140
+ "(",
141
+ "bbFunction",
142
+ "==",
143
+ "2",
144
+ ")",
145
+ "{",
146
+ "return",
147
+ "beepBoop2",
148
+ "(",
149
+ "inputString",
150
+ ")",
151
+ ";",
152
+ "}",
153
+ "else",
154
+ "if",
155
+ "(",
156
+ "bbFunction",
157
+ "==",
158
+ "3",
159
+ ")",
160
+ "{",
161
+ "return",
162
+ "beepBoop3",
163
+ "(",
164
+ "inputString",
165
+ ")",
166
+ ";",
167
+ "}",
168
+ "else",
169
+ "{",
170
+ "}",
171
+ "}"
172
+ ],
173
+
174
+ "short_docstring": "Determines what beepBoop function to use",
175
+ "short_docstring_tokens": [
176
+ "Determines",
177
+ "what",
178
+ "beepBoop",
179
+ "function",
180
+ "to",
181
+ "use"
182
+ ],
183
+ "comment": [],
184
+ "parameters": [
185
+ {
186
+ "param": "inputString",
187
+ "type": null
188
+ },
189
+ {
190
+ "param": "bbFunction",
191
+ "type": null
192
+ }
193
+ ],
194
+ "docstring_params": {
195
+ "returns": [],
196
+ "raises": [],
197
+ "params": [
198
+ {
199
+ "identifier": "inputString",
200
+ "type": null,
201
+ "docstring": null,
202
+ "docstring_tokens": [],
203
+ "default": null,
204
+ "is_optional": null
205
+ },
206
+ {
207
+ "identifier": "bbFunction",
208
+ "type": null,
209
+ "docstring": null,
210
+ "docstring_tokens": [],
211
+ "default": null,
212
+ "is_optional": null
213
+ }
214
+ ],
215
+ "outlier_params": [],
216
+ "others": []
217
+ }
218
+ }
219
+
220
+ ```
221
+ ### Data Fields
222
+
223
+ Data fields for function level:
224
+ - **hexsha** (string): the unique git hash of the file
225
+ - **repo** (string): the owner/repo
226
+ - **path** (string): the full path to the original file
227
+ - **license** (list): licenses in the repo
228
+ - **language** (string): the programming language
229
+ - **identifier** (string): the function or method name
230
+ - **return_type** (string): the type returned by the function
231
+ - **original_string** (string): original version of function/class node
232
+ - **original_docstring** (string): the raw string before tokenization or parsing
233
+ - **code** (string): the part of the original that is code
234
+ - **code_tokens** (list): tokenized version of `code`
235
+ - **short_docstring** (string): short, brief summarization (first line of the docstring)
236
+ - **short_docstring_tokens** (list): tokenized version of `short_docstring`
237
+ - **docstring** (string): the top-level comment or docstring (the docstring without the parameter docs, return, exception fields, etc.)
238
+ - **docstring_tokens** (list): tokenized version of `docstring`
239
+ - **comment** (list): list of line comments inside the function/class
240
+ - **parameters** (list): list of parameters and their types (a type can be None)
241
+ - **docstring_params** (dict): dictionary of the information parsed from the docstring
242
+
243
+ See [here](https://github.com/FSoft-AI4Code/TheVault/blob/main/data/README.md) for more details and examples.
244
+
245
+ ### Data Splits
246
+
247
+ In this repo, The Vault is divided into 5 subsets: three training sets that differ in size relative to the full training set, plus a validation set and a test set (approximately 20,000 samples per language in each). Per-language statistics for each split are given in the following section.
248
+
249
+ The dataset is deduplicated before splitting. The three training sets are small (5%), medium (20%), and full (100%).
250
+
251
+ ## Dataset Statistics
252
+
253
+ - Comparison with other benchmarks
254
+
255
+ | Dataset | #Language | #Code-text pair |
256
+ |:--------------------------|----------:|-----------------:|
257
+ | PyMT5 | 1 | ≈ 7,700,000 |
258
+ | CoDesc | 1 | 4,211,516 |
259
+ | CodeSearchNet | 6 | 2,326,976 |
260
+ | CodeSearchNet (CodeXGLUE) | 6 | 1,005,474 |
261
+ | Deepcom | 1 | 424,028 |
262
+ | CONCODE | 1 | 2,184,310 |
263
+ | Funcom | 1 | 2,149,121 |
264
+ | CodeT5 | 8 | 3,158,313 |
265
+ | **The Vault** | **10** | **34,098,775** |
266
+
267
+ - Statistics for the split sets
268
+
269
+
270
+ | | train/small | train/medium | train/full | validation | test | total |
271
+ |:-----------|------------:|-------------:|-----------:|-----------:|-------:|--------------:|
272
+ |Python | 370,657 | 1,952,110 | 7,772,647 | 30,992 | 21,652 | 7,825,291 |
273
+ |Java | 351,213 | 1,612,366 | 6,629,193 | 22,677 | 15,552 | 6,667,422 |
274
+ |JavaScript | 82,931 | 404,729 | 1,640,416 | 22,044 | 21,108 | 1,683,568 |
275
+ |PHP | 236,638 | 1,155,476 | 4,656,371 | 21,375 | 19,010 | 4,696,756 |
276
+ |C | 105,978 | 381,207 | 1,639,319 | 27,525 | 19,122 | 1,685,966 |
277
+ |C# | 141,090 | 783,166 | 3,305,891 | 24,787 | 19,638 | 3,350,316 |
278
+ |C++ | 87,420 | 410,907 | 1,671,268 | 20,011 | 18,169 | 1,709,448 |
279
+ |Go | 267,535 | 1,319,547 | 5,109,020 | 19,102 | 25,314 | 5,153,436 |
280
+ |Ruby | 23,921 | 112,574 | 424,339 | 17,338 | 19,908 | 461,585 |
281
+ |Rust | 35,367 | 224,015 | 825,130 | 16,716 | 23,141 | 864,987 |
282
+ |TOTAL | 1,702,750 | 8,356,097 |33,673,594 |222,567 |202,614 |**34,098,775** |
283
+
284
+ ## Usage
285
+ You can load The Vault with the `datasets` library (install it with `pip install datasets`):
286
+
287
+ ```python
288
+ from datasets import load_dataset
289
+
290
+ # Load full function level dataset (34M samples)
291
+ dataset = load_dataset("Fsoft-AIC/the-vault-function")
292
+
293
+ # Load function level train/validation/test set
294
+ dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"])
295
+
296
+ # Load "small" (or "medium", "full") version of function level training set
297
+ dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train/small"])
298
+
299
+ # specific language (e.g. Python)
300
+ dataset = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], languages=['python'])
301
+
302
+ # dataset streaming
303
+ data = load_dataset("Fsoft-AIC/the-vault-function", split_set=["train"], streaming=True)
304
+ for sample in data['train']:
305
+     print(sample)
306
+ ```
307
+
308
+ A backup copy of the dataset can be downloaded from Azure Blob Storage. See [Download The Vault from Azure blob storage](https://github.com/FSoft-AI4Code/TheVault#download-via-link).
309
+
310
+ ## Additional Information
311
+ ### Licensing Information
312
+ MIT License
313
+
314
+ ### Citation Information
315
+
316
+ ```
317
+ @article{manh2023vault,
318
+ title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
319
+ author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
320
+ journal={arXiv preprint arXiv:2305.06156},
321
+ year={2023}
322
+ }
323
+ ```
324
+
325
+ ### Contributions
326
+ This dataset was developed by the [FSOFT AI4Code team](https://github.com/FSoft-AI4Code).
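The README above names code summarization, text-to-code generation, and code search as tasks that can be built from these records, and its Data Fields list describes `code` and `docstring`. The following is a minimal sketch of how one record might be turned into training examples for those tasks. The function names and the output keys (`source`, `target`, `query`, `document`) are hypothetical choices for illustration and are not part of the dataset or of any library.

```python
# Abridged copy of the README's example record (only the fields used here).
record = {
    "language": "JavaScript",
    "identifier": "beepBoopSelector",
    "docstring": "Determines what beepBoop function to use",
    "code": (
        "function beepBoopSelector(inputString, bbFunction){\n"
        " if(bbFunction==1){\n return beepBoop(inputString);\n"
        " } else if(bbFunction==2){\n return beepBoop2(inputString);\n"
        " } else if(bbFunction==3){\n return beepBoop3(inputString);\n"
        " } else {\n }\n}"
    ),
}


def to_summarization_example(rec: dict) -> dict:
    """Code summarization: read the code, produce the docstring."""
    return {"source": rec["code"], "target": rec["docstring"]}


def to_generation_example(rec: dict) -> dict:
    """Text-to-code generation: read the docstring, produce the code."""
    return {"source": rec["docstring"], "target": rec["code"]}


def to_search_example(rec: dict) -> dict:
    """Code search: the docstring is the query, the code is the match."""
    return {"query": rec["docstring"], "document": rec["code"]}


summarization = to_summarization_example(record)
generation = to_generation_example(record)
search = to_search_example(record)
print(summarization["target"])
```

All three tasks reuse the same code-text pair and differ only in which side is treated as input, which is why a single parallel dataset can serve each of them.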
the_vault_dataset/test.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:913432b92cd1662030e6da8336f1d89a5bd1671ccea98a94207021d33bd6d780
3
+ size 824321169
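The three lines added for `test.json` are a Git LFS pointer, not the data itself: a spec version, the SHA-256 object ID of the real file, and its size in bytes. A minimal sketch of reading such a pointer is shown below. The helper name `parse_lfs_pointer` and the keys of the returned dictionary are hypothetical, and the parser assumes exactly the `key value` layout seen in this hunk.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer made of 'key value' lines."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content for the_vault_dataset/test.json, without the diff "+" prefix.
TEST_JSON_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:913432b92cd1662030e6da8336f1d89a5bd1671ccea98a94207021d33bd6d780
size 824321169
"""

pointer = parse_lfs_pointer(TEST_JSON_POINTER)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

The `size` field shows that `test.json` is roughly 824 MB, so the file has to be fetched through LFS and does not live in the Git history itself.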
the_vault_dataset/train_small.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:efcf6ab58bc3a9a721db07410ee2190baf9f2f3960a80d00978b7f2856e4c5e7
3
+ size 6981341785
the_vault_dataset/validate.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43153718bfbf720eab66761e5cd635659c62154d8461071e88a16cf5fe462741
3
+ size 893956149
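Because each pointer records the SHA-256 digest and byte size of the real file, a downloaded copy can be checked against it. The sketch below streams a file in chunks so that multi-gigabyte files such as `train_small.json` would not need to fit in memory. The function names are hypothetical, and the demonstration runs on a small temporary file, not on the dataset files themselves.

```python
import hashlib
import os
import tempfile


def sha256_and_size(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Hash a file in chunks and count its bytes without loading it whole."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_pointer(path: str, expected_oid: str, expected_size: int) -> bool:
    """Return True when the file's digest and size both match the pointer."""
    actual_oid, actual_size = sha256_and_size(path)
    return actual_oid == expected_oid and actual_size == expected_size


# Demonstration on a small temporary file standing in for a real download.
payload = b"hello vault\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

demo_ok = matches_pointer(demo_path, hashlib.sha256(payload).hexdigest(), len(payload))
demo_mismatch = matches_pointer(demo_path, "0" * 64, len(payload))
os.remove(demo_path)
print(demo_ok, demo_mismatch)
```

For the real `validate.json`, the expected values would be the `oid` digest and `size` from the pointer above (`43153718...462741` and `893956149`), passed in place of the demonstration values.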