DouDou committed
Commit 900dd38 · verified · 1 parent: 8629ccf

Upload data2/README.md with huggingface_hub

Files changed (1): data2/README.md added (+217 −0)
# DATA2: Code-Documentation Alignment Dataset

## Dataset Overview

DATA2 is a large-scale code-documentation alignment dataset that pairs function-level code samples with AI-generated documentation strings (docstrings). The dataset contains 500,000 function-level code samples extracted from domain-specific repositories, each paired with a comprehensive docstring generated using Google's Gemini model. It is designed for training and evaluating code documentation generation models, code understanding systems, and documentation quality assessment tools.

## Dataset Statistics

- **Total Samples**: 500,000 function-level code samples
- **Total Data Size**: ~2.9 GB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8

## Dataset Structure

The dataset is stored in JSONL format, where each line contains a complete JSON object representing one function sample with its associated documentation.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `language` | String | Programming language of the code (e.g., "python", "java", "rust", "cpp") |
| `name` | String | Function/method name |
| `qualified_name` | String | Fully qualified name of the function (e.g., "ClassName.method_name") |
| `file` | String | Absolute file path in the source repository |
| `start_line` | Integer | Starting line number of the function in the source file |
| `end_line` | Integer | Ending line number of the function in the source file |
| `score` | Float | Relevance score for the function (0.0 to 1.0) |
| `md_summary` | String | Markdown-formatted project summary/README content |
| `md_score` | Float | Quality score for the project summary (0.0 to 1.0) |
| `final_score` | Float | Combined final score (score × md_score) |
| `code_content` | String | Complete function code content (from `start_line` to `end_line`) |
| `results` | Object | Documentation generation results, with the sub-fields below |
| `results.idx` | Integer | Index of the sample in the generation queue |
| `results.status` | String | Generation status: "ok" (success), "error" (failed), or "stopped" |
| `results.output` | String | Generated docstring/documentation (wrapped in a Markdown code block) |

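As a quick sanity check of this schema, one JSONL line can be decoded and validated against the table. The record below is a hypothetical example constructed for illustration, not taken from the dataset.

```python
import json
import math

# A hypothetical record (not from the dataset) shaped like one JSONL line.
line = ('{"language": "python", "name": "load", "qualified_name": "Loader.load", '
        '"file": "/repo/src/loader.py", "start_line": 10, "end_line": 42, '
        '"score": 0.9, "md_summary": "# Project", "md_score": 0.5, '
        '"final_score": 0.45, "code_content": "def load(self): ...", '
        '"results": {"idx": 0, "status": "ok", "output": "```python\\ndoc\\n```"}}')
record = json.loads(line)

# Every top-level field from the table, with its expected JSON-decoded type.
expected_types = {
    "language": str, "name": str, "qualified_name": str, "file": str,
    "start_line": int, "end_line": int, "score": float, "md_summary": str,
    "md_score": float, "final_score": float, "code_content": str,
    "results": dict,
}
assert all(isinstance(record[k], t) for k, t in expected_types.items())

# final_score is the product of the function and summary scores.
assert math.isclose(record["final_score"], record["score"] * record["md_score"])
```
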
### Programming Language Distribution

Based on a sample analysis, the dataset is primarily composed of:

- **Python**: ~90.6% (dominant language)
- **Java**: ~5.2%
- **Rust**: ~2.5%
- **C++**: ~1.3%
- **C**: ~0.5%
- **Go**: <0.1%
- **Other languages**: <0.1%

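A breakdown like the one above can be recomputed over loaded records; `samples` here is assumed to be a list of decoded JSON objects as produced by the loading example later in this README.

```python
from collections import Counter

# Compute per-language percentages (rounded to one decimal place)
# over a list of decoded JSONL records.
def language_distribution(samples):
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}
```
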
## Documentation Generation Process

The documentation strings in this dataset were generated by an LLM through the following process:

1. **Function Extraction**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Each function was paired with its project's README/summary for context
3. **Prompt Engineering**: A structured prompt was used to guide the model in generating comprehensive docstrings
4. **Generation**: The LLM generated detailed docstrings following Python docstring conventions
5. **Quality Control**: Generated documentation was validated and aligned with the original code

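The actual prompt used in steps 2–3 is not published with the dataset; the sketch below is a hypothetical reconstruction of how a record's project summary and code could be combined into a generation prompt, with the template wording being illustrative only.

```python
# Hypothetical prompt construction for steps 2-3; the real template used to
# build the dataset is not published, so this wording is illustrative only.
PROMPT_TEMPLATE = """You are documenting a {language} function from this project:

{md_summary}

Write a comprehensive docstring covering purpose, parameters, return values,
side effects, exceptions, assumptions, and notes for the following code:

{code_content}
"""

def build_prompt(sample):
    # `sample` is one JSONL record with the fields described earlier.
    return PROMPT_TEMPLATE.format(
        language=sample["language"],
        md_summary=sample["md_summary"],
        code_content=sample["code_content"],
    )
```
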
### Documentation Format

The generated docstrings follow a structured format including:

- **Function Purpose**: Clear explanation of what the function does
- **Parameters**: Detailed parameter descriptions with types and meanings
- **Return Values**: Return type and value descriptions
- **Side Effects**: Important side effects or state changes
- **Exceptions**: Potential exceptions and error conditions
- **Assumptions**: Constraints and assumptions about inputs
- **Notes**: Additional context and implementation details

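As an illustration, a hypothetical Python function (not taken from the dataset) documented with these sections might look like:

```python
def normalize_counts(counts, pseudocount=1.0):
    """Normalize raw counts into relative frequencies.

    Parameters:
        counts (list[float]): Non-negative raw counts per feature.
        pseudocount (float): Value added to every count to avoid division
            by zero (Laplace smoothing). Defaults to 1.0.

    Returns:
        list[float]: Frequencies that sum to 1.0.

    Raises:
        ValueError: If `counts` is empty.

    Assumptions:
        All counts are finite and non-negative.

    Notes:
        A larger `pseudocount` flattens the resulting distribution.
    """
    if not counts:
        raise ValueError("counts must be non-empty")
    smoothed = [c + pseudocount for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]
```
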
## Data Source

The dataset is derived from domain-specific code repositories, specifically:

- **Source**: GitHub repositories filtered from a large-scale domain-specific code collection
- **Selection Criteria**: Functions were selected based on:
  - Relevance scores (function-level and project-level)
  - Code quality indicators
  - Domain specificity
- **Coverage**: Functions span multiple domains including biology, chemistry, materials science, medicine, and computational methods

## Dataset Characteristics

1. **High-Quality Documentation**: Each function is paired with comprehensive, AI-generated documentation that follows professional standards
2. **Rich Context**: Documentation is generated with access to both the function code and project-level context (README summaries)
3. **Diverse Code Types**: Covers various programming languages and coding styles
4. **Domain-Specific**: Focuses on scientific and technical domains, providing specialized terminology and use cases
5. **Structured Format**: Consistent JSONL format enables easy parsing and batch processing
6. **Complete Metadata**: Includes file paths, line numbers, and scoring information for traceability

## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the dataset
samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        samples.append(obj)

print(f"Total samples: {len(samples)}")
```

### Accessing Code and Documentation

```python
# Extract code and documentation for a sample
sample = samples[0]

code = sample['code_content']
function_name = sample['name']
language = sample['language']

# Access generated documentation
if sample['results']['status'] == 'ok':
    docstring = sample['results']['output']
    print(f"Function: {function_name}")
    print(f"Documentation:\n{docstring}")
```

### Filtering by Language

```python
# Filter Python functions only
python_samples = [
    s for s in samples
    if s['language'] == 'python' and s['results']['status'] == 'ok'
]

print(f"Python samples with documentation: {len(python_samples)}")
```

### Filtering by Quality Score

```python
# Filter high-quality samples
high_quality = [
    s for s in samples
    if s['final_score'] > 0.15 and s['results']['status'] == 'ok'
]

print(f"High-quality samples: {len(high_quality)}")
```

### Extracting Documentation Only

```python
# Extract all successful documentation strings
documentations = []
for sample in samples:
    if sample['results']['status'] == 'ok':
        doc = {
            'function_name': sample['name'],
            'qualified_name': sample['qualified_name'],
            'language': sample['language'],
            'code': sample['code_content'],
            'docstring': sample['results']['output']
        }
        documentations.append(doc)
```

## Use Cases

This dataset is suitable for:

1. **Code Documentation Generation**: Training models to generate docstrings from code
2. **Documentation Quality Assessment**: Evaluating the quality of generated documentation
3. **Code Understanding**: Training models to understand code semantics
4. **Documentation Completion**: Fine-tuning models for automatic documentation generation
5. **Code-to-Documentation Alignment**: Studying the relationship between code and documentation
6. **Domain-Specific NLP**: Training models on scientific and technical terminology

## Important Notes

1. **File Size**: The dataset file is large (~2.9 GB); ensure sufficient memory and storage when loading
2. **JSONL Format**: Each line is a complete JSON object; the file can be processed line-by-line for memory efficiency
3. **Status Field**: Always check `results.status` before using `results.output`; only the "ok" status indicates successful generation
4. **Code Content**: The `code_content` field contains the complete function code, which may include long implementations
5. **Documentation Format**: Generated documentation is wrapped in a Markdown code block (```python ... ```); you may need to extract the content
6. **Context Dependency**: Documentation quality may vary based on the availability and quality of project README summaries

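Following note 2, the file can be streamed without holding all 500,000 records in memory. A minimal sketch using only the standard library (the path argument is a placeholder):

```python
import json

# Stream the JSONL file line by line instead of loading it whole,
# yielding (name, docstring) only for successfully generated records.
def iter_docstrings(path):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            obj = json.loads(line)
            if obj["results"]["status"] == "ok":
                yield obj["name"], obj["results"]["output"]
```
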
## Data Processing Example

```python
import jsonlines
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    # Remove markdown code block markers
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Process the dataset and extract clean docstrings
processed_samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        if obj['results']['status'] == 'ok':
            clean_docstring = extract_docstring_content(obj['results']['output'])
            processed_samples.append({
                'function': obj['name'],
                'code': obj['code_content'],
                'docstring': clean_docstring,
                'language': obj['language']
            })
```