# DATA2: Code-Documentation Alignment Dataset

## Dataset Overview

DATA2 is a large-scale code-documentation alignment dataset that pairs function-level code samples with AI-generated documentation strings (docstrings). The dataset contains 500,000 function-level code samples extracted from domain-specific repositories, each paired with a comprehensive docstring generated using Google's Gemini model. This dataset is designed for training and evaluating code documentation generation models, code understanding systems, and documentation quality assessment tools.

## Dataset Statistics

- **Total Samples**: 500,000 function-level code samples
- **Total Data Size**: ~2.9 GB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8

## Dataset Structure

The dataset is stored in JSONL format, where each line contains a complete JSON object representing one function sample with its associated documentation.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `language` | String | Programming language of the code (e.g., "python", "java", "rust", "cpp") |
| `name` | String | Function/method name |
| `qualified_name` | String | Fully qualified name of the function (e.g., "ClassName.method_name") |
| `file` | String | Absolute file path in the source repository |
| `start_line` | Integer | Starting line number of the function in the source file |
| `end_line` | Integer | Ending line number of the function in the source file |
| `score` | Float | Relevance score for the function (0.0 to 1.0) |
| `md_summary` | String | Markdown-formatted project summary/README content |
| `md_score` | Float | Quality score for the project summary (0.0 to 1.0) |
| `final_score` | Float | Combined final score (score × md_score) |
| `code_content` | String | Complete function code content (from start_line to end_line) |
| `results` | Object | Documentation generation results, containing: |
| `results.idx` | Integer | Index of the sample in the generation queue |
| `results.status` | String | Generation status: "ok" (success), "error" (failed), or "stopped" |
| `results.output` | String | Generated docstring/documentation (in code block format) |
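To make the schema concrete, a single JSONL line can be sketched as the following record. Every value below is invented for illustration; only the field names and types follow the table above.

```python
import json

# Hypothetical record -- all values are invented for illustration;
# only the field names and types mirror the schema table.
record = {
    "language": "python",
    "name": "normalize",
    "qualified_name": "Matrix.normalize",
    "file": "/repos/example/src/matrix.py",
    "start_line": 42,
    "end_line": 57,
    "score": 0.8,
    "md_summary": "# Example Project\nA toy linear-algebra library.",
    "md_score": 0.5,
    "final_score": 0.8 * 0.5,  # final_score = score x md_score
    "code_content": "def normalize(self):\n    ...",
    "results": {
        "idx": 0,
        "status": "ok",
        "output": "```python\n\"\"\"Normalize the matrix in place.\"\"\"\n```",
    },
}

# One dataset line is exactly one such object serialized as JSON.
line = json.dumps(record)
assert json.loads(line)["results"]["status"] == "ok"
```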
### Programming Language Distribution

Based on a sample analysis, the dataset is primarily composed of:

- **Python**: ~90.6% (dominant language)
- **Java**: ~5.2%
- **Rust**: ~2.5%
- **C++**: ~1.3%
- **C**: ~0.5%
- **Go**: <0.1%
- **Other languages**: <0.1%

## Documentation Generation Process

The documentation strings in this dataset were generated by an LLM through the following process:

1. **Function Extraction**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Each function was paired with its project's README/summary for context
3. **Prompt Engineering**: A structured prompt was used to guide the model in generating comprehensive docstrings
4. **Generation**: The LLM generated detailed docstrings following Python docstring conventions
5. **Quality Control**: Generated documentation was validated and aligned with the original code
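The steps above can be sketched as a small driver loop. This is a reconstruction, not the released pipeline: `build_prompt`, `generate_docs`, and the prompt wording are assumptions, and `call_llm` stands in for the actual Gemini API call.

```python
def build_prompt(md_summary: str, code: str) -> str:
    # Steps 2-3: pair the function with its project README and a structured request
    return (
        "Project context:\n" + md_summary + "\n\n"
        "Write a comprehensive Python-convention docstring for this function:\n"
        + code
    )

def generate_docs(samples, call_llm):
    # Steps 4-5: generate, then record a per-sample `results` object
    for idx, sample in enumerate(samples):
        try:
            output = call_llm(build_prompt(sample["md_summary"], sample["code_content"]))
            sample["results"] = {"idx": idx, "status": "ok", "output": output}
        except Exception:
            sample["results"] = {"idx": idx, "status": "error", "output": ""}
    return samples

# Usage with a stub in place of the real model:
demo = [{"md_summary": "# Demo", "code_content": "def f(): pass"}]
generate_docs(demo, call_llm=lambda prompt: "```python\n\"\"\"...\"\"\"\n```")
```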
### Documentation Format

The generated docstrings follow a structured format including:

- **Function Purpose**: Clear explanation of what the function does
- **Parameters**: Detailed parameter descriptions with types and meanings
- **Return Values**: Return type and value descriptions
- **Side Effects**: Important side effects or state changes
- **Exceptions**: Potential exceptions and error conditions
- **Assumptions**: Constraints and assumptions about inputs
- **Notes**: Additional context and implementation details

## Data Source

The dataset is derived from domain-specific code repositories:

- **Source**: GitHub repositories filtered from a large-scale domain-specific code collection
- **Selection Criteria**: Functions were selected based on:
  - Relevance scores (function-level and project-level)
  - Code quality indicators
  - Domain specificity
- **Coverage**: Functions span multiple domains including biology, chemistry, materials science, medicine, and computational methods

## Dataset Characteristics

1. **High-Quality Documentation**: Each function is paired with comprehensive, AI-generated documentation that follows professional standards
2. **Rich Context**: Documentation is generated with access to both the function code and project-level context (README summaries)
3. **Diverse Code Types**: Covers various programming languages and coding styles
4. **Domain-Specific**: Focuses on scientific and technical domains, providing specialized terminology and use cases
5. **Structured Format**: Consistent JSONL format enables easy parsing and batch processing
6. **Complete Metadata**: Includes file paths, line numbers, and scoring information for traceability
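A docstring following the section layout listed under "Documentation Format" above might look like the following hypothetical example. The function and its wording are invented; only the section headings come from that list.

```python
def dilute(concentration: float, factor: float) -> float:
    """Compute the concentration of a solution after dilution.

    Parameters:
        concentration (float): Initial concentration in mol/L; must be >= 0.
        factor (float): Dilution factor; must be > 0.

    Returns:
        float: The diluted concentration in mol/L.

    Raises:
        ValueError: If `factor` is not positive.

    Assumptions:
        The solution is ideal and well mixed.

    Notes:
        Implements C2 = C1 / factor.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return concentration / factor

assert dilute(1.0, 4.0) == 0.25
```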
## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the dataset
samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        samples.append(obj)

print(f"Total samples: {len(samples)}")
```

### Accessing Code and Documentation

```python
# Extract code and documentation for a sample
sample = samples[0]
code = sample['code_content']
function_name = sample['name']
language = sample['language']

# Access generated documentation
if sample['results']['status'] == 'ok':
    docstring = sample['results']['output']
    print(f"Function: {function_name}")
    print(f"Documentation:\n{docstring}")
```

### Filtering by Language

```python
# Filter Python functions only
python_samples = [
    s for s in samples
    if s['language'] == 'python' and s['results']['status'] == 'ok'
]
print(f"Python samples with documentation: {len(python_samples)}")
```

### Filtering by Quality Score

```python
# Filter high-quality samples
high_quality = [
    s for s in samples
    if s['final_score'] > 0.15 and s['results']['status'] == 'ok'
]
print(f"High-quality samples: {len(high_quality)}")
```

### Extracting Documentation Only

```python
# Extract all successful documentation strings
documentations = []
for sample in samples:
    if sample['results']['status'] == 'ok':
        doc = {
            'function_name': sample['name'],
            'qualified_name': sample['qualified_name'],
            'language': sample['language'],
            'code': sample['code_content'],
            'docstring': sample['results']['output']
        }
        documentations.append(doc)
```

## Use Cases

This dataset is suitable for:

1. **Code Documentation Generation**: Training models to generate docstrings from code
2. **Documentation Quality Assessment**: Evaluating the quality of generated documentation
3. **Code Understanding**: Training models to understand code semantics
4. **Documentation Completion**: Fine-tuning models for automatic documentation generation
5. **Code-to-Documentation Alignment**: Studying the relationship between code and documentation
6. **Domain-Specific NLP**: Training models on scientific and technical terminology
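The per-language percentages reported earlier can be recomputed from any loaded subset with the same filtering style as above. A minimal sketch (field names as in the schema; the helper name is ours):

```python
from collections import Counter

def language_distribution(samples):
    # Fraction of samples per `language` value, sorted most common first
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Usage on a toy subset:
toy = [{"language": "python"}, {"language": "python"}, {"language": "java"}]
dist = language_distribution(toy)
```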
## Important Notes

1. **File Size**: The dataset file is large (~2.9 GB); ensure sufficient memory and storage when loading
2. **JSONL Format**: Each line is a complete JSON object; the file can be processed line-by-line for memory efficiency
3. **Status Field**: Always check `results.status` before using `results.output`; only "ok" status indicates successful generation
4. **Code Content**: The `code_content` field contains the complete function code, which may include long implementations
5. **Documentation Format**: Generated documentation is wrapped in a markdown code block (`` ```python ... ``` ``); you may need to extract the content
6. **Context Dependency**: Documentation quality may vary with the availability and quality of project README summaries

## Data Processing Example

```python
import jsonlines
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    # Remove markdown code block markers
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Process dataset and extract clean docstrings
processed_samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        if obj['results']['status'] == 'ok':
            clean_docstring = extract_docstring_content(obj['results']['output'])
            processed_samples.append({
                'function': obj['name'],
                'code': obj['code_content'],
                'docstring': clean_docstring,
                'language': obj['language']
            })
```
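Because the file is ~2.9 GB, the full-load pattern shown earlier may not fit in memory. A streaming variant using only the standard library (no `jsonlines` dependency; the helper name is ours) processes one record at a time:

```python
import json

def iter_samples(path):
    # Yield one parsed record per JSONL line without loading the whole file
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Usage: count successful generations without holding all samples in memory
# ok_count = sum(1 for s in iter_samples('alignment.jsonl')
#                if s['results']['status'] == 'ok')
```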