DouDou committed
Commit 900dd38 · verified · 1 parent: 8629ccf

Upload data2/README.md with huggingface_hub

Files changed (1): data2/README.md added (+217 −0)
# DATA2: Code-Documentation Alignment Dataset

## Dataset Overview

DATA2 is a large-scale code-documentation alignment dataset that pairs function-level code samples with AI-generated documentation strings (docstrings). The dataset contains 500,000 function-level code samples extracted from domain-specific repositories, each paired with a comprehensive docstring generated using Google's Gemini model. It is designed for training and evaluating code documentation generation models, code understanding systems, and documentation quality assessment tools.

## Dataset Statistics

- **Total Samples**: 500,000 function-level code samples
- **Total Data Size**: ~2.9 GB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8

## Dataset Structure

The dataset is stored in JSONL format, where each line contains a complete JSON object representing one function sample with its associated documentation.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `language` | String | Programming language of the code (e.g., "python", "java", "rust", "cpp") |
| `name` | String | Function/method name |
| `qualified_name` | String | Fully qualified name of the function (e.g., "ClassName.method_name") |
| `file` | String | Absolute file path in the source repository |
| `start_line` | Integer | Starting line number of the function in the source file |
| `end_line` | Integer | Ending line number of the function in the source file |
| `score` | Float | Relevance score for the function (0.0 to 1.0) |
| `md_summary` | String | Markdown-formatted project summary/README content |
| `md_score` | Float | Quality score for the project summary (0.0 to 1.0) |
| `final_score` | Float | Combined final score (score × md_score) |
| `code_content` | String | Complete function code content (from `start_line` to `end_line`) |
| `results` | Object | Documentation generation results, with the sub-fields below |
| `results.idx` | Integer | Index of the sample in the generation queue |
| `results.status` | String | Generation status: "ok" (success), "error" (failed), or "stopped" |
| `results.output` | String | Generated docstring/documentation (wrapped in a Markdown code block) |

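As a quick sanity check of this schema, one JSONL line can be decoded and validated against the table. The record below is a hypothetical example constructed for illustration, not taken from the dataset.

```python
import json
import math

# A hypothetical record (not from the dataset) shaped like one JSONL line.
line = ('{"language": "python", "name": "load", "qualified_name": "Loader.load", '
        '"file": "/repo/src/loader.py", "start_line": 10, "end_line": 42, '
        '"score": 0.9, "md_summary": "# Project", "md_score": 0.5, '
        '"final_score": 0.45, "code_content": "def load(self): ...", '
        '"results": {"idx": 0, "status": "ok", "output": "```python\\ndoc\\n```"}}')
record = json.loads(line)

# Every top-level field from the table, with its expected JSON-decoded type.
expected_types = {
    "language": str, "name": str, "qualified_name": str, "file": str,
    "start_line": int, "end_line": int, "score": float, "md_summary": str,
    "md_score": float, "final_score": float, "code_content": str,
    "results": dict,
}
assert all(isinstance(record[k], t) for k, t in expected_types.items())

# final_score is the product of the function and summary scores.
assert math.isclose(record["final_score"], record["score"] * record["md_score"])
```
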
### Programming Language Distribution

Based on a sample analysis, the dataset is primarily composed of:

- **Python**: ~90.6% (dominant language)
- **Java**: ~5.2%
- **Rust**: ~2.5%
- **C++**: ~1.3%
- **C**: ~0.5%
- **Go**: <0.1%
- **Other languages**: <0.1%

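A breakdown like the one above can be recomputed over loaded records; `samples` here is assumed to be a list of decoded JSON objects as produced by the loading example later in this README.

```python
from collections import Counter

# Compute per-language percentages (rounded to one decimal place)
# over a list of decoded JSONL records.
def language_distribution(samples):
    counts = Counter(s["language"] for s in samples)
    total = sum(counts.values())
    return {lang: round(100 * n / total, 1) for lang, n in counts.most_common()}
```
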
## Documentation Generation Process

The documentation strings in this dataset were generated by an LLM through the following process:

1. **Function Extraction**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Each function was paired with its project's README/summary for context
3. **Prompt Engineering**: A structured prompt was used to guide the model in generating comprehensive docstrings
4. **Generation**: The LLM generated detailed docstrings following Python docstring conventions
5. **Quality Control**: Generated documentation was validated and aligned with the original code

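The actual prompt used in steps 2–3 is not published with the dataset; the sketch below is a hypothetical reconstruction of how a record's project summary and code could be combined into a generation prompt, with the template wording being illustrative only.

```python
# Hypothetical prompt construction for steps 2-3; the real template used to
# build the dataset is not published, so this wording is illustrative only.
PROMPT_TEMPLATE = """You are documenting a {language} function from this project:

{md_summary}

Write a comprehensive docstring covering purpose, parameters, return values,
side effects, exceptions, assumptions, and notes for the following code:

{code_content}
"""

def build_prompt(sample):
    # `sample` is one JSONL record with the fields described earlier.
    return PROMPT_TEMPLATE.format(
        language=sample["language"],
        md_summary=sample["md_summary"],
        code_content=sample["code_content"],
    )
```
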
### Documentation Format

The generated docstrings follow a structured format including:

- **Function Purpose**: Clear explanation of what the function does
- **Parameters**: Detailed parameter descriptions with types and meanings
- **Return Values**: Return type and value descriptions
- **Side Effects**: Important side effects or state changes
- **Exceptions**: Potential exceptions and error conditions
- **Assumptions**: Constraints and assumptions about inputs
- **Notes**: Additional context and implementation details

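As an illustration, a hypothetical Python function (not taken from the dataset) documented with these sections might look like:

```python
def normalize_counts(counts, pseudocount=1.0):
    """Normalize raw counts into relative frequencies.

    Parameters:
        counts (list[float]): Non-negative raw counts per feature.
        pseudocount (float): Value added to every count to avoid division
            by zero (Laplace smoothing). Defaults to 1.0.

    Returns:
        list[float]: Frequencies that sum to 1.0.

    Raises:
        ValueError: If `counts` is empty.

    Assumptions:
        All counts are finite and non-negative.

    Notes:
        A larger `pseudocount` flattens the resulting distribution.
    """
    if not counts:
        raise ValueError("counts must be non-empty")
    smoothed = [c + pseudocount for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]
```
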
## Data Source

The dataset is derived from domain-specific code repositories, specifically:

- **Source**: GitHub repositories filtered from a large-scale domain-specific code collection
- **Selection Criteria**: Functions were selected based on:
  - Relevance scores (function-level and project-level)
  - Code quality indicators
  - Domain specificity
- **Coverage**: Functions span multiple domains including biology, chemistry, materials science, medicine, and computational methods

## Dataset Characteristics

1. **High-Quality Documentation**: Each function is paired with comprehensive, AI-generated documentation that follows professional standards
2. **Rich Context**: Documentation is generated with access to both the function code and project-level context (README summaries)
3. **Diverse Code Types**: Covers various programming languages and coding styles
4. **Domain-Specific**: Focuses on scientific and technical domains, providing specialized terminology and use cases
5. **Structured Format**: Consistent JSONL format enables easy parsing and batch processing
6. **Complete Metadata**: Includes file paths, line numbers, and scoring information for traceability

## Usage Guidelines

### Data Loading

```python
import jsonlines

# Load the dataset
samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        samples.append(obj)

print(f"Total samples: {len(samples)}")
```

### Accessing Code and Documentation

```python
# Extract code and documentation for a sample
sample = samples[0]

code = sample['code_content']
function_name = sample['name']
language = sample['language']

# Access generated documentation
if sample['results']['status'] == 'ok':
    docstring = sample['results']['output']
    print(f"Function: {function_name}")
    print(f"Documentation:\n{docstring}")
```

### Filtering by Language

```python
# Filter Python functions only
python_samples = [
    s for s in samples
    if s['language'] == 'python' and s['results']['status'] == 'ok'
]

print(f"Python samples with documentation: {len(python_samples)}")
```

### Filtering by Quality Score

```python
# Filter high-quality samples
high_quality = [
    s for s in samples
    if s['final_score'] > 0.15 and s['results']['status'] == 'ok'
]

print(f"High-quality samples: {len(high_quality)}")
```

### Extracting Documentation Only

```python
# Extract all successful documentation strings
documentations = []
for sample in samples:
    if sample['results']['status'] == 'ok':
        doc = {
            'function_name': sample['name'],
            'qualified_name': sample['qualified_name'],
            'language': sample['language'],
            'code': sample['code_content'],
            'docstring': sample['results']['output']
        }
        documentations.append(doc)
```

## Use Cases

This dataset is suitable for:

1. **Code Documentation Generation**: Training models to generate docstrings from code
2. **Documentation Quality Assessment**: Evaluating the quality of generated documentation
3. **Code Understanding**: Training models to understand code semantics
4. **Documentation Completion**: Fine-tuning models for automatic documentation generation
5. **Code-to-Documentation Alignment**: Studying the relationship between code and documentation
6. **Domain-Specific NLP**: Training models on scientific and technical terminology

## Important Notes

1. **File Size**: The dataset file is large (~2.9 GB); ensure sufficient memory and storage when loading
2. **JSONL Format**: Each line is a complete JSON object; the file can be processed line-by-line for memory efficiency
3. **Status Field**: Always check `results.status` before using `results.output`; only the "ok" status indicates successful generation
4. **Code Content**: The `code_content` field contains the complete function code, which may include long implementations
5. **Documentation Format**: Generated documentation is wrapped in a Markdown code block (```python ... ```); you may need to extract the content
6. **Context Dependency**: Documentation quality may vary based on the availability and quality of project README summaries

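Following note 2, the file can be streamed without holding all 500,000 records in memory. A minimal sketch using only the standard library (the path argument is a placeholder):

```python
import json

# Stream the JSONL file line by line instead of loading it whole,
# yielding (name, docstring) only for successfully generated records.
def iter_docstrings(path):
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            obj = json.loads(line)
            if obj["results"]["status"] == "ok":
                yield obj["name"], obj["results"]["output"]
```
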
## Data Processing Example

```python
import jsonlines
import re

def extract_docstring_content(docstring_block):
    """Extract docstring content from a markdown code block."""
    # Remove markdown code block markers
    pattern = r'```(?:python|code)?\s*(.*?)```'
    match = re.search(pattern, docstring_block, re.DOTALL)
    if match:
        return match.group(1).strip()
    return docstring_block.strip()

# Process the dataset and extract clean docstrings
processed_samples = []
with jsonlines.open('alignment.jsonl', 'r') as reader:
    for obj in reader:
        if obj['results']['status'] == 'ok':
            clean_docstring = extract_docstring_content(obj['results']['output'])
            processed_samples.append({
                'function': obj['name'],
                'code': obj['code_content'],
                'docstring': clean_docstring,
                'language': obj['language']
            })
```