DouDou committed · Commit 5c31870 · verified · 1 Parent(s): bffe782

Upload data3/README.md with huggingface_hub

Files changed (1): data3/README.md (+260 −0)
# DATA3: Programming Problems Generation Dataset

## Dataset Overview

DATA3 is a large-scale dataset of AI-generated programming problems inspired by real scientific computing code snippets. It contains 22,532 problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.

## Dataset Statistics

- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in the source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response

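The average token figures above can be recomputed from the `usage` field of each record. A minimal sketch, using a tiny in-memory sample (the two records below are illustrative placeholders whose values were chosen to reproduce the quoted averages, not actual dataset content):

```python
from statistics import mean

# Illustrative records mirroring the dataset's `usage` schema
records = [
    {"usage": {"input_tokens": 650, "output_tokens": 5200}},
    {"usage": {"input_tokens": 744, "output_tokens": 5556}},
]

avg_in = mean(r["usage"]["input_tokens"] for r in records)
avg_out = mean(r["usage"]["output_tokens"] for r in records)
print(f"Average input tokens: {avg_in:.0f}")   # 697 on this sample
print(f"Average output tokens: {avg_out:.0f}") # 5378 on this sample
```

On the full dataset, replace `records` with the list produced by the loading example below.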
## Dataset Structure

The dataset is stored in JSONL format, where each line is a complete JSON object representing one programming problem and its solution.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for the generation request |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |

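For orientation, a record skeleton consistent with the field table above might look like the following. All values are illustrative placeholders, not actual dataset content:

```python
import json

# Illustrative record skeleton; every value is a placeholder, not real data.
record = {
    "metadata": {
        "original_index": "0",
        "function_name": "example_function",
        "repo_name": "",           # may be empty
        "path": "",                # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "10",
        "function_end_line": "42",
    },
    "prompt": "(generation prompt text)",
    "response": "## Problem Description\n...\n## Solution\n(code)",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.0,
        "output_cost": 0.0,
        "request_cost": 0.0,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Serialized, this is exactly one line of the JSONL file
line = json.dumps(record)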
48
+ ### Response Structure
49
+
50
+ The `response` field contains a structured markdown document with two main sections:
51
+
52
+ 1. **Problem Description**: A self-contained problem description that:
53
+ - Provides all necessary context and background
54
+ - Clearly states what needs to be implemented
55
+ - Specifies input/output format and constraints
56
+ - Explains domain-specific concepts
57
+ - Does NOT directly reference the original code snippet
58
+
59
+ 2. **Solution**: A comprehensive Python solution that:
60
+ - Accurately solves the problem
61
+ - Includes clear comments explaining the approach
62
+ - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
63
+ - Is complete and runnable
64
+ - Follows best practices for scientific computing
65
+
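This two-section layout can be sanity-checked per record. A minimal sketch with an illustrative response string (the heading names match the extraction examples later in this README; the problem text itself is invented for the example):

```python
import re

# Illustrative response string following the two-section layout
response = (
    "## Problem Description\n"
    "Estimate the derivative of f(x) = x**2 at x = 3 numerically.\n"
    "## Solution\n"
    "```python\n"
    "def grad(f, x, h=1e-6):\n"
    "    return (f(x + h) - f(x - h)) / (2 * h)\n"
    "```\n"
)

# Both section headings should be present in every record's response
has_problem = bool(re.search(r'^## Problem Description', response, re.MULTILINE))
has_solution = bool(re.search(r'^## Solution', response, re.MULTILINE))
print(has_problem and has_solution)  # True
```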
## Problem Categories

The programming problems in this dataset focus on scientific computing concepts:

- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular, biological, and chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science

## Generation Process

The programming problems were generated through the following process:

1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness

### Key Characteristics

- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets

## Usage Guidelines

### Data Loading

```python
import jsonlines  # third-party: pip install jsonlines

# Load the dataset
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```

### Accessing Problem and Solution

```python
# Access a specific problem
problem = problems[0]

# Extract the raw response for this problem
response = problem['response']

# The response contains markdown with Problem Description and Solution sections;
# parse it to extract the problem and solution separately.
```

### Extracting Problem Descriptions

```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Look for the Problem Description section
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract the solution code from a response."""
    # Look for code blocks in the Solution section
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution
for problem in problems[:5]:  # First 5 problems
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```

### Filtering by Language

```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]

print(f"Python-based problems: {len(python_problems)}")
```

### Filtering by Relevance Score

```python
# Filter high-relevance problems
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]

print(f"High-relevance problems: {len(high_relevance)}")
```

### Analyzing Token Usage

```python
# Analyze API usage statistics
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)

print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```

## Use Cases

This dataset is suitable for:

1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems

## Important Notes

1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory is available when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use

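Note 2 above can be put into practice with only the standard library: iterate over the file line by line instead of materializing all 22,532 records at once. A minimal sketch (demonstrated on a tiny temporary file with placeholder records, since the real file is large):

```python
import json
import os
import tempfile

def iter_problems(path):
    """Yield one parsed record per JSONL line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Demonstrate on a tiny temporary JSONL file with placeholder records
sample = [{"usage": {"output_tokens": 10}}, {"usage": {"output_tokens": 20}}]
with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# Single streaming pass: sum output tokens one record at a time
total_out = sum(rec["usage"]["output_tokens"] for rec in iter_problems(path))
os.unlink(path)
print(total_out)  # 30
```

For the real dataset, pass `'programming_problems.jsonl'` to `iter_problems` directly; peak memory then stays at one record rather than the full ~496 MB.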
## Data Processing Example

```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract problem description
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract solution code
    solution_match = re.search(
        r'```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```