DouDou committed · Commit 5c31870 · verified · 1 Parent(s): bffe782

Upload data3/README.md with huggingface_hub

Files changed (1): data3/README.md (+260 −0)
# DATA3: Programming Problems Generation Dataset

## Dataset Overview

DATA3 is a large-scale dataset of AI-generated programming problems inspired by real scientific computing code snippets. It contains 22,532 problems, each paired with a comprehensive solution. The problems focus on scientific computing concepts such as numerical algorithms, data analysis, mathematical modeling, and computational methods in chemistry, biology, and physics.

## Dataset Statistics

- **Total Samples**: 22,532 programming problems
- **Total Data Size**: ~496 MB
- **Data Format**: JSONL (JSON Lines, one JSON object per line)
- **Encoding**: UTF-8
- **Primary Language**: Python (dominant in the source code)
- **Average Input Tokens**: ~697 tokens per prompt
- **Average Output Tokens**: ~5,378 tokens per response

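The average token figures above can be recomputed from the `usage` field of each record. A minimal sketch, using a tiny in-memory sample (the two records below are illustrative placeholders whose values were chosen to reproduce the quoted averages, not actual dataset content):

```python
from statistics import mean

# Illustrative records mirroring the dataset's `usage` schema
records = [
    {"usage": {"input_tokens": 650, "output_tokens": 5200}},
    {"usage": {"input_tokens": 744, "output_tokens": 5556}},
]

avg_in = mean(r["usage"]["input_tokens"] for r in records)
avg_out = mean(r["usage"]["output_tokens"] for r in records)
print(f"Average input tokens: {avg_in:.0f}")   # 697 on this sample
print(f"Average output tokens: {avg_out:.0f}") # 5378 on this sample
```

On the full dataset, replace `records` with the list produced by the loading example below.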
## Dataset Structure

The dataset is stored in JSONL format, where each line is a complete JSON object representing one programming problem and its solution.

### Data Field Description

Each JSON object contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `metadata` | Object | Metadata about the source code that inspired the problem |
| `metadata.original_index` | String | Original index of the source function |
| `metadata.function_name` | String | Name of the source function |
| `metadata.repo_name` | String | Repository name (may be empty) |
| `metadata.path` | String | File path (may be empty) |
| `metadata.language` | String | Programming language of the source code |
| `metadata.relevance_score` | Integer | Relevance score of the source function |
| `metadata.function_start_line` | String | Starting line number of the function |
| `metadata.function_end_line` | String | Ending line number of the function |
| `prompt` | String | The prompt used to generate the programming problem |
| `response` | String | Generated response containing the problem description and solution |
| `usage` | Object | API usage statistics for the generation request |
| `usage.input_tokens` | Integer | Number of input tokens used |
| `usage.output_tokens` | Integer | Number of output tokens generated |
| `usage.total_tokens` | Integer | Total tokens (input + output) |
| `usage.input_cost` | Float | Cost for input tokens |
| `usage.output_cost` | Float | Cost for output tokens |
| `usage.request_cost` | Float | Total cost for the request |
| `timestamp` | String | ISO-format timestamp of generation |
| `row_number` | Integer | Row number in the dataset |

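For orientation, a record skeleton consistent with the field table above might look like the following. All values are illustrative placeholders, not actual dataset content:

```python
import json

# Illustrative record skeleton; every value is a placeholder, not real data.
record = {
    "metadata": {
        "original_index": "0",
        "function_name": "example_function",
        "repo_name": "",           # may be empty
        "path": "",                # may be empty
        "language": "Python",
        "relevance_score": 85,
        "function_start_line": "10",
        "function_end_line": "42",
    },
    "prompt": "(generation prompt text)",
    "response": "## Problem Description\n...\n## Solution\n(code)",
    "usage": {
        "input_tokens": 697,
        "output_tokens": 5378,
        "total_tokens": 6075,
        "input_cost": 0.0,
        "output_cost": 0.0,
        "request_cost": 0.0,
    },
    "timestamp": "2024-01-01T00:00:00",
    "row_number": 0,
}

# Serialized, this is exactly one line of the JSONL file
line = json.dumps(record)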
48
+ ### Response Structure
49
+
50
+ The `response` field contains a structured markdown document with two main sections:
51
+
52
+ 1. **Problem Description**: A self-contained problem description that:
53
+ - Provides all necessary context and background
54
+ - Clearly states what needs to be implemented
55
+ - Specifies input/output format and constraints
56
+ - Explains domain-specific concepts
57
+ - Does NOT directly reference the original code snippet
58
+
59
+ 2. **Solution**: A comprehensive Python solution that:
60
+ - Accurately solves the problem
61
+ - Includes clear comments explaining the approach
62
+ - Uses appropriate scientific computing libraries (numpy, scipy, etc.)
63
+ - Is complete and runnable
64
+ - Follows best practices for scientific computing
65
+
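This two-section layout can be sanity-checked per record. A minimal sketch with an illustrative response string (the heading names match the extraction examples later in this README; the problem text itself is invented for the example):

```python
import re

# Illustrative response string following the two-section layout
response = (
    "## Problem Description\n"
    "Estimate the derivative of f(x) = x**2 at x = 3 numerically.\n"
    "## Solution\n"
    "```python\n"
    "def grad(f, x, h=1e-6):\n"
    "    return (f(x + h) - f(x - h)) / (2 * h)\n"
    "```\n"
)

# Both section headings should be present in every record's response
has_problem = bool(re.search(r'^## Problem Description', response, re.MULTILINE))
has_solution = bool(re.search(r'^## Solution', response, re.MULTILINE))
print(has_problem and has_solution)  # True
```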
## Problem Categories

The programming problems in this dataset focus on scientific computing concepts:

- **Numerical Algorithms and Simulations**: Gradient descent, optimization, numerical integration
- **Data Analysis and Visualization**: Statistical analysis, plotting, data processing
- **Mathematical Modeling**: Linear regression, differential equations, statistical models
- **Scientific Data Processing**: Molecular, biological, and chemical data processing
- **Computational Methods**: Methods in chemistry, biology, physics, and materials science

## Generation Process

The programming problems were generated through the following process:

1. **Source Code Selection**: Functions were extracted from domain-specific repositories based on relevance scores
2. **Context Preparation**: Source code snippets were prepared with project context
3. **Prompt Engineering**: A structured prompt was used to guide the generation of programming problems
4. **Problem Generation**: AI models generated self-contained problems inspired by (but not directly copying) the source code
5. **Solution Generation**: Comprehensive solutions were generated for each problem
6. **Quality Control**: Problems and solutions were validated for correctness and completeness

### Key Characteristics

- **Self-Contained**: Each problem includes all necessary context without requiring the original code
- **Inspired, Not Copied**: Problems are inspired by source code but create new, interesting scenarios
- **Complete Solutions**: Every problem includes a working, well-commented solution
- **Domain-Specific**: Problems focus on scientific and technical domains
- **Code-Inspired**: Problems are generated from real scientific computing code snippets

## Usage Guidelines

### Data Loading

```python
import jsonlines  # third-party: pip install jsonlines

# Load the dataset
problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        problems.append(obj)

print(f"Total problems: {len(problems)}")
```

### Accessing Problem and Solution

```python
# Access a specific problem
problem = problems[0]

# Extract the raw response for this problem
response = problem['response']

# The response contains markdown with Problem Description and Solution sections;
# parse it to extract the problem and solution separately.
```

### Extracting Problem Descriptions

```python
import re

def extract_problem_description(response):
    """Extract the problem description from a response."""
    # Look for the Problem Description section
    pattern = r'## Problem Description(.*?)(?=## Solution|$)'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

def extract_solution(response):
    """Extract the solution code from a response."""
    # Look for code blocks in the Solution section
    pattern = r'## Solution.*?```python\s*(.*?)```'
    match = re.search(pattern, response, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None

# Extract problem and solution
for problem in problems[:5]:  # First 5 problems
    problem_desc = extract_problem_description(problem['response'])
    solution = extract_solution(problem['response'])
    print(f"Problem: {problem['metadata']['function_name']}")
    print(f"Description length: {len(problem_desc) if problem_desc else 0} chars")
    print(f"Solution length: {len(solution) if solution else 0} chars")
```

### Filtering by Language

```python
# Filter problems based on source language
python_problems = [
    p for p in problems
    if p['metadata'].get('language', '').lower() == 'python'
]

print(f"Python-based problems: {len(python_problems)}")
```

### Filtering by Relevance Score

```python
# Filter high-relevance problems
high_relevance = [
    p for p in problems
    if p['metadata'].get('relevance_score', 0) >= 80
]

print(f"High-relevance problems: {len(high_relevance)}")
```

### Analyzing Token Usage

```python
# Analyze API usage statistics
total_input_tokens = sum(p['usage']['input_tokens'] for p in problems)
total_output_tokens = sum(p['usage']['output_tokens'] for p in problems)
total_cost = sum(p['usage']['request_cost'] for p in problems)

print(f"Total input tokens: {total_input_tokens:,}")
print(f"Total output tokens: {total_output_tokens:,}")
print(f"Total cost: ${total_cost:.4f}")
```

## Use Cases

This dataset is suitable for:

1. **Content Generation**: Creating programming exercises and problem sets
2. **Code-to-Problem Generation**: Training models to generate problems from code
3. **Problem-Solution Pairing**: Studying the relationship between problems and solutions
4. **Scientific Computing Education**: Teaching numerical methods and scientific programming
5. **Dataset Augmentation**: Expanding programming problem datasets
6. **Code Understanding**: Training models to understand code semantics through problem generation
7. **Automated Tutoring**: Building systems that generate practice problems

## Important Notes

1. **File Size**: The dataset file is moderately large (~496 MB); ensure sufficient memory is available when loading it all at once
2. **JSONL Format**: Each line is a complete JSON object; process line by line for memory efficiency
3. **Response Format**: The `response` field contains markdown-formatted text with problem and solution sections
4. **Code Extraction**: Solutions are embedded in markdown code blocks; parsing may be needed to extract clean code
5. **Metadata Completeness**: Some metadata fields (`repo_name`, `path`, `language`) may be empty for certain samples
6. **Problem Independence**: Each problem is self-contained and does not require the original source code
7. **Solution Correctness**: Solutions are AI-generated; validation may be needed for production use

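Note 2 above can be put into practice with only the standard library: iterate over the file line by line instead of materializing all 22,532 records at once. A minimal sketch (demonstrated on a tiny temporary file with placeholder records, since the real file is large):

```python
import json
import os
import tempfile

def iter_problems(path):
    """Yield one parsed record per JSONL line without loading the whole file."""
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)

# Demonstrate on a tiny temporary JSONL file with placeholder records
sample = [{"usage": {"output_tokens": 10}}, {"usage": {"output_tokens": 20}}]
with tempfile.NamedTemporaryFile('w', suffix='.jsonl', delete=False) as f:
    for rec in sample:
        f.write(json.dumps(rec) + "\n")
    path = f.name

# Single streaming pass: sum output tokens one record at a time
total_out = sum(rec["usage"]["output_tokens"] for rec in iter_problems(path))
os.unlink(path)
print(total_out)  # 30
```

For the real dataset, pass `'programming_problems.jsonl'` to `iter_problems` directly; peak memory then stays at one record rather than the full ~496 MB.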
## Data Processing Example

```python
import jsonlines
import re

def parse_problem_response(response):
    """Parse a response into a structured problem and solution."""
    # Extract problem description
    problem_match = re.search(
        r'## Problem Description\s*\n(.*?)(?=\n## Solution|\Z)',
        response,
        re.DOTALL
    )
    problem_desc = problem_match.group(1).strip() if problem_match else None

    # Extract solution code
    solution_match = re.search(
        r'```python\s*(.*?)```',
        response,
        re.DOTALL
    )
    solution_code = solution_match.group(1).strip() if solution_match else None

    return {
        'problem_description': problem_desc,
        'solution_code': solution_code
    }

# Process dataset
processed_problems = []
with jsonlines.open('programming_problems.jsonl', 'r') as reader:
    for obj in reader:
        parsed = parse_problem_response(obj['response'])
        processed_problems.append({
            'function_name': obj['metadata']['function_name'],
            'language': obj['metadata'].get('language', ''),
            'relevance_score': obj['metadata'].get('relevance_score', 0),
            'problem': parsed['problem_description'],
            'solution': parsed['solution_code'],
            'timestamp': obj['timestamp']
        })

print(f"Processed {len(processed_problems)} problems")
```