# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for code intelligence research in scientific computing.

## Project Structure

```
dataset_builder/
├── README.md                          # This file
│
├── data1/                             # DATA1: Domain-Specific Code Dataset
│   ├── main.py                        # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                     # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                        # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py            # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py               # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                    # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py      # Compute stars/keyword statistics
│   ├── compute_statistics.py          # Compute code statistics from JSONL analysis files
│   ├── rename.py                      # Rename repo directories to owner___repo format
│   ├── rename2.py                     # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                 # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py     # Export repo files to CSV grouped by keyword
│   ├── reporting/                     # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                    # Reporting entry point
│   │   ├── visualization.py           # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py          # Scan repo-level metadata
│   │   ├── code_file_stats.py         # File-level code statistics
│   │   ├── code_file_stats_fast.py    # Optimized file-level statistics
│   │   ├── stage_a_stats.py           # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py           # Stage B (clone/filter) statistics
│   │   └── join_insights.py           # Join and cross-analyze insights
│   └── README.md                      # DATA1 dataset documentation
│
├── data2/                             # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/        # README summarization pipeline
│   │   ├── pipeline.py                # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py   # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py  # Extract functions from repos
│   │   ├── schemas.py                 # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt   # Prompt for function extraction
│   │       └── readme_summary.txt     # Prompt for README summarization
│   ├── step22/                        # Function scoring, generation, alignment
│   │   ├── build.py                   # Build tree-sitter language parsers
│   │   ├── func_stat.py               # Extract functions using tree-sitter
│   │   ├── md_stat.py                 # Extract & save README summaries
│   │   ├── emb_qwen_func.py           # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py             # Score READMEs using Qwen embedding model
│   │   ├── function_req.py            # Filter functions by score threshold
│   │   ├── gemini_generation.py       # Generate docstrings using Gemini API
│   │   ├── alignment.py               # Align functions with generated docstrings
│   │   ├── prompt.txt                 # Prompt template for docstring generation
│   │   ├── depend_analysis.py         # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py    # Find functions missing scores
│   │   ├── folder_stat.py             # Repository folder statistics
│   │   ├── ppt.py                     # Visualization of alignment data
│   │   └── debug_parser.py            # Debug tree-sitter parser loading
│   └── README.md                      # DATA2 dataset documentation
│
├── data3/                             # DATA3: Programming Problems Generation Dataset
│   ├── main.py                        # RepoAgent: generate docs for repos
│   ├── gemini.py                      # Gemini API connectivity test
│   ├── load_dataset.py                # Load and inspect datasets
│   ├── instruct_generation.py         # Score functions for scientific relevance
│   ├── extract_functions.py           # Extract functions from enhanced_dataset.csv
│   ├── extract_functions_v2.py        # Extract functions v2 (better CSV/JSON handling)
│   ├── merge_datasets.py              # Merge res2.csv with dataset_all.csv
│   ├── generate_programming_problems.py  # Generate problems using Gemini API
│   ├── generate_problems_batch.py     # Batch problem generation (OpenAI batch API)
│   ├── generate_problems_openai.py    # Problem generation via OpenAI API
│   ├── enrich_programming_problems.py # Enrich problems with source code context
│   ├── vllm_high.py                   # vLLM-based high-throughput inference
│   ├── vllm_qwen_batch.py             # Qwen model batch inference via vLLM
│   ├── show_pricing.py                # Display API pricing information
│   ├── check_enhanced.py              # Validate enhanced dataset
│   ├── check_index_distribution.py    # Check index distribution
│   ├── check_match.py                 # Check data matching
│   ├── check_relationship.py          # Check data relationships
│   ├── is_sci_prompt.txt              # Prompt: classify code as scientific computing
│   ├── is_sci_prompt1.txt             # Prompt variant for scientific classification
│   ├── score_prompt.txt               # Prompt: score function relevance
│   ├── *.sh                           # Various shell scripts for batch processing
│   └── README.md                      # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand the seed scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance with an LLM that reads each repo's README
   - Shallow-clone the relevant repos
   - Filter the clones to keep only code files

2. **External Data** (`download_dataset.py`)
   - Download ChemPile code dataset from HuggingFace

3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash

4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and stars statistics

5. **Export** (`scripts/export_files_to_csv.py`)
   - Export final dataset to CSV files grouped by keyword

6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations
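The deduplication in step 3 reduces to hashing file contents and keeping the first occurrence of each distinct hash. A minimal sketch of that idea, assuming the data is available as a path-to-content mapping (the function and variable names are illustrative, not the ones in `merge_dataset.py`):

```python
import hashlib

def dedup_by_content(files: dict[str, str]) -> dict[str, str]:
    """Keep the first file seen for each distinct content hash."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for path, content in files.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[path] = content
    return kept
```

Hashing the raw bytes means two files are merged only if they are byte-identical; near-duplicates (whitespace or comment changes) survive this pass.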

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from each repo

2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code using tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)

3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries to function dataset directories

4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using a Qwen embedding model
   - Score README quality using the same embedding model

5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score

6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using Gemini API
   - Budget monitoring with circuit breaker
   - Checkpoint/resume support

7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
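Step 2's extraction relies on tree-sitter so that a single pass covers all seven languages. As a hedged, Python-only illustration of the same idea, the stdlib `ast` module can recover each function's name, line span, and source text:

```python
import ast

def extract_functions(source: str) -> list[dict]:
    """Return name, line span, and source for every function in a Python file.

    A Python-only sketch of what func_stat.py does with tree-sitter across
    Python, C, C++, Java, Go, Rust, and Julia.
    """
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "code": ast.get_source_segment(source, node),
            })
    return funcs
```

The tree-sitter version walks a concrete syntax tree instead, so it also captures comments and works on files that do not parse as valid Python.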

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories

2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from enhanced dataset

3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts

4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data

5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control

6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context
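The budget monitoring in step 5 amounts to a circuit breaker that stops issuing API calls once accumulated spend crosses a hard cap. A minimal sketch of the pattern (the class, limit, and per-call costs here are hypothetical, not taken from `generate_programming_problems.py`):

```python
class BudgetCircuitBreaker:
    """Trip once accumulated API cost reaches a hard budget cap."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    @property
    def tripped(self) -> bool:
        return self.spent_usd >= self.budget_usd

# Demo: with a $1.00 cap and $0.40 per call, the loop stops after call 3.
breaker = BudgetCircuitBreaker(budget_usd=1.0)
processed = []
for i, cost in enumerate([0.4, 0.4, 0.4, 0.4]):
    if breaker.tripped:
        break              # circuit open: stop calling the API
    processed.append(i)    # stand-in for one generation request
    breaker.record(cost)
```

Checking the breaker *before* each call, rather than after, guarantees the cap can be exceeded by at most one call's cost.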

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that need to be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) need to be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
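The checkpoint/resume pattern the long-running scripts share can be sketched as: persist the IDs of finished items to a JSON file after each unit of work, and skip those IDs on restart. The file name and shape below are illustrative, not the format any particular script uses:

```python
import json
from pathlib import Path

def process_with_checkpoint(items, checkpoint_path, work):
    """Process items, persisting completed IDs so a rerun skips them."""
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for item_id in items:
        if item_id in done:
            continue  # already finished in a previous run
        work(item_id)
        done.add(item_id)
        # Checkpoint after every item so a crash loses at most one unit of work
        ckpt.write_text(json.dumps(sorted(done)))
    return done
```

Writing the checkpoint after each item trades extra I/O for crash safety; batching the writes is the usual optimization when items are cheap.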