# Dataset Builder
This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.
## Project Structure
```
dataset_builder/
├── README.md                        # This file
│
├── data1/                           # DATA1: Domain-Specific Code Dataset
│   ├── main.py                      # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                   # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                      # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py          # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py             # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                  # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py    # Compute stars/keyword statistics
│   ├── compute_statistics.py        # Compute code statistics from JSONL analysis files
│   ├── rename.py                    # Rename repo directories to owner___repo format
│   ├── rename2.py                   # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml               # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py   # Export repo files to CSV grouped by keyword
│   ├── reporting/                   # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                  # Reporting entry point
│   │   ├── visualization.py         # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py        # Scan repo-level metadata
│   │   ├── code_file_stats.py       # File-level code statistics
│   │   ├── code_file_stats_fast.py  # Optimized file-level statistics
│   │   ├── stage_a_stats.py         # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py         # Stage B (clone/filter) statistics
│   │   └── join_insights.py         # Join and cross-analyze insights
│   └── README.md                    # DATA1 dataset documentation
│
├── data2/                           # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/      # README summarization pipeline
│   │   ├── pipeline.py              # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py # Extract functions from repos
│   │   ├── schemas.py               # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt # Prompt for function extraction
│   │       └── readme_summary.txt   # Prompt for README summarization
│   ├── step22/                      # Function scoring, generation, alignment
│   │   ├── build.py                 # Build tree-sitter language parsers
│   │   ├── func_stat.py             # Extract functions using tree-sitter
│   │   ├── md_stat.py               # Extract & save README summaries
│   │   ├── emb_qwen_func.py         # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py           # Score READMEs using Qwen embedding model
│   │   ├── function_req.py          # Filter functions by score threshold
│   │   ├── gemini_generation.py     # Generate docstrings using Gemini API
│   │   ├── alignment.py             # Align functions with generated docstrings
│   │   ├── prompt.txt               # Prompt template for docstring generation
│   │   ├── depend_analysis.py       # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py  # Find functions missing scores
│   │   ├── folder_stat.py           # Repository folder statistics
│   │   ├── ppt.py                   # Visualization of alignment data
│   │   └── debug_parser.py          # Debug tree-sitter parser loading
│   └── README.md                    # DATA2 dataset documentation
│
└── data3/                           # DATA3: Programming Problems Generation Dataset
    ├── main.py                      # RepoAgent: generate docs for repos
    ├── gemini.py                    # Gemini API connectivity test
    ├── load_dataset.py              # Load and inspect datasets
    ├── instruct_generation.py       # Score functions for scientific relevance
    ├── extract_functions.py         # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py      # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py            # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py # Generate problems using Gemini API
    ├── generate_problems_batch.py   # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py  # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                 # VLLM-based high-throughput inference
    ├── vllm_qwen_batch.py           # Qwen model batch inference via VLLM
    ├── show_pricing.py              # Display API pricing information
    ├── check_enhanced.py            # Validate enhanced dataset
    ├── check_index_distribution.py  # Check index distribution
    ├── check_match.py               # Check data matching
    ├── check_relationship.py        # Check data relationships
    ├── is_sci_prompt.txt            # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt           # Prompt variant for scientific classification
    ├── score_prompt.txt             # Prompt: score function relevance
    ├── *.sh                         # Various shell scripts for batch processing
    └── README.md                    # DATA3 dataset documentation
```
## Dataset Building Pipelines
### DATA1: Domain-Specific Code Dataset
**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.
**Pipeline** (executed in order):
1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
- Expand scientific keywords using LLM
- Search GitHub API for repositories matching keywords
- Check relevance using LLM (reads READMEs)
- Clone relevant repos (shallow clone)
- Filter to keep only code files
2. **External Data** (`download_dataset.py`)
- Download ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`)
- Merge crawled repos with ChemPile data
- Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
- Analyze code metrics (lines, comments, functions, tokens)
- Compute keyword and stars statistics
5. **Export** (`scripts/export_files_to_csv.py`)
- Export final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
- Generate statistical reports and visualizations
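The search in step 1 can be sketched as follows. This is a minimal illustration, not the actual `main.py` logic: the names `build_query` and `search_repos`, the `min_stars` qualifier, and the `GITHUB_TOKEN` environment variable are assumptions. The real pipeline first expands keywords with an LLM and paginates through all results.

```python
import os
import requests

GITHUB_SEARCH_API = "https://api.github.com/search/repositories"

def build_query(keyword, min_stars=10):
    # Compose a GitHub search qualifier string, e.g. "molecular dynamics stars:>=10"
    return f"{keyword} stars:>={min_stars}"

def search_repos(keyword, token=None, per_page=5):
    """Return full names (owner/repo) of the top repositories matching a keyword."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        GITHUB_SEARCH_API,
        params={"q": build_query(keyword), "per_page": per_page},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

if __name__ == "__main__":
    # Requires network access and, for higher rate limits, a GitHub token.
    print(search_repos("molecular dynamics", token=os.environ.get("GITHUB_TOKEN")))
```

Unauthenticated search is rate-limited to a handful of requests per minute, which is why the token is worth configuring even for small crawls.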
### DATA2: Code-Documentation Alignment Dataset
**Goal**: Generate high-quality docstrings for scientific code functions.
**Pipeline** (executed in order):
1. **README Summarization** (`instruction_generation/`)
- Summarize repository READMEs using LLM
- Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`)
- Parse code using tree-sitter to extract functions
- Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
- Copy README summaries to function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
- Score function quality using Qwen embedding model
- Score README quality using Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
- Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
- Generate docstrings using Gemini API
- Budget monitoring with circuit breaker
- Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
- Merge function data with generated docstrings
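The extraction in step 2 can be illustrated with a single-language stand-in. The real `func_stat.py` uses tree-sitter precisely so it can also handle C, C++, Java, Go, Rust, and Julia; the sketch below covers only Python via the stdlib `ast` module, and the function name is hypothetical.

```python
import ast

def extract_functions(source):
    """Return (name, line number, source text) for every function definition.

    Python-only stand-in for the tree-sitter extraction in step22/func_stat.py.
    """
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append((node.name, node.lineno, ast.get_source_segment(source, node)))
    return funcs
```

Methods inside classes are picked up too, since `ast.walk` visits nested nodes; a tree-sitter version would instead run a per-language query over the concrete syntax tree.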
### DATA3: Programming Problems Generation Dataset
**Goal**: Generate programming problems inspired by scientific code.
**Pipeline** (executed in order):
1. **Documentation Generation** (`main.py`)
- Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
- Extract individual functions from enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
- Score functions for scientific computing relevance
- Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
- Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`)
- Generate programming problems using Gemini API
- Filter by relevance score
- Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
- Enrich generated problems with source code context
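The budget monitoring mentioned in both generation stages can be sketched as a small circuit breaker. The class below is a hypothetical simplification (real cost accounting is per-token and provider-specific); spend is tracked in integer cents to avoid floating-point drift.

```python
class BudgetBreaker:
    """Refuse further API calls once the next call would exceed the budget."""

    def __init__(self, budget_cents, cost_per_call_cents):
        self.budget = budget_cents
        self.cost_per_call = cost_per_call_cents
        self.spent = 0

    def allow(self):
        # Check *before* the call so we never overshoot the cap.
        return self.spent + self.cost_per_call <= self.budget

    def record(self):
        self.spent += self.cost_per_call
```

A generation loop then becomes `while breaker.allow(): generate(); breaker.record()`, stopping cleanly when the cap is reached instead of mid-batch.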
## Dependencies
### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`
### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)
### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes
- Scripts contain hardcoded paths; update them for your environment before running
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
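The checkpoint/resume pattern noted above can be sketched as an append-only JSONL ledger of completed item ids. Names here are hypothetical; individual scripts implement their own variants of this idea.

```python
import json
import os

def process_all(items, checkpoint_path, work):
    """Run work(item_id, payload) over items, skipping ids already checkpointed.

    Each completed id is appended as one JSON line, so a crash loses at most
    the in-flight item and a rerun resumes where it left off.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with open(checkpoint_path, "a") as f:
        for item_id, payload in items:
            if item_id in done:
                continue
            work(item_id, payload)
            f.write(json.dumps({"id": item_id}) + "\n")
            f.flush()  # persist progress immediately
```

Appending one line per item keeps the checkpoint cheap to update and trivially mergeable, at the cost of re-reading the whole ledger on startup.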