# Dataset Builder
This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.
## Project Structure
```
dataset_builder/
├── README.md                          # This file
│
├── data1/                             # DATA1: Domain-Specific Code Dataset
│   ├── main.py                        # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                     # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                        # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py            # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py               # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                    # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py      # Compute stars/keyword statistics
│   ├── compute_statistics.py          # Compute code statistics from JSONL analysis files
│   ├── rename.py                      # Rename repo directories to owner___repo format
│   ├── rename2.py                     # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                 # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py     # Export repo files to CSV grouped by keyword
│   ├── reporting/                     # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                    # Reporting entry point
│   │   ├── visualization.py           # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py          # Scan repo-level metadata
│   │   ├── code_file_stats.py         # File-level code statistics
│   │   ├── code_file_stats_fast.py    # Optimized file-level statistics
│   │   ├── stage_a_stats.py           # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py           # Stage B (clone/filter) statistics
│   │   └── join_insights.py           # Join and cross-analyze insights
│   └── README.md                      # DATA1 dataset documentation
│
├── data2/                             # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/        # README summarization pipeline
│   │   ├── pipeline.py                # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py   # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py  # Extract functions from repos
│   │   ├── schemas.py                 # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt   # Prompt for function extraction
│   │       └── readme_summary.txt     # Prompt for README summarization
│   ├── step22/                        # Function scoring, generation, alignment
│   │   ├── build.py                   # Build tree-sitter language parsers
│   │   ├── func_stat.py               # Extract functions using tree-sitter
│   │   ├── md_stat.py                 # Extract & save README summaries
│   │   ├── emb_qwen_func.py           # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py             # Score READMEs using Qwen embedding model
│   │   ├── function_req.py            # Filter functions by score threshold
│   │   ├── gemini_generation.py       # Generate docstrings using Gemini API
│   │   ├── alignment.py               # Align functions with generated docstrings
│   │   ├── prompt.txt                 # Prompt template for docstring generation
│   │   ├── depend_analysis.py         # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py    # Find functions missing scores
│   │   ├── folder_stat.py             # Repository folder statistics
│   │   ├── ppt.py                     # Visualization of alignment data
│   │   └── debug_parser.py            # Debug tree-sitter parser loading
│   └── README.md                      # DATA2 dataset documentation
│
└── data3/                             # DATA3: Programming Problems Generation Dataset
    ├── main.py                        # RepoAgent: generate docs for repos
    ├── gemini.py                      # Gemini API connectivity test
    ├── load_dataset.py                # Load and inspect datasets
    ├── instruct_generation.py         # Score functions for scientific relevance
    ├── extract_functions.py           # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py        # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py              # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py # Generate problems using Gemini API
    ├── generate_problems_batch.py     # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py    # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                   # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py             # Qwen model batch inference via vLLM
    ├── show_pricing.py                # Display API pricing information
    ├── check_enhanced.py              # Validate enhanced dataset
    ├── check_index_distribution.py    # Check index distribution
    ├── check_match.py                 # Check data matching
    ├── check_relationship.py          # Check data relationships
    ├── is_sci_prompt.txt              # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt             # Prompt variant for scientific classification
    ├── score_prompt.txt               # Prompt: score function relevance
    ├── *.sh                           # Various shell scripts for batch processing
    └── README.md                      # DATA3 dataset documentation
```
## Dataset Building Pipelines
### DATA1: Domain-Specific Code Dataset
**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.
**Pipeline** (executed in order):
1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
- Expand scientific keywords using LLM
- Search GitHub API for repositories matching keywords
- Check relevance using LLM (reads READMEs)
- Clone relevant repos (shallow clone)
- Filter to keep only code files
2. **External Data** (`download_dataset.py`)
- Download ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`)
- Merge crawled repos with ChemPile data
- Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
- Analyze code metrics (lines, comments, functions, tokens)
- Compute keyword and stars statistics
5. **Export** (`scripts/export_files_to_csv.py`)
- Export final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
- Generate statistical reports and visualizations
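The merge-and-deduplicate step keys each file by a hash of its contents so that identical files collected from both GitHub and ChemPile are kept only once. A minimal sketch of that idea, assuming records carry a `content` field (the field names and record shape here are illustrative, not `merge_dataset.py`'s actual schema):

```python
import hashlib

def dedup_by_content_hash(records):
    """Keep the first record seen for each unique file content."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

records = [
    {"path": "repoA/sim.py", "content": "def f(x):\n    return x * 2\n"},
    {"path": "repoB/copy_of_sim.py", "content": "def f(x):\n    return x * 2\n"},
    {"path": "repoA/util.py", "content": "import os\n"},
]
deduped = dedup_by_content_hash(records)
# The byte-identical copy from repoB is dropped; two unique files remain.
```

Hashing file contents (rather than comparing paths or repo names) catches the common case of the same file vendored into multiple repositories.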
### DATA2: Code-Documentation Alignment Dataset
**Goal**: Generate high-quality docstrings for scientific code functions.
**Pipeline** (executed in order):
1. **README Summarization** (`instruction_generation/`)
- Summarize repository READMEs using LLM
- Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`)
- Parse code using tree-sitter to extract functions
- Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
- Copy README summaries to function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
- Score function quality using Qwen embedding model
- Score README quality using Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
- Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
- Generate docstrings using Gemini API
- Budget monitoring with circuit breaker
- Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
- Merge function data with generated docstrings
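The pipeline extracts functions with tree-sitter so that all seven languages share one parsing path. For Python alone, the same idea can be sketched with the stdlib `ast` module (this is an illustrative stand-in, not the tree-sitter code in `func_stat.py`):

```python
import ast

def extract_functions(source: str):
    """Return name, start line, and source text of each function definition."""
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append({
                "name": node.name,
                "start_line": node.lineno,
                # Recover the exact source slice for this node.
                "source": ast.get_source_segment(source, node),
            })
    return funcs

sample = '''\
def add(a, b):
    return a + b

def scale(v, k=2.0):
    return [x * k for x in v]
'''
names = [f["name"] for f in extract_functions(sample)]  # -> ['add', 'scale']
```

tree-sitter generalizes this to C, C++, Java, Go, Rust, and Julia by walking each language's concrete syntax tree instead of a Python-specific AST.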
### DATA3: Programming Problems Generation Dataset
**Goal**: Generate programming problems inspired by scientific code.
**Pipeline** (executed in order):
1. **Documentation Generation** (`main.py`)
- Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
- Extract individual functions from enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
- Score functions for scientific computing relevance
- Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
- Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`)
- Generate programming problems using Gemini API
- Filter by relevance score
- Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
- Enrich generated problems with source code context
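Several generation scripts combine a relevance-score filter with budget monitoring before calling a paid API. A minimal sketch of that pattern, in which the threshold, per-call cost estimate, and record shape are all assumptions for illustration:

```python
class BudgetExceeded(RuntimeError):
    """Raised when the spending cap would be exceeded."""

class BudgetTracker:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Trip the breaker *before* spending past the cap.
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceeded(f"cap of ${self.limit_usd:.2f} reached")
        self.spent_usd += cost_usd

# Hypothetical output of the relevance-scoring step.
scored_functions = [
    {"name": "integrate_ode", "score": 9},
    {"name": "parse_args", "score": 2},
    {"name": "fft_filter", "score": 8},
]

tracker = BudgetTracker(limit_usd=1.00)
problems = []
for func in scored_functions:
    if func["score"] < 7:        # assumed relevance threshold
        continue
    tracker.charge(0.02)         # assumed per-request cost estimate
    problems.append(f"problem inspired by {func['name']}")
```

Charging the tracker before each request, rather than after, guarantees the run stops without ever exceeding the cap.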
## Dependencies
### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`
### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)
### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes
- Many scripts contain hardcoded paths; update them for your environment before running
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Building the full datasets requires significant storage and compute resources
- Most long-running scripts support checkpoint/resume
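The checkpoint/resume behavior mentioned above typically appends one JSONL record per finished item, so a restarted run can skip everything already done. A minimal sketch of that pattern (the `id`/`status` fields are illustrative, not the scripts' actual record format):

```python
import json
import os

def load_done_ids(checkpoint_path: str) -> set:
    """Return the IDs already recorded in the JSONL checkpoint, if any."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as fh:
        return {json.loads(line)["id"] for line in fh if line.strip()}

def run_with_resume(items, checkpoint_path: str) -> int:
    """Process items, appending one JSONL record per finished item.

    Re-running after an interruption skips everything already checkpointed.
    Returns the number of items processed this run.
    """
    done = load_done_ids(checkpoint_path)
    processed = 0
    with open(checkpoint_path, "a") as fh:
        for item in items:
            if item["id"] in done:
                continue
            result = {"id": item["id"], "status": "ok"}  # stand-in for real work
            fh.write(json.dumps(result) + "\n")
            fh.flush()  # make progress durable as we go
            processed += 1
    return processed
```

Appending (rather than rewriting) the checkpoint file keeps earlier progress safe even if the process is killed mid-run.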