# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

## Project Structure

```
dataset_builder/
├── README.md                        # This file
│
├── data1/                           # DATA1: Domain-Specific Code Dataset
│   ├── main.py                      # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                   # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                      # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py          # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py             # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                  # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py    # Compute stars/keyword statistics
│   ├── compute_statistics.py        # Compute code statistics from JSONL analysis files
│   ├── rename.py                    # Rename repo directories to owner___repo format
│   ├── rename2.py                   # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml               # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py   # Export repo files to CSV grouped by keyword
│   ├── reporting/                   # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                  # Reporting entry point
│   │   ├── visualization.py         # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py        # Scan repo-level metadata
│   │   ├── code_file_stats.py       # File-level code statistics
│   │   ├── code_file_stats_fast.py  # Optimized file-level statistics
│   │   ├── stage_a_stats.py         # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py         # Stage B (clone/filter) statistics
│   │   └── join_insights.py         # Join and cross-analyze insights
│   └── README.md                    # DATA1 dataset documentation
│
├── data2/                           # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/      # README summarization pipeline
│   │   ├── pipeline.py              # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py # Extract functions from repos
│   │   ├── schemas.py               # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt # Prompt for function extraction
│   │       └── readme_summary.txt   # Prompt for README summarization
│   ├── step22/                      # Function scoring, generation, alignment
│   │   ├── build.py                 # Build tree-sitter language parsers
│   │   ├── func_stat.py             # Extract functions using tree-sitter
│   │   ├── md_stat.py               # Extract & save README summaries
│   │   ├── emb_qwen_func.py         # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py           # Score READMEs using Qwen embedding model
│   │   ├── function_req.py          # Filter functions by score threshold
│   │   ├── gemini_generation.py     # Generate docstrings using Gemini API
│   │   ├── alignment.py             # Align functions with generated docstrings
│   │   ├── prompt.txt               # Prompt template for docstring generation
│   │   ├── depend_analysis.py       # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py  # Find functions missing scores
│   │   ├── folder_stat.py           # Repository folder statistics
│   │   ├── ppt.py                   # Visualization of alignment data
│   │   └── debug_parser.py          # Debug tree-sitter parser loading
│   └── README.md                    # DATA2 dataset documentation
│
├── data3/                           # DATA3: Programming Problems Generation Dataset
│   ├── main.py                      # RepoAgent: generate docs for repos
│   ├── gemini.py                    # Gemini API connectivity test
│   ├── load_dataset.py              # Load and inspect datasets
│   ├── instruct_generation.py       # Score functions for scientific relevance
│   ├── extract_functions.py         # Extract functions from enhanced_dataset.csv
│   ├── extract_functions_v2.py      # Extract functions v2 (better CSV/JSON handling)
│   ├── merge_datasets.py            # Merge res2.csv with dataset_all.csv
│   ├── generate_programming_problems.py  # Generate problems using Gemini API
│   ├── generate_problems_batch.py   # Batch problem generation (OpenAI batch API)
│   ├── generate_problems_openai.py  # Problem generation via OpenAI API
│   ├── enrich_programming_problems.py    # Enrich problems with source code context
│   ├── vllm_high.py                 # VLLM-based high-throughput inference
│   ├── vllm_qwen_batch.py           # Qwen model batch inference via VLLM
│   ├── show_pricing.py              # Display API pricing information
│   ├── check_enhanced.py            # Validate enhanced dataset
│   ├── check_index_distribution.py  # Check index distribution
│   ├── check_match.py               # Check data matching
│   ├── check_relationship.py        # Check data relationships
│   ├── is_sci_prompt.txt            # Prompt: classify code as scientific computing
│   ├── is_sci_prompt1.txt           # Prompt variant for scientific classification
│   ├── score_prompt.txt             # Prompt: score function relevance
│   ├── *.sh                         # Various shell scripts for batch processing
│   └── README.md                    # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM (reads READMEs)
   - Clone relevant repos (shallow clone)
   - Filter to keep only code files
2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and stars statistics
5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code using tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries to function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using the Qwen embedding model
   - Score README quality using the Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
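The "deduplicate by content hash" step in DATA1's merge stage (`merge_dataset.py`) can be sketched as below. This is a minimal illustration, not the script's actual interface: the function name, the `(path, text)` record format, and the line-ending normalization are assumptions.

```python
import hashlib

def dedup_by_content(records):
    """Keep the first file seen for each distinct content hash.

    `records` is an iterable of (path, source_text) pairs; duplicates
    (e.g. the same file present in both a crawled repo and ChemPile)
    are dropped based on a SHA-256 hash of the normalized content.
    """
    seen = set()
    unique = []
    for path, text in records:
        # Normalize line endings so CRLF/LF copies of a file hash identically.
        digest = hashlib.sha256(
            text.replace("\r\n", "\n").encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, text))
    return unique
```

Hashing full content (rather than comparing paths or file names) catches the common case where the same source file appears under different names in the crawled repos and the external dataset.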
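The "filter to keep only code files" step after cloning in DATA1 amounts to an extension-based walk over each repo. A minimal sketch follows; the real extension list lives in `data1/util.py`, so the set shown here is only a stand-in, and the `.git` exclusion is an assumption about what the pipeline skips.

```python
from pathlib import Path

# Illustrative extension set; the pipeline's actual list is defined in data1/util.py.
CODE_EXTENSIONS = {".py", ".c", ".cpp", ".h", ".java", ".go", ".rs", ".jl"}

def code_files(repo_dir):
    """Yield paths of code files under repo_dir, skipping VCS metadata."""
    for path in Path(repo_dir).rglob("*"):
        # Skip anything inside the .git directory of the clone.
        if ".git" in path.parts:
            continue
        if path.is_file() and path.suffix.lower() in CODE_EXTENSIONS:
            yield path
```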
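The checkpoint/resume support mentioned in the notes typically reduces to an append-only JSONL results file whose ids are skipped on restart. The sketch below shows this generic pattern under assumed names (`process_with_checkpoint`, an `"id"` key per record); it is not the exact mechanism any particular script here uses.

```python
import json
from pathlib import Path

def process_with_checkpoint(items, checkpoint_path, process):
    """Run `process` over `items`, appending one JSON line per result.

    On restart, ids already present in the checkpoint file are skipped,
    so a long run (e.g. an API-driven generation pass) resumes where it
    left off instead of re-spending compute or API budget.
    """
    ckpt = Path(checkpoint_path)
    done = set()
    if ckpt.exists():
        with ckpt.open() as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    # Append mode creates the file on the first run and preserves prior results.
    with ckpt.open("a") as f:
        for item_id, payload in items:
            if item_id in done:
                continue  # already processed in a previous run
            result = process(payload)
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()  # persist each result so a crash loses at most one item
```

Because every result is flushed as soon as it is written, an interrupted run can be restarted with the same arguments and only the unfinished items are reprocessed.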