# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

## Project Structure
```
dataset_builder/
├── README.md                          # This file
│
├── data1/                             # DATA1: Domain-Specific Code Dataset
│   ├── main.py                        # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                     # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                        # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py            # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py               # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                    # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py      # Compute stars/keyword statistics
│   ├── compute_statistics.py          # Compute code statistics from JSONL analysis files
│   ├── rename.py                      # Rename repo directories to owner___repo format
│   ├── rename2.py                     # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                 # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py     # Export repo files to CSV grouped by keyword
│   ├── reporting/                     # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                    # Reporting entry point
│   │   ├── visualization.py           # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py          # Scan repo-level metadata
│   │   ├── code_file_stats.py         # File-level code statistics
│   │   ├── code_file_stats_fast.py    # Optimized file-level statistics
│   │   ├── stage_a_stats.py           # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py           # Stage B (clone/filter) statistics
│   │   └── join_insights.py           # Join and cross-analyze insights
│   └── README.md                      # DATA1 dataset documentation
│
├── data2/                             # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/        # README summarization pipeline
│   │   ├── pipeline.py                # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py   # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py  # Extract functions from repos
│   │   ├── schemas.py                 # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt   # Prompt for function extraction
│   │       └── readme_summary.txt     # Prompt for README summarization
│   ├── step22/                        # Function scoring, generation, alignment
│   │   ├── build.py                   # Build tree-sitter language parsers
│   │   ├── func_stat.py               # Extract functions using tree-sitter
│   │   ├── md_stat.py                 # Extract & save README summaries
│   │   ├── emb_qwen_func.py           # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py             # Score READMEs using Qwen embedding model
│   │   ├── function_req.py            # Filter functions by score threshold
│   │   ├── gemini_generation.py       # Generate docstrings using Gemini API
│   │   ├── alignment.py               # Align functions with generated docstrings
│   │   ├── prompt.txt                 # Prompt template for docstring generation
│   │   ├── depend_analysis.py         # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py    # Find functions missing scores
│   │   ├── folder_stat.py             # Repository folder statistics
│   │   ├── ppt.py                     # Visualization of alignment data
│   │   └── debug_parser.py            # Debug tree-sitter parser loading
│   └── README.md                      # DATA2 dataset documentation
│
└── data3/                             # DATA3: Programming Problems Generation Dataset
    ├── main.py                        # RepoAgent: generate docs for repos
    ├── gemini.py                      # Gemini API connectivity test
    ├── load_dataset.py                # Load and inspect datasets
    ├── instruct_generation.py         # Score functions for scientific relevance
    ├── extract_functions.py           # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py        # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py              # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py  # Generate problems using Gemini API
    ├── generate_problems_batch.py     # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py    # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                   # VLLM-based high-throughput inference
    ├── vllm_qwen_batch.py             # Qwen model batch inference via VLLM
    ├── show_pricing.py                # Display API pricing information
    ├── check_enhanced.py              # Validate enhanced dataset
    ├── check_index_distribution.py    # Check index distribution
    ├── check_match.py                 # Check data matching
    ├── check_relationship.py          # Check data relationships
    ├── is_sci_prompt.txt              # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt             # Prompt variant for scientific classification
    ├── score_prompt.txt               # Prompt: score function relevance
    ├── *.sh                           # Various shell scripts for batch processing
    └── README.md                      # DATA3 dataset documentation
```
## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance using an LLM (reads READMEs)
   - Clone relevant repos (shallow clone)
   - Filter to keep only code files
2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with ChemPile data
   - Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and stars statistics
5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations
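The content-hash deduplication in step 3 can be sketched as follows. This is an illustrative version, assuming SHA-256 over raw file bytes; the actual hash function and record handling in `merge_dataset.py` may differ.

```python
import hashlib
from pathlib import Path


def dedup_by_content_hash(paths):
    """Keep only the first file seen for each unique content hash.

    Illustrative sketch of the merge step's deduplication; the real
    script's keys and I/O layout may differ.
    """
    seen = set()
    kept = []
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

Hashing file bytes (rather than comparing paths or names) means identical files that appear both in the crawled repos and in ChemPile collapse to a single copy.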
### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code using tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries into the function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
   - Merge function data with the generated docstrings
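As an illustration of the filtering step, here is a minimal sketch of a combined-score filter. The field names (`func_score`, `md_score`), equal weights, and threshold are assumptions for the example, not the values used by `function_req.py`.

```python
def filter_functions(functions, func_weight=0.5, md_weight=0.5, threshold=0.6):
    """Keep functions whose weighted quality score clears the threshold.

    Hypothetical sketch: weights, score fields, and threshold are
    illustrative, not the project's actual configuration.
    """
    kept = []
    for fn in functions:
        func_score = fn.get("func_score")
        md_score = fn.get("md_score")
        if func_score is None or md_score is None:
            # Functions missing a score are dropped (cf. find_none_score_func.py)
            continue
        combined = func_weight * func_score + md_weight * md_score
        if combined >= threshold:
            kept.append({**fn, "combined_score": combined})
    return kept
```

Combining the function-level and README-level scores lets the filter prefer well-written functions that also come from well-documented repositories.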
### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context
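The budget monitoring mentioned in step 5 (and in the DATA2 pipeline) can be sketched as a simple circuit breaker. The class name, flat per-call cost, and trip condition below are illustrative assumptions rather than the actual logic in `generate_programming_problems.py`.

```python
class BudgetCircuitBreaker:
    """Stop issuing API calls once estimated spend would exceed a budget.

    Hedged sketch of the budget-monitoring idea; real scripts would
    estimate cost per request from token counts and model pricing.
    """

    def __init__(self, budget_usd, cost_per_call_usd):
        self.budget_usd = budget_usd
        self.cost_per_call_usd = cost_per_call_usd
        self.spent_usd = 0.0
        self.tripped = False

    def allow(self):
        # Trip (and stay tripped) once the next call would blow the budget.
        if self.spent_usd + self.cost_per_call_usd > self.budget_usd:
            self.tripped = True
        return not self.tripped

    def record(self):
        # Called after each successful API call to accumulate spend.
        self.spent_usd += self.cost_per_call_usd
```

Checking `allow()` before every request, rather than after, guarantees the budget is never exceeded even when many generation workers share one breaker.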
## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes

- Scripts contain hardcoded paths that must be updated for your environment.
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately.
- Large datasets require significant storage and compute resources.
- Most scripts support checkpoint/resume for long-running processes.
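The checkpoint/resume pattern referred to in the last note might look like the following JSONL-based sketch; the record shape, `id` field, and file layout are assumptions for illustration, and the actual scripts differ in detail.

```python
import json
from pathlib import Path


def process_with_resume(items, checkpoint_path, process):
    """Append one JSON line per finished item; skip ids already done.

    Illustrative checkpoint/resume sketch: on restart, completed ids are
    read back from the checkpoint file and skipped.
    """
    path = Path(checkpoint_path)
    done = set()
    if path.exists():
        with path.open() as f:
            for line in f:
                done.add(json.loads(line)["id"])
    with path.open("a") as f:
        for item in items:
            if item["id"] in done:
                continue
            result = process(item)
            f.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            f.flush()  # make progress durable for the next resume
```

Appending a line per item (instead of rewriting one big output file) means a crash mid-run loses at most the item in flight, which matters for multi-hour API jobs.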