# Dataset Builder
This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.
## Project Structure
```
dataset_builder/
├── README.md                        # This file
│
├── data1/                           # DATA1: Domain-Specific Code Dataset
│   ├── main.py                      # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                   # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                      # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py          # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py             # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                  # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py    # Compute stars/keyword statistics
│   ├── compute_statistics.py        # Compute code statistics from JSONL analysis files
│   ├── rename.py                    # Rename repo directories to owner___repo format
│   ├── rename2.py                   # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml               # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py   # Export repo files to CSV grouped by keyword
│   ├── reporting/                   # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                  # Reporting entry point
│   │   ├── visualization.py         # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py        # Scan repo-level metadata
│   │   ├── code_file_stats.py       # File-level code statistics
│   │   ├── code_file_stats_fast.py  # Optimized file-level statistics
│   │   ├── stage_a_stats.py         # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py         # Stage B (clone/filter) statistics
│   │   └── join_insights.py         # Join and cross-analyze insights
│   └── README.md                    # DATA1 dataset documentation
│
├── data2/                           # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/      # README summarization pipeline
│   │   ├── pipeline.py              # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py # Extract functions from repos
│   │   ├── schemas.py               # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt # Prompt for function extraction
│   │       └── readme_summary.txt   # Prompt for README summarization
│   ├── step22/                      # Function scoring, generation, alignment
│   │   ├── build.py                 # Build tree-sitter language parsers
│   │   ├── func_stat.py             # Extract functions using tree-sitter
│   │   ├── md_stat.py               # Extract & save README summaries
│   │   ├── emb_qwen_func.py         # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py           # Score READMEs using Qwen embedding model
│   │   ├── function_req.py          # Filter functions by score threshold
│   │   ├── gemini_generation.py     # Generate docstrings using Gemini API
│   │   ├── alignment.py             # Align functions with generated docstrings
│   │   ├── prompt.txt               # Prompt template for docstring generation
│   │   ├── depend_analysis.py       # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py  # Find functions missing scores
│   │   ├── folder_stat.py           # Repository folder statistics
│   │   ├── ppt.py                   # Visualization of alignment data
│   │   └── debug_parser.py          # Debug tree-sitter parser loading
│   └── README.md                    # DATA2 dataset documentation
│
└── data3/                           # DATA3: Programming Problems Generation Dataset
    ├── main.py                      # RepoAgent: generate docs for repos
    ├── gemini.py                    # Gemini API connectivity test
    ├── load_dataset.py              # Load and inspect datasets
    ├── instruct_generation.py       # Score functions for scientific relevance
    ├── extract_functions.py         # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py      # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py            # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py # Generate problems using Gemini API
    ├── generate_problems_batch.py   # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py  # Problem generation via OpenAI API
    ├── enrich_programming_problems.py # Enrich problems with source code context
    ├── vllm_high.py                 # VLLM-based high-throughput inference
    ├── vllm_qwen_batch.py           # Qwen model batch inference via VLLM
    ├── show_pricing.py              # Display API pricing information
    ├── check_enhanced.py            # Validate enhanced dataset
    ├── check_index_distribution.py  # Check index distribution
    ├── check_match.py               # Check data matching
    ├── check_relationship.py        # Check data relationships
    ├── is_sci_prompt.txt            # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt           # Prompt variant for scientific classification
    ├── score_prompt.txt             # Prompt: score function relevance
    ├── *.sh                         # Various shell scripts for batch processing
    └── README.md                    # DATA3 dataset documentation
```
## Dataset Building Pipelines
### DATA1: Domain-Specific Code Dataset
**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.
**Pipeline** (executed in order):
1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
- Expand scientific keywords using LLM
- Search GitHub API for repositories matching keywords
- Check relevance using LLM (reads READMEs)
- Clone relevant repos (shallow clone)
- Filter to keep only code files
2. **External Data** (`download_dataset.py`)
- Download ChemPile code dataset from HuggingFace
3. **Merge & Deduplicate** (`merge_dataset.py`)
- Merge crawled repos with ChemPile data
- Deduplicate by content hash
4. **Analysis** (`analysis.py`, `compute_*.py`)
- Analyze code metrics (lines, comments, functions, tokens)
- Compute keyword and stars statistics
5. **Export** (`scripts/export_files_to_csv.py`)
- Export final dataset to CSV files grouped by keyword
6. **Reporting** (`reporting/`)
- Generate statistical reports and visualizations
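The search in step 1 can be sketched as follows. This is a minimal illustration, not the actual `main.py` logic: the names `build_query` and `search_repos`, the `min_stars` qualifier, and the `GITHUB_TOKEN` environment variable are assumptions. The real pipeline first expands keywords with an LLM and paginates through all results.

```python
import os
import requests

GITHUB_SEARCH_API = "https://api.github.com/search/repositories"

def build_query(keyword, min_stars=10):
    # Compose a GitHub search qualifier string, e.g. "molecular dynamics stars:>=10"
    return f"{keyword} stars:>={min_stars}"

def search_repos(keyword, token=None, per_page=5):
    """Return full names (owner/repo) of the top repositories matching a keyword."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(
        GITHUB_SEARCH_API,
        params={"q": build_query(keyword), "per_page": per_page},
        headers=headers,
        timeout=30,
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

if __name__ == "__main__":
    # Requires network access and, for higher rate limits, a GitHub token.
    print(search_repos("molecular dynamics", token=os.environ.get("GITHUB_TOKEN")))
```

Unauthenticated search is rate-limited to a handful of requests per minute, which is why the token is worth configuring even for small crawls.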
### DATA2: Code-Documentation Alignment Dataset
**Goal**: Generate high-quality docstrings for scientific code functions.
**Pipeline** (executed in order):
1. **README Summarization** (`instruction_generation/`)
- Summarize repository READMEs using LLM
- Extract structured information from repos
2. **Function Extraction** (`step22/func_stat.py`)
- Parse code using tree-sitter to extract functions
- Multi-language support (Python, C, C++, Java, Go, Rust, Julia)
3. **README Processing** (`step22/md_stat.py`)
- Copy README summaries to function dataset directories
4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
- Score function quality using Qwen embedding model
- Score README quality using Qwen embedding model
5. **Function Filtering** (`step22/function_req.py`)
- Filter functions by combined quality score
6. **Docstring Generation** (`step22/gemini_generation.py`)
- Generate docstrings using Gemini API
- Budget monitoring with circuit breaker
- Checkpoint/resume support
7. **Alignment** (`step22/alignment.py`)
- Merge function data with generated docstrings
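The extraction in step 2 can be illustrated with a single-language stand-in. The real `func_stat.py` uses tree-sitter precisely so it can also handle C, C++, Java, Go, Rust, and Julia; the sketch below covers only Python via the stdlib `ast` module, and the function name is hypothetical.

```python
import ast

def extract_functions(source):
    """Return (name, line number, source text) for every function definition.

    Python-only stand-in for the tree-sitter extraction in step22/func_stat.py.
    """
    funcs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            funcs.append((node.name, node.lineno, ast.get_source_segment(source, node)))
    return funcs
```

Methods inside classes are picked up too, since `ast.walk` visits nested nodes; a tree-sitter version would instead run a per-language query over the concrete syntax tree.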
### DATA3: Programming Problems Generation Dataset
**Goal**: Generate programming problems inspired by scientific code.
**Pipeline** (executed in order):
1. **Documentation Generation** (`main.py`)
- Use RepoAgent to generate documentation for repositories
2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
- Extract individual functions from enhanced dataset
3. **Scientific Relevance Scoring** (`instruct_generation.py`)
- Score functions for scientific computing relevance
- Use `is_sci_prompt.txt` and `score_prompt.txt` as prompts
4. **Dataset Merge** (`merge_datasets.py`)
- Merge function scores with source code data
5. **Problem Generation** (`generate_programming_problems.py`)
- Generate programming problems using Gemini API
- Filter by relevance score
- Budget monitoring and cost control
6. **Enrichment** (`enrich_programming_problems.py`)
- Enrich generated problems with source code context
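The budget monitoring mentioned in both generation stages can be sketched as a small circuit breaker. The class below is a hypothetical simplification (real cost accounting is per-token and provider-specific); spend is tracked in integer cents to avoid floating-point drift.

```python
class BudgetBreaker:
    """Refuse further API calls once the next call would exceed the budget."""

    def __init__(self, budget_cents, cost_per_call_cents):
        self.budget = budget_cents
        self.cost_per_call = cost_per_call_cents
        self.spent = 0

    def allow(self):
        # Check *before* the call so we never overshoot the cap.
        return self.spent + self.cost_per_call <= self.budget

    def record(self):
        self.spent += self.cost_per_call
```

A generation loop then becomes `while breaker.allow(): generate(); breaker.record()`, stopping cleanly when the cap is reached instead of mid-batch.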
## Dependencies
### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`
### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)
### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)
## Notes
- Scripts contain hardcoded paths; update them for your environment before running
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
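The checkpoint/resume pattern noted above can be sketched as an append-only JSONL ledger of completed item ids. Names here are hypothetical; individual scripts implement their own variants of this idea.

```python
import json
import os

def process_all(items, checkpoint_path, work):
    """Run work(item_id, payload) over items, skipping ids already checkpointed.

    Each completed id is appended as one JSON line, so a crash loses at most
    the in-flight item and a rerun resumes where it left off.
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = {json.loads(line)["id"] for line in f if line.strip()}
    with open(checkpoint_path, "a") as f:
        for item_id, payload in items:
            if item_id in done:
                continue
            work(item_id, payload)
            f.write(json.dumps({"id": item_id}) + "\n")
            f.flush()  # persist progress immediately
```

Appending one line per item keeps the checkpoint cheap to update and trivially mergeable, at the cost of re-reading the whole ledger on startup.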