SciCode
/

dataset-builder

DouDou committed on Feb 19

Commit

e5d8191

verified ·

1 Parent(s): ddd7e5c

Upload data1/README.md with huggingface_hub

data1/README.md +133 -0

data1/README.md ADDED

	@@ -0,0 +1,133 @@

+# DATA1: Domain-Specific Code Dataset
+## Dataset Overview
+DATA1 is a large-scale, domain-specific dataset of code samples from interdisciplinary fields such as biology, chemistry, and materials science. It is collected from GitHub repositories and organized into 178 domain topics, with over 1.1 billion lines of code in total.
+## Dataset Statistics
+- **Total Datasets**: 178 CSV files
+- **Total Data Size**: ~115 GB
+- **Total Lines of Code**: Over 1.1 billion lines
+- **Data Format**: CSV (Comma-Separated Values)
+- **Encoding**: UTF-8
+## Dataset Structure
+Each CSV file corresponds to a specific domain topic, with the naming format `dataset_{Topic}.csv`, where `{Topic}` is the domain keyword (e.g., Protein, Drug, Genomics).
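The naming format above can be turned into a pair of small helpers. This is a minimal sketch: the function names `topic_to_filename`, `filename_to_topic`, and `list_topics` are hypothetical and not part of the dataset, and the code assumes only the `dataset_{Topic}.csv` pattern described here.

```python
from pathlib import Path
from typing import List, Optional

PREFIX = "dataset_"
SUFFIX = ".csv"


def topic_to_filename(topic: str) -> str:
    """Build the file name for a topic, e.g. 'Protein' -> 'dataset_Protein.csv'."""
    return f"{PREFIX}{topic}{SUFFIX}"


def filename_to_topic(path) -> Optional[str]:
    """Recover the topic from a file name, or return None if it does not match."""
    name = Path(path).name
    if name.startswith(PREFIX) and name.endswith(SUFFIX):
        topic = name[len(PREFIX):-len(SUFFIX)]
        return topic or None
    return None


def list_topics(directory) -> List[str]:
    """List the topics for which a matching CSV file exists in `directory`."""
    candidates = Path(directory).glob(f"{PREFIX}*{SUFFIX}")
    topics = (filename_to_topic(p) for p in candidates)
    return sorted(t for t in topics if t)
```

Topic names that contain underscores (for example `Spatial_Transcriptomics`) pass through unchanged, because only the fixed prefix and suffix are stripped.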
+### Data Field Description
+Each CSV file contains the following fields:
+| Field Name | Type | Description |
+|------------|------|-------------|
+| `keyword` | String | Domain keyword used to identify the domain of the code sample |
+| `repo_name` | String | GitHub repository name (format: owner/repo) |
+| `file_path` | String | Relative path of the file in the repository |
+| `file_extension` | String | File extension (e.g., .py, .java, .cpp) |
+| `file_size` | Integer | File size in bytes |
+| `line_count` | Integer | Number of lines of code in the file |
+| `content` | String | Complete file content |
+| `language` | String | Programming language (e.g., Python, Java, C++) |
+## Domain Categories
+The dataset covers the following major domain categories:
+### Biology-Related
+- **Molecular Biology**: Protein, DNA, RNA, Gene, Enzyme, Receptor, Ligand
+- **Cell Biology**: Cell_biology, Single_cell, Cell_atlas, Organoid
+- **Genomics**: Genomics, Genotype, Phenotype, Epigenetics, Metagenomics
+- **Transcriptomics**: Transcriptomics, Spatial_Transcriptomics, Transcription, Translation
+- **Proteomics**: Proteomics, Protein_Protein_Interactions, Folding
+- **Metabolomics**: Metabolomics, Metabolic, Lipidomics, Glycomics
+- **Systems Biology**: System_biology, Signaling, Pathway, Networks
+### Chemistry-Related
+- **Computational Chemistry**: Computational_Chemistry, Quantum_Chemistry, DFT, QM_MM
+- **Medicinal Chemistry**: Drug, ADMET, QSAR, Docking, Lead_discovery, Lead_optimization
+- **Materials Chemistry**: Material, Crystal, Conformation, Chemical_space
+- **Reaction Chemistry**: Reaction, Kinetics, Mechanism, Redox
+### Medicine and Pharmacology
+- **Pharmacology**: Pharmacology, Pharmacokinetics, Pharmacogenomics, Pharmacogenetics
+- **Medicine**: Medicine, Disease, Diagnostics, Pathology, Vaccine
+- **Toxicology**: Toxicology, Biomarker, Marker
+### Computational Methods
+- **Machine Learning**: Transformer, GAN, VAE, Diffusion, Flow_matching, Reinforcement_learning
+- **Quantum Computing**: Quantum_mechanics, Quantum_biology, Electronic_structure
+- **Modeling Methods**: Modeling, Multi_scale_modeling, Agent_based_model, Stochastic_modeling
+- **Numerical Methods**: Monte_Carlo, Finite_element_method, Phase_field_technique
+### Other Specialized Fields
+- **Bioinformatics**: Bioinformatics, Cheminformatics, Next_generation_sequencing
+- **Bioengineering**: Bioengineering, Biotechnology, Biosensors
+- **Immunology**: Immunology, Antibody, Antigen, Antagonist
+- **Virology**: Viral, Pandemic, Pathogens, AMR (Antimicrobial Resistance)
+## Data Source
+The data is collected from open-source repositories on GitHub through the following process:
+1. **Keyword Search**: Search for relevant repositories on GitHub using domain-specific keywords
+2. **Repository Filtering**: Filter repositories based on relevance scores and code quality
+3. **File Extraction**: Extract code files from filtered repositories
+4. **Categorization**: Classify files into corresponding topic datasets based on keywords and domain characteristics
+## Dataset Characteristics
+1. **Wide Domain Coverage**: Covers multiple interdisciplinary fields including biology, chemistry, materials science, and medicine
+2. **Diverse Code Types**: Includes multiple programming languages such as Python, Java, C++, R, and MATLAB
+3. **Large Scale**: Over 1.1 billion lines of code with a total data size of approximately 115 GB
+4. **Structured Storage**: Each domain topic is stored independently as a CSV file for convenient on-demand usage
+5. **Rich Metadata**: Contains comprehensive metadata including repository information, file paths, and language types
+## Usage Guidelines
+### Data Loading
+```python
+import pandas as pd
+# Load dataset for a specific domain
+df = pd.read_csv('dataset_Protein.csv')
+# View basic dataset information
+print(f"Dataset size: {len(df)} files")
+print(f"Programming language distribution: {df['language'].value_counts()}")
+print(f"File type distribution: {df['file_extension'].value_counts()}")
+```
+### Data Filtering
+```python
+# Filter by programming language
+python_files = df[df['language'] == 'Python']
+# Filter by file size (e.g., files smaller than 100 KB, i.e. 100,000 bytes)
+small_files = df[df['file_size'] < 100000]
+# Filter by line count (more than 50 and fewer than 1,000 lines)
+medium_files = df[(df['line_count'] > 50) & (df['line_count'] < 1000)]
+```
+### Domain-Specific Analysis
+```python
+# Analyze code characteristics for a specific domain
+protein_df = pd.read_csv('dataset_Protein.csv')
+print(f"Number of code files in Protein domain: {len(protein_df)}")
+print(f"Average file size: {protein_df['file_size'].mean():.2f} bytes")
+print(f"Average line count: {protein_df['line_count'].mean():.2f} lines")
+```
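The per-domain averages above can be broken down by programming language with a single `groupby`. This is a sketch that assumes the `language`, `file_path`, `line_count`, and `file_size` columns from the field table. The function name `summarize_by_language` is hypothetical.

```python
import pandas as pd


def summarize_by_language(df: pd.DataFrame) -> pd.DataFrame:
    """Per-language file count, total lines, and mean file size in bytes."""
    summary = df.groupby("language").agg(
        files=("file_path", "count"),       # number of files with a path
        total_lines=("line_count", "sum"),  # lines of code per language
        mean_size=("file_size", "mean"),    # average file size in bytes
    )
    # Largest languages first.
    return summary.sort_values("total_lines", ascending=False)
```

For example, `summarize_by_language(protein_df).head(10)` would list the ten languages that contribute the most lines to the Protein topic. Rows with a missing `language` value are dropped by `groupby` by default.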
+## Important Notes
+1. **File Size**: Some dataset files are large (up to several GB); be mindful of memory usage when loading them
+2. **Encoding**: All files use UTF-8 encoding; ensure proper handling of special characters if encountered
+3. **Data Quality**: Data is sourced from public repositories and may vary in code quality; preprocessing is recommended before use
+4. **License Compliance**: Please comply with the license requirements of the original repositories when using the data
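For the memory concern in the first note, one option is to stream a file in chunks rather than load it whole. The sketch below uses the `chunksize` and `usecols` parameters of `pandas.read_csv`. The function name, the default chunk size, and the sample data are illustrative assumptions, not recommendations from the dataset authors.

```python
import io

import pandas as pd


def count_lines_by_language(source, chunksize: int = 10_000) -> dict:
    """Sum `line_count` per language while holding one chunk in memory at a time."""
    totals = {}
    reader = pd.read_csv(
        source,
        usecols=["language", "line_count"],  # skip the large `content` column
        chunksize=chunksize,                 # rows per chunk
        encoding="utf-8",
    )
    for chunk in reader:
        sums = chunk.groupby("language")["line_count"].sum()
        for language, lines in sums.items():
            totals[language] = totals.get(language, 0) + int(lines)
    return totals


# Tiny in-memory sample standing in for a multi-GB file (values are made up).
SAMPLE = io.StringIO(
    "language,line_count\n"
    "Python,10\n"
    "R,5\n"
    "Python,20\n"
)
totals = count_lines_by_language(SAMPLE, chunksize=2)
print(totals)
```

With a real file, the call would look like `count_lines_by_language('dataset_Protein.csv')`. The same loop structure works for other streaming tasks, such as writing a filtered subset to a new file chunk by chunk.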