DouDou committed · Commit e5d8191 · verified · 1 Parent(s): ddd7e5c

Upload data1/README.md with huggingface_hub

  1. data1/README.md +133 -0
data1/README.md ADDED
@@ -0,0 +1,133 @@
# DATA1: Domain-Specific Code Dataset

## Dataset Overview

DATA1 is a large-scale, domain-specific code dataset of code samples from interdisciplinary fields such as biology, chemistry, materials science, and related areas. It is collected and organized from GitHub repositories and covers 178 domain topics with over 1.1 billion lines of code.

## Dataset Statistics

- **Total Datasets**: 178 CSV files
- **Total Data Size**: ~115 GB
- **Total Lines of Code**: Over 1.1 billion lines
- **Data Format**: CSV (Comma-Separated Values)
- **Encoding**: UTF-8

## Dataset Structure

Each CSV file corresponds to one domain topic and is named `dataset_{Topic}.csv`, where `{Topic}` is the domain keyword (e.g., Protein, Drug, Genomics).

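The naming convention makes it possible to recover the topic from a file name and to index a local copy of the dataset by topic. The following is a minimal sketch using only the standard library; the helper names `topic_from_filename` and `index_datasets` are hypothetical and not part of the dataset.

```python
from pathlib import Path

# Prefix taken from the documented naming format `dataset_{Topic}.csv`
PREFIX = "dataset_"


def topic_from_filename(path):
    """Return the topic keyword encoded in a `dataset_{Topic}.csv` file name."""
    p = Path(path)
    if p.suffix != ".csv" or not p.stem.startswith(PREFIX):
        raise ValueError(f"not a DATA1 dataset file name: {p.name}")
    return p.stem[len(PREFIX):]


def index_datasets(directory):
    """Map each topic to its CSV path for every matching file in `directory`."""
    paths = sorted(Path(directory).glob(f"{PREFIX}*.csv"))
    return {topic_from_filename(p): p for p in paths}
```

For example, `index_datasets("data1")` would return a dictionary keyed by topic, assuming the CSV files have been downloaded into a local `data1` directory.
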
### Data Field Description

Each CSV file contains the following fields:

| Field Name | Type | Description |
|------------|------|-------------|
| `keyword` | String | Domain keyword used to identify the domain of the code sample |
| `repo_name` | String | GitHub repository name (format: owner/repo) |
| `file_path` | String | Relative path of the file in the repository |
| `file_extension` | String | File extension (e.g., .py, .java, .cpp) |
| `file_size` | Integer | File size in bytes |
| `line_count` | Integer | Number of lines of code in the file |
| `content` | String | Complete file content |
| `language` | String | Programming language (e.g., Python, Java, C++) |

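Before loading a multi-gigabyte file, it can be worth checking that its header matches the fields listed above. The sketch below reads only the first row of a file; the function names are hypothetical, and the column order in `EXPECTED_FIELDS` simply follows the table and is not guaranteed to match the files.

```python
import csv

# Field names as documented in the table above
EXPECTED_FIELDS = [
    "keyword", "repo_name", "file_path", "file_extension",
    "file_size", "line_count", "content", "language",
]


def read_header(csv_path):
    """Read only the first row of a CSV file, without loading the data."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # An empty file yields an empty header instead of raising StopIteration
        return next(csv.reader(handle), [])


def missing_fields(header):
    """Return the documented fields that are absent from a header row."""
    present = set(header)
    return [name for name in EXPECTED_FIELDS if name not in present]
```

Calling `missing_fields(read_header('dataset_Protein.csv'))` returns an empty list when every documented field is present.
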
## Domain Categories

The dataset covers the following major domain categories:

### Biology-Related
- **Molecular Biology**: Protein, DNA, RNA, Gene, Enzyme, Receptor, Ligand
- **Cell Biology**: Cell_biology, Single_cell, Cell_atlas, Organoid
- **Genomics**: Genomics, Genotype, Phenotype, Epigenetics, Metagenomics
- **Transcriptomics**: Transcriptomics, Spatial_Transcriptomics, Transcription, Translation
- **Proteomics**: Proteomics, Protein_Protein_Interactions, Folding
- **Metabolomics**: Metabolomics, Metabolic, Lipidomics, Glycomics
- **Systems Biology**: System_biology, Signaling, Pathway, Networks

### Chemistry-Related
- **Computational Chemistry**: Computational_Chemistry, Quantum_Chemistry, DFT, QM_MM
- **Medicinal Chemistry**: Drug, ADMET, QSAR, Docking, Lead_discovery, Lead_optimization
- **Materials Chemistry**: Material, Crystal, Conformation, Chemical_space
- **Reaction Chemistry**: Reaction, Kinetics, Mechanism, Redox

### Medicine and Pharmacology
- **Pharmacology**: Pharmacology, Pharmacokinetics, Pharmacogenomics, Pharmacogenetics
- **Medicine**: Medicine, Disease, Diagnostics, Pathology, Vaccine
- **Toxicology**: Toxicology, Biomarker, Marker

### Computational Methods
- **Machine Learning**: Transformer, GAN, VAE, Diffusion, Flow_matching, Reinforcement_learning
- **Quantum Computing**: Quantum_mechanics, Quantum_biology, Electronic_structure
- **Modeling Methods**: Modeling, Multi_scale_modeling, Agent_based_model, Stochastic_modeling
- **Numerical Methods**: Monte_Carlo, Finite_element_method, Phase_field_technique

### Other Specialized Fields
- **Bioinformatics**: Bioinformatics, Cheminformatics, Next_generation_sequencing
- **Bioengineering**: Bioengineering, Biotechnology, Biosensors
- **Immunology**: Immunology, Antibody, Antigen, Antagonist
- **Virology**: Viral, Pandemic, Pathogens, AMR (Antimicrobial Resistance)

## Data Source

The data is collected from open-source repositories on GitHub through the following process:

1. **Keyword Search**: Search GitHub for relevant repositories using domain-specific keywords
2. **Repository Filtering**: Filter repositories based on relevance scores and code quality
3. **File Extraction**: Extract code files from the filtered repositories
4. **Categorization**: Assign files to the corresponding topic datasets based on keywords and domain characteristics

## Dataset Characteristics

1. **Wide Domain Coverage**: Covers multiple interdisciplinary fields, including biology, chemistry, materials science, and medicine
2. **Diverse Code Types**: Includes multiple programming languages, such as Python, Java, C++, R, and MATLAB
3. **Large Scale**: Over 1.1 billion lines of code with a total data size of approximately 115 GB
4. **Structured Storage**: Each domain topic is stored as a separate CSV file, so topics can be loaded on demand
5. **Rich Metadata**: Each record carries the repository name, file path, file extension, file size, line count, and programming language

## Usage Guidelines

### Data Loading

```python
import pandas as pd

# Load the dataset for a specific domain
df = pd.read_csv('dataset_Protein.csv')

# View basic dataset information
print(f"Dataset size: {len(df)} files")
print("Programming language distribution:")
print(df['language'].value_counts())
print("File type distribution:")
print(df['file_extension'].value_counts())
```

### Data Filtering

```python
# Assumes `df` was loaded as shown under Data Loading

# Filter by programming language
python_files = df[df['language'] == 'Python']

# Filter by file size (e.g., files smaller than 100 KB)
small_files = df[df['file_size'] < 100_000]

# Filter by line count (more than 50 and fewer than 1,000 lines)
medium_files = df[(df['line_count'] > 50) & (df['line_count'] < 1000)]
```

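Beyond column-based filters, a preprocessing step that is often useful for repository-derived code is dropping files whose content is byte-for-byte identical. The sketch below is an assumption about what such a step could look like, not part of the dataset's own pipeline; `drop_exact_duplicates` is a hypothetical helper that works on any iterable of row dictionaries with a `content` key.

```python
import hashlib


def drop_exact_duplicates(rows):
    """Yield rows whose `content` has not been seen before (first occurrence wins)."""
    seen = set()
    for row in rows:
        # Hash the content so only fixed-size digests are kept in memory
        digest = hashlib.sha256(row["content"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield row
```

The helper accepts rows from `csv.DictReader` or from `df.to_dict("records")`. For a DataFrame that already fits in memory, `df.drop_duplicates(subset="content")` achieves the same result directly in pandas.
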
### Domain-Specific Analysis

```python
import pandas as pd

# Analyze code characteristics for a specific domain
protein_df = pd.read_csv('dataset_Protein.csv')
print(f"Number of code files in Protein domain: {len(protein_df)}")
print(f"Average file size: {protein_df['file_size'].mean():.2f} bytes")
print(f"Average line count: {protein_df['line_count'].mean():.2f} lines")
```

## Important Notes

1. **File Size**: Some dataset files are large (up to several GB), so be mindful of memory usage when loading them
2. **Encoding**: All files use UTF-8 encoding; make sure special characters are handled correctly
3. **Data Quality**: The data comes from public repositories and varies in code quality, so preprocessing is recommended before use
4. **License Compliance**: Comply with the license requirements of the original repositories when using the data

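To keep memory usage low with the larger files mentioned in the File Size note, a file can be streamed row by row instead of being loaded all at once. The following is a minimal sketch using the standard-library `csv` module as an alternative to pandas; `count_languages` is a hypothetical helper, and the raised field size limit is an assumption that some `content` cells exceed the module's default limit.

```python
import csv
from collections import Counter


def count_languages(csv_path):
    """Stream a dataset CSV row by row and count files per language."""
    # `content` cells hold whole source files, which may exceed the csv
    # module's default field size limit, so raise it before reading
    csv.field_size_limit(2**31 - 1)
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            counts[row["language"]] += 1
    return counts
```

Only one row is held in memory at a time, so peak memory is bounded by the largest single record rather than by the file size. In pandas, passing `chunksize` or `usecols` to `pd.read_csv` serves a similar purpose.
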