## Practical Recommendations
|
|
### **Structure of data in a HuggingFace dataset**
|
|
#### Datasets, sub-datasets, splits
|
|
* A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have splits, everything goes into 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains multiple sub-datasets:
    * dataset1 # all stability measurements
    * dataset2 # high-quality folding stabilities
    * dataset3 # ΔG measurements
    * dataset3_single # ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits
    * dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits
* To load a specific sub-dataset:

```python
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
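
The non-overlap requirement between splits can be checked directly. A minimal sketch, assuming each split exposes a list of unique observation identifiers (the `splits_overlap` helper and the id values are illustrative, not part of the `datasets` API):

```python
def splits_overlap(**splits):
    """Return the set of ids shared between any two splits (empty set = no leakage)."""
    shared = set()
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared |= set(splits[a]) & set(splits[b])
    return shared

# Toy example: 'valid' and 'test' share id 5, which would be a leakage bug.
overlap = splits_overlap(train = [1, 2, 3], valid = [4, 5], test = [5, 6])
```

With a loaded `datasets.DatasetDict`, the same check could be run on an identifier column, e.g. `splits_overlap(train = dataset["train"]["id"], test = dataset["test"]["id"])`.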

#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
First load the dataset locally, then push it to the hub:

```python
import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

#### Example: train/valid/test split .csv files

Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
Load the three splits into one dataset and push it to the hub:

```python
import datasets
dataset = datasets.load_dataset(
    'csv',
    data_dir = "/tmp",
    data_files = {
        'train': 'train.csv',
        'valid': 'valid.csv',
        'test': 'test.csv'},
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

#### Example: sub-datasets

If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name:

```python
import datasets
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')
```
This will create the following files in the repo:

```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```

To load these datasets from HuggingFace:

```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')
```

|
|
### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:

* Identifier columns
  * sequential key
    * For example: `[1, 2, 3, ...]`
  * primary key
    * A single column that uniquely identifies each row
    * Distinct for every row
    * No missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * A set of columns that together uniquely identify each row
    * Either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
  * additional/foreign key identifiers
    * Identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called "long") has one measurement per row
    * Multiple columns can give details for each measurement, including type, units, and metadata
    * Often good for certain data science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates object identifier columns for each measurement
  * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
    * Each measurement is typically a single column
    * More compact, i.e. no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows
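
The two layouts are interconvertible with a reshape. A minimal sketch using pandas, with illustrative column names:

```python
import pandas as pd

# Array ("wide") layout: one object per row, one column per measurement.
wide = pd.DataFrame({
    "structure_id": ["1ABC", "2XYZ"],
    "dG": [1.2, -0.4],
    "tm": [55.0, 61.5]})

# Tidy ("long") layout: one measurement per row, with key/value columns.
tidy = wide.melt(
    id_vars = "structure_id",
    var_name = "measurement",
    value_name = "value")

# And back: pivot the measurement key out into separate columns.
wide_again = tidy.pivot(
    index = "structure_id",
    columns = "measurement",
    values = "value").reset_index()
```

In the tidy layout, extra columns such as `units` or `assay_date` could be added per measurement without changing the schema.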
|
|
#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use uncompressed, plain-text formats
  * Easier to analyze computationally
  * The whole dataset will be compressed for serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
|
#### Computational data formats

* On-disk formats
  * .parquet
    * Column-oriented (only the columns that are needed must be loaded; easier to compress)
    * Robust readers/writers from Apache Arrow for Python, R, etc.
  * Arrow Table
    * In-memory format closely aligned with the on-disk parquet format
    * Native format of datasets stored with the `datasets` Python package
  * tab/comma separated table (.tsv/.csv)
    * Prefer tab-separated: parsing is more consistent because values rarely need escaping
    * Widely used row-oriented text format for storing tabular data to disk
    * Does not store data types, so loading into Python/R often needs custom conversion/QC code
    * Can be compressed on disk, but row-oriented, so less compressible than .parquet
  * .pickle / .RData
    * Language-specific serialization of complex data structures
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
* In-memory formats
  * R `data.frame`/`dplyr::tibble`
    * Widely used format for R data science
    * Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas `DataFrame`
    * Widely used for Python data science
    * Not especially fast out of the box for large-scale data manipulation
  * Python numpy array / R matrix
    * Uses a single data type for all data
    * Useful for efficient matrix manipulation
  * Python PyTorch dataset
    * Format specifically geared for loading data for PyTorch deep learning
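
Moving between the tabular and matrix in-memory formats is a one-liner in pandas. A minimal sketch, with illustrative column names:

```python
import numpy as np
import pandas as pd

# A small mixed-type table: identifier column plus numeric measurements.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0]})

# Select only the numeric measurement columns, then convert to a
# homogeneous numpy array (single dtype) for matrix-based workflows.
matrix = df[["x", "y"]].to_numpy()
```

Note the identifier column must be dropped first; otherwise numpy falls back to a generic `object` dtype and loses the efficiency benefits.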
|
|
#### Recommendations

* On disk
  * For small, configuration-level tables, use .tsv
  * For large data, use .parquet
    * Smaller than .csv/.tsv
    * Robust open-source libraries in every major language can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use `dplyr::tibble` / pandas `DataFrame` for data science tables
  * Use numpy arrays / PyTorch datasets for machine learning