## Practical Recommendations
|
|
### **Structure of data in a HuggingFace dataset**
|
|
#### Datasets, sub-datasets, splits
|
|
* A HuggingFace dataset repository can contain multiple sub-datasets, e.g. at different filter/stringency levels.
* Each sub-dataset has one or more splits, typically ('train', 'validate', 'test'). If the data does not have splits, everything goes into 'train'.
* The data in different splits of a single sub-dataset should be non-overlapping.
* Example:
  * The [MegaScale](https://huggingface.co/datasets/RosettaCommons/MegaScale) repository contains multiple sub-datasets:
    * dataset1 # all stability measurements
    * dataset2 # high-quality folding stabilities
    * dataset3 # ΔG measurements
    * dataset3_single # ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits
    * dataset3_single_cv # 5-fold cross-validation of ΔG measurements of single-point mutants, with ThermoMPNN (Dieckhaus et al., 2024) splits
* To load a specific sub-dataset:

```python
datasets.load_dataset(path = "RosettaCommons/MegaScale", name = "dataset1", data_dir = "dataset1")
```
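
The non-overlap requirement between splits can be checked directly. A minimal sketch, assuming each split exposes a list of unique observation identifiers (the `splits_overlap` helper and the id values are illustrative, not part of the `datasets` API):

```python
def splits_overlap(**splits):
    """Return the set of ids shared between any two splits (empty set = no leakage)."""
    shared = set()
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared |= set(splits[a]) & set(splits[b])
    return shared

# Toy example: 'valid' and 'test' share id 5, which would be a leakage bug.
overlap = splits_overlap(train = [1, 2, 3], valid = [4, 5], test = [5, 6])
```

With a loaded `datasets.DatasetDict`, the same check could be run on an identifier column, e.g. `splits_overlap(train = dataset["train"]["id"], test = dataset["test"]["id"])`.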

#### Example: One .csv file dataset

One table named `outcomes.csv` is to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
First load the dataset locally, then push it to the hub:

```python
import datasets
dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

#### Example: train/valid/test split .csv files

Three tables `train.csv`, `valid.csv`, and `test.csv` are to be pushed to the HuggingFace dataset repository `maomlab/example_dataset`.
Load the three splits into one dataset and push it to the hub:

```python
import datasets
dataset = datasets.load_dataset(
    'csv',
    data_dir = "/tmp",
    data_files = {
        'train': 'train.csv',
        'valid': 'valid.csv',
        'test': 'test.csv'},
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```
This will create the following files in the repo:

```
data/
    train-00000-of-00001.parquet
    valid-00000-of-00001.parquet
    test-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  - name: valid
    num_bytes: 64
    num_examples: 4
  - name: test
    num_bytes: 64
    num_examples: 4
  download_size: 3996
  dataset_size: 192
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  - split: valid
    path: data/valid-*
  - split: test
    path: data/test-*
```

To load these data from HuggingFace:

```python
dataset = datasets.load_dataset("maomlab/example_dataset")
```

#### Example: sub-datasets

If you have different related datasets (`dataset1.csv`, `dataset2.csv`, `dataset3.csv`) that should go into a single repository but contain different types of data, so they aren't just splits of the same dataset, then load each dataset separately and push it to the hub with a given config name:

```python
import datasets
dataset1 = datasets.load_dataset('csv', data_files = '/tmp/dataset1.csv', keep_in_memory = True)
dataset2 = datasets.load_dataset('csv', data_files = '/tmp/dataset2.csv', keep_in_memory = True)
dataset3 = datasets.load_dataset('csv', data_files = '/tmp/dataset3.csv', keep_in_memory = True)

dataset1.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset1', data_dir = 'dataset1/data')
dataset2.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset2', data_dir = 'dataset2/data')
dataset3.push_to_hub(repo_id = "maomlab/example_dataset", config_name = 'dataset3', data_dir = 'dataset3/data')
```
This will create the following files in the repo:

```
dataset1/
    data/
        train-00000-of-00001.parquet
dataset2/
    data/
        train-00000-of-00001.parquet
dataset3/
    data/
        train-00000-of-00001.parquet
```

and add the following to the header of `README.md`:

```yaml
dataset_info:
- config_name: dataset1
  features:
  - name: id
    dtype: int64
  - name: value1
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset2
  features:
  - name: id
    dtype: int64
  - name: value2
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
- config_name: dataset3
  features:
  - name: id
    dtype: int64
  - name: value3
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1344
  dataset_size: 64
configs:
- config_name: dataset1
  data_files:
  - split: train
    path: dataset1/data/train-*
- config_name: dataset2
  data_files:
  - split: train
    path: dataset2/data/train-*
- config_name: dataset3
  data_files:
  - split: train
    path: dataset3/data/train-*
```

To load these datasets from HuggingFace:

```python
dataset1 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset1', data_dir = 'dataset1')
dataset2 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset2', data_dir = 'dataset2')
dataset3 = datasets.load_dataset("maomlab/example_dataset", name = 'dataset3', data_dir = 'dataset3')
```

|
|
### **Format of a dataset**

A dataset should consist of a single table where each row is a single observation.
The columns should follow typical database design guidelines:

* Identifier columns
  * sequential key
    * For example: `[1, 2, 3, ...]`
  * primary key
    * A single column that uniquely identifies each row
    * Distinct for every row
    * No missing values
    * For example, for a dataset of protein structures from the Protein Data Bank, the PDB ID is the primary key
  * composite key
    * A set of columns that together uniquely identify each row
    * Either hierarchical or complementary ids that characterize the observation
    * For example, for an observation of mutations, (`structure_id`, `residue_id`, `mutation_aa`) is a unique identifier
  * additional/foreign key identifiers
    * Identifiers that link the observation with other data
    * For example:
      * for compounds identified by PubChem SubstanceID, the ZINC ID for the compound could be a foreign key
      * the FDA drug name or the IUPAC substance name
* Tidy key/value columns
  * [Tidy vs array data](https://vita.had.co.nz/papers/tidy-data.pdf)
  * tidy data (sometimes called "long") has one measurement per row
    * Multiple columns can give details for each measurement, including type, units, and metadata
    * Often good for certain data science workflows (e.g. tidyverse/dplyr)
    * Can handle a variable number of measurements per object
    * Duplicates object identifier columns for each measurement
  * array data (sometimes called "wide") has one object per row and multiple measurements as different columns
    * Each measurement is typically a single column
    * More compact, i.e. no duplication of identifier columns
    * Good for certain ML/matrix-based computational workflows
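
The two layouts are interconvertible with a reshape. A minimal sketch using pandas, with illustrative column names:

```python
import pandas as pd

# Array ("wide") layout: one object per row, one column per measurement.
wide = pd.DataFrame({
    "structure_id": ["1ABC", "2XYZ"],
    "dG": [1.2, -0.4],
    "tm": [55.0, 61.5]})

# Tidy ("long") layout: one measurement per row, with key/value columns.
tidy = wide.melt(
    id_vars = "structure_id",
    var_name = "measurement",
    value_name = "value")

# And back: pivot the measurement key out into separate columns.
wide_again = tidy.pivot(
    index = "structure_id",
    columns = "measurement",
    values = "value").reset_index()
```

In the tidy layout, extra columns such as `units` or `assay_date` could be added per measurement without changing the schema.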
|
|
#### Molecular formats

* Store molecular structures in standard text formats
  * protein structure: PDB, mmCIF, ModelCIF
  * small molecule: SMILES, InChI
* Use uncompressed, plain-text formats
  * Easier to analyze computationally
  * The whole dataset will be compressed for serialization anyway
* Filtering / standardization / sanitization
  * Be clear about the methods used to process the molecular data
  * Be especially careful with inferred aspects of the data:
    * protonation states
    * salt form and stereochemistry for small molecules
    * data missingness, including unstructured loops for proteins
* Tools
  * MolVS is useful for small-molecule sanitization
|
#### Computational data formats

* On-disk formats
  * .parquet
    * Column-oriented (only the columns that are needed must be loaded; easier to compress)
    * Robust readers/writers from Apache Arrow for Python, R, etc.
  * Arrow Table
    * In-memory format closely aligned with the on-disk parquet format
    * Native format of datasets stored with the `datasets` Python package
  * tab/comma separated table (.tsv/.csv)
    * Prefer tab-separated: parsing is more consistent because values rarely need escaping
    * Widely used row-oriented text format for storing tabular data to disk
    * Does not store data types, so loading into Python/R often needs custom conversion/QC code
    * Can be compressed on disk, but row-oriented, so less compressible than .parquet
  * .pickle / .RData
    * Language-specific serialization of complex data structures
    * Often very fast to read/write, but may not be robust across language/OS versions
    * Not easily interoperable across programming languages
* In-memory formats
  * R `data.frame`/`dplyr::tibble`
    * Widely used format for R data science
    * Fast out of the box for tidyverse data manipulation and split-apply-combine workflows
  * Python pandas `DataFrame`
    * Widely used for Python data science
    * Not especially fast out of the box for large-scale data manipulation
  * Python numpy array / R matrix
    * Uses a single data type for all data
    * Useful for efficient matrix manipulation
  * Python PyTorch dataset
    * Format specifically geared for loading data for PyTorch deep learning
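
Moving between the tabular and matrix in-memory formats is a one-liner in pandas. A minimal sketch, with illustrative column names:

```python
import numpy as np
import pandas as pd

# A small mixed-type table: identifier column plus numeric measurements.
df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "x": [1.0, 2.0, 3.0],
    "y": [4.0, 5.0, 6.0]})

# Select only the numeric measurement columns, then convert to a
# homogeneous numpy array (single dtype) for matrix-based workflows.
matrix = df[["x", "y"]].to_numpy()
```

Note the identifier column must be dropped first; otherwise numpy falls back to a generic `object` dtype and loses the efficiency benefits.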
|
|
#### Recommendations

* On disk
  * For small, configuration-level tables, use .tsv
  * For large data, use .parquet
    * Smaller than .csv/.tsv
    * Robust open-source libraries in every major language can read and write .parquet files faster than .csv/.tsv
* In memory
  * Use `dplyr::tibble` / pandas `DataFrame` for data science tables
  * Use numpy arrays / PyTorch datasets for machine learning