## **3 Create Dataset**

#### Set up Personal Access Tokens (PAT)
See the help page on how to set up [security tokens](https://huggingface.co/docs/hub/en/security-tokens). A token is needed to clone/push the repository using git.

* Navigate to: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
* Click "Create New Token" → fill out the information
* Save the token, e.g. in a password manager

After retrieving your personal access token, you can set up git with HuggingFace via the command line. Briefly, this looks like:
```
pip install huggingface_hub
huggingface-cli login
```

#### Data processing workflow overview

1. Create pilot datasets in your personal space and then, once ready, transfer them to the Rosetta Data Bazaar collection
   1. Click your name icon → [New → Dataset](https://huggingface.co/new)
      1. Fill out the dataset name
      2. Navigate to "Files and Versions" → README.md
      3. Fill out the top Dataset Card metadata (you can come back and fill out more details later)
   2. Web workflow
      1. Edit README.md directly in the browser
      2. Upload/delete other files directly
   3. Add any data processing scripts/workflows for reproducibility
      1. `git clone https://huggingface.co/datasets/<username>/<repo-name>`
      2. Create an analysis folder structure, such as:
         ```
         src/           # scripts for data curation
         data/          # raw data for processing/curation
         intermediate/  # processed/curated data for uploading
         ```
      3. Add a `.gitignore`:
         ```
         data/*
         intermediate/*
         ```
      4. Use the standard git workflow for modifying README.md and curation scripts
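The repository-setup steps above can be sketched as a shell session. This is a local sketch only: `example_dataset` and the curator identity are placeholder values, and the actual clone step (which needs network access and a configured token) is shown as a comment.

```shell
# The real first step (requires network and a configured token) would be:
#   git clone https://huggingface.co/datasets/<username>/<repo-name>
# Here we initialize a local repository to sketch the remaining steps.
git init example_dataset

# Create the analysis folder structure
mkdir -p example_dataset/src example_dataset/data example_dataset/intermediate

# Keep raw and intermediate data out of version control
printf 'data/*\nintermediate/*\n' > example_dataset/.gitignore

# Commit with the standard git workflow (placeholder identity, set locally)
git -C example_dataset config user.email "curator@example.com"
git -C example_dataset config user.name "Curator"
git -C example_dataset add .gitignore
git -C example_dataset commit -m "Add folder structure and .gitignore"
```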

#### Uploading data to HuggingFace

Steps to upload the data:

1. Create the dataset locally using `datasets.load_dataset(...)`
2. Call `dataset.push_to_hub(...)` to upload the data

For example:

```
import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files = "outcomes.csv",
    keep_in_memory = True)

dataset.push_to_hub(repo_id = "maomlab/example_dataset")
```

***NOTE: Don't just drag-and-drop data into the web interface, as it won't be possible to download the data remotely using `datasets.load_dataset(...)`***

If your dataset is more complex:

* see the section "**Structure of data in a HuggingFace datasets**" below for guidance on how to organize the dataset
* see other datasets in the Rosetta Data Bazaar

#### Downloading data from HuggingFace

To load the dataset remotely:

```
dataset = datasets.load_dataset(path = repo_id)
```

Optionally, select a specific split and/or columns to download a subset:

```
dataset_tag = "<dataset_tag>"
dataset = datasets.load_dataset(
    path = repo_id,
    name = dataset_tag,
    data_dir = dataset_tag,
    cache_dir = cache_dir,
    keep_in_memory = True)
```

If needed, convert the data to pandas:

```
import pandas as pd
df = dataset["train"].to_pandas()
```