| title: README | |
| emoji: π’ | |
| colorFrom: purple | |
| colorTo: yellow | |
| sdk: docker | |
| pinned: false | |
| # Reuben Data Lab | |
| > Work here was produced for the | |
| > **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)** | |
| > hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**, with credit to | |
| > **Adaptive Data by Adaption** for organizing the hackathon. | |
| We build **open datasets for underserved domains**, for training and | |
| evaluating modern audio, speech, and multimodal models. Every release is | |
| open-sourced on Hugging Face with permissive licensing and rich metadata. | |
| ## Interactive dashboard | |
| An interactive explorer with donut charts, a language Voronoi treemap, | |
| and a full dataset breakdown is published as a separate Space: | |
| **[huggingface.co/spaces/Reubencf/dataset-explorer](https://huggingface.co/spaces/Reubencf/dataset-explorer)** | |
| ## Collections | |
| Datasets are grouped into four collections under this org. | |
| - [Audio](https://huggingface.co/collections/ReubenDataLab/audio): | |
| FMA-labeled music, multilingual synthetic TTS, PolyglotAudio. | |
| - [Text](https://huggingface.co/collections/ReubenDataLab/text): | |
| PolyglotText and current-affairs 2023-2026. | |
| - [Images](https://huggingface.co/collections/ReubenDataLab/images): | |
| globally sampled street imagery, multilingual magazine OCR + VQA. | |
| - [Coding](https://huggingface.co/collections/ReubenDataLab/coding): | |
| frontend HTML / Tailwind / JS prompts. | |
| The full Adaption-remastered versions live in the | |
| [Reubencf/proper-adaption](https://huggingface.co/collections/Reubencf/proper-adaption) | |
| collection. | |
| ## Focus areas | |
| - **Under-resourced languages**: speech and text coverage beyond | |
| English-only datasets. | |
| - **Rich supervision**: structured metadata per row (genre/mood/BPM/key | |
| for music; language/style/voice for speech; geographic and scene | |
| classification for imagery), rather than raw data with class labels alone. | |
| - **Permissive licensing**: Creative Commons, CC0, or MIT where possible. | |
| - **Reproducibility**: generation pipelines and labeling scripts are | |
| open-sourced alongside the data. | |
| ## Tooling | |
| - **Labeling**: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R | |
| with RAG grounding. | |
| - **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with | |
| zero-shot voice cloning. | |
| - **Remastering**: Adaption's Adaptive Data platform. | |
| - **Distribution**: Hugging Face Hub, with HF Jobs for on-platform | |
| data processing. | |
| ## Contact | |
| - Hugging Face: [@Reubencf](https://huggingface.co/Reubencf) | |
| - Org home: [ReubenDataLab](https://huggingface.co/ReubenDataLab) | |