---
title: README
emoji: 🏢
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---

# Reuben Data Lab

> 🏆 Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)**. Credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

Building **open, underserved datasets** for training and evaluating modern audio, speech, and multimodal models. Every release is open-sourced on Hugging Face with permissive licensing and rich metadata.

## Interactive dashboard

An interactive explorer with donut charts, a language Voronoi treemap, and a full dataset breakdown is published as a separate Space:

**👉 [huggingface.co/spaces/Reubencf/dataset-explorer](https://huggingface.co/spaces/Reubencf/dataset-explorer)**

## Collections

Datasets are grouped into four collections under this org.

- [🎧 Audio](https://huggingface.co/collections/ReubenDataLab/audio): FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
- [📝 Text](https://huggingface.co/collections/ReubenDataLab/text): PolyglotText and current-affairs 2023-2026.
- [🖼️ Images](https://huggingface.co/collections/ReubenDataLab/images): globally-sampled street imagery, multilingual magazine OCR + VQA.
- [💻 Coding](https://huggingface.co/collections/ReubenDataLab/coding): frontend HTML / Tailwind / JS prompts.

The full Adaption-remastered versions live in the [Reubencf/proper-adaption](https://huggingface.co/collections/Reubencf/proper-adaption) collection.

## Focus areas

- **Under-resourced languages**: speech and text coverage beyond English-only datasets.
- **Rich supervision**: structured metadata per row (genre/mood/BPM/key for music; language/style/voice for speech; geographic and scene classification for imagery), not just raw data + class labels.
- **Permissive licensing**: Creative Commons, CC-0, MIT where possible.
- **Reproducibility**: generation pipelines and labeling scripts are open-sourced alongside the data.

## Tooling

- **Labeling**: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R with RAG grounding.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot voice cloning.
- **Remastering**: Adaption's Adaptive Data platform.
- **Distribution**: Hugging Face Hub, with HF Jobs for on-platform data processing.

## Contact

- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Org home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
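
As a rough illustration of the per-row metadata described under Focus areas, the sketch below models a music row and a speech row as Python dataclasses. The class names, field names, and sample values are hypothetical and chosen only for illustration; the released datasets may use different column names and types, so check each dataset card before relying on them.

```python
from dataclasses import asdict, dataclass


# Hypothetical schema: field names are assumptions, not the datasets' real columns.
@dataclass(frozen=True)
class MusicRow:
    audio_path: str
    genre: str
    mood: str
    bpm: float
    key: str

    def __post_init__(self) -> None:
        # A tempo must be a positive number of beats per minute.
        if self.bpm <= 0:
            raise ValueError("bpm must be positive")


# Hypothetical schema for a synthetic speech row.
@dataclass(frozen=True)
class SpeechRow:
    audio_path: str
    language: str
    style: str
    voice: str


def to_record(row) -> dict:
    """Convert a row dataclass into a plain dict, e.g. for JSON Lines export."""
    return asdict(row)


if __name__ == "__main__":
    # Made-up sample values, not taken from any released dataset.
    music = MusicRow("clips/0001.wav", "ambient", "calm", 92.0, "A minor")
    speech = SpeechRow("tts/0001.wav", "sw", "narration", "voice_a")
    print(to_record(music))
    print(to_record(speech))
```

Keeping the metadata in typed fields, rather than in a single class label, is what lets a consumer filter or stratify rows by attributes such as tempo or language.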