Spaces:
No application file
No application file
File size: 2,597 Bytes
730f9d2 ac3dd75 730f9d2 ac3dd75 730f9d2 34bf0cd ac3dd75 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 | ---
title: README
emoji: π’
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---
# Reuben Data Lab
> π Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)** β credit to
> **Adaptive Data by Adaption** for organizing the hackathon.
Building **open, underserved datasets** for training and evaluating modern
audio, speech, and multimodal models. Every release is open-sourced on
Hugging Face with permissive licensing and rich metadata.
## Interactive dashboard
An interactive explorer with donut charts, a language Voronoi treemap,
and a full dataset breakdown is published as a separate Space:
**π [huggingface.co/spaces/Reubencf/dataset-explorer](https://huggingface.co/spaces/Reubencf/dataset-explorer)**
## Collections
Datasets are grouped into four collections under this org.
- [π§ Audio](https://huggingface.co/collections/ReubenDataLab/audio) β
FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
- [π Text](https://huggingface.co/collections/ReubenDataLab/text) β
PolyglotText and current-affairs 2023-2026.
- [πΌοΈ Images](https://huggingface.co/collections/ReubenDataLab/images) β
globally-sampled street imagery, multilingual magazine OCR + VQA.
- [π» Coding](https://huggingface.co/collections/ReubenDataLab/coding) β
frontend HTML / Tailwind / JS prompts.
The full Adaption-remastered versions live in the
[Reubencf/proper-adaption](https://huggingface.co/collections/Reubencf/proper-adaption)
collection.
## Focus areas
- **Under-resourced languages** β speech and text coverage beyond
English-only datasets.
- **Rich supervision** β structured metadata per row (genre/mood/BPM/key
for music; language/style/voice for speech; geographic and scene
classification for imagery), not just raw data + class labels.
- **Permissive licensing** β Creative Commons, CC-0, MIT where possible.
- **Reproducibility** β generation pipelines and labeling scripts are
open-sourced alongside the data.
## Tooling
- **Labeling**: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R
with RAG grounding.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2Γ H100 with
zero-shot voice cloning.
- **Remastering**: Adaption's Adaptive Data platform.
- **Distribution**: Hugging Face Hub, with HF Jobs for on-platform
data processing.
## Contact
- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Org home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)
|