metadata
title: README
emoji: π’
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
Reuben Data Lab
Work here was produced for the Uncharted Data Challenge hosted by Adaption Labs. Credit to Adaptive Data by Adaption for organizing the hackathon.
Building open datasets for underserved areas of training and evaluating modern audio, speech, and multimodal models. Every release is open-sourced on Hugging Face with permissive licensing and rich metadata.
Interactive dashboard
An interactive explorer with donut charts, a language Voronoi treemap, and a full dataset breakdown is published as a separate Space:
huggingface.co/spaces/Reubencf/dataset-explorer
Collections
Datasets are grouped into four collections under this org.
- Audio: FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
- Text: PolyglotText and current-affairs 2023-2026.
- Images: globally-sampled street imagery, multilingual magazine OCR + VQA.
- Coding: frontend HTML / Tailwind / JS prompts.
The full Adaption-remastered versions live in the Reubencf/proper-adaption collection.
Focus areas
- Under-resourced languages: speech and text coverage beyond English-only datasets.
- Rich supervision: structured metadata per row (genre/mood/BPM/key for music; language/style/voice for speech; geographic and scene classification for imagery), not just raw data plus class labels.
- Permissive licensing: Creative Commons, CC0, MIT where possible.
- Reproducibility: generation pipelines and labeling scripts are open-sourced alongside the data.
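To make "structured metadata per row" concrete, here is a minimal sketch of what such rows might look like in Python. The metadata fields (genre, mood, BPM, key; language, style, voice; region, scene) come from the list above. The class names, the `audio_path`, `image_path`, `text`, and `license` fields, and the `filter_by_language` helper are hypothetical illustrations, not the actual column names of any released dataset.

```python
from dataclasses import dataclass, asdict
from typing import Iterable, List


@dataclass
class MusicRow:
    audio_path: str   # hypothetical column name
    genre: str
    mood: str
    bpm: float
    key: str
    license: str      # hypothetical column name


@dataclass
class SpeechRow:
    audio_path: str   # hypothetical column name
    text: str         # hypothetical column name
    language: str
    style: str
    voice: str
    license: str      # hypothetical column name


@dataclass
class ImageRow:
    image_path: str   # hypothetical column name
    region: str       # geographic classification
    scene: str        # scene classification
    license: str      # hypothetical column name


def filter_by_language(rows: Iterable[SpeechRow], languages: Iterable[str]) -> List[SpeechRow]:
    """Keep only speech rows whose language code is in `languages` (case-insensitive)."""
    wanted = {lang.lower() for lang in languages}
    return [row for row in rows if row.language.lower() in wanted]


# Illustrative values only; these are not rows from a real release.
example = MusicRow("track_0001.mp3", "jazz", "calm", 96.0, "C minor", "CC0")
example_as_dict = asdict(example)  # plain dict, ready to serialize as one row
```

Keeping the metadata in typed fields rather than a single free-text label is what makes per-language or per-genre filtering a one-line operation.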
Tooling
- Labeling: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R with RAG grounding.
- Speech synthesis: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot voice cloning.
- Remastering: Adaption's Adaptive Data platform.
- Distribution: Hugging Face Hub, with HF Jobs for on-platform data processing.
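Model-based labeling only helps if the replies are checked before they become metadata. The sketch below shows one way a labeler's raw reply could be validated, assuming the model is prompted to answer in JSON. The field names follow the music metadata listed under Focus areas. The function name `parse_music_labels`, the lowercase normalization, and the BPM bounds are assumptions made for illustration and are not taken from the released labeling scripts.

```python
import json
from typing import Optional

# Field names taken from the music metadata listed under Focus areas.
REQUIRED_FIELDS = ("genre", "mood", "bpm", "key")


def parse_music_labels(raw_reply: str) -> Optional[dict]:
    """Return a cleaned label dict, or None if the reply is unusable."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(data, dict):
        return None
    if any(field not in data for field in REQUIRED_FIELDS):
        return None  # a required label is missing
    try:
        bpm = float(data["bpm"])
    except (TypeError, ValueError):
        return None
    # Hypothetical sanity bounds; adjust to the corpus at hand.
    if not 20.0 <= bpm <= 300.0:
        return None
    return {
        "genre": str(data["genre"]).strip().lower(),
        "mood": str(data["mood"]).strip().lower(),
        "bpm": bpm,
        "key": str(data["key"]).strip(),
    }
```

Rejected replies return `None`, so a pipeline can retry the prompt or route the item to manual review instead of writing a partial row.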
Contact
- Hugging Face: @Reubencf
- Org home: ReubenDataLab