Reubencf

Revert to plain markdown README; link out to standalone dataset-explorer Space (#5)

metadata

title: README
emoji: 🏢
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false

Reuben Data Lab

🏆 The work here was produced for the Uncharted Data Challenge hosted by Adaption Labs. Credit to Adaptive Data by Adaption for organizing the hackathon.

We build open datasets that fill underserved gaps in the data available for training and evaluating modern audio, speech, and multimodal models. Every release is open-sourced on Hugging Face with permissive licensing and rich metadata.

Interactive dashboard

An interactive explorer with donut charts, a language Voronoi treemap, and a full dataset breakdown is published as a separate Space:

👉 huggingface.co/spaces/Reubencf/dataset-explorer

Collections

Datasets are grouped into four collections under this org.

🎧 Audio — FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
📝 Text — PolyglotText and current-affairs 2023-2026.
🖼️ Images — globally-sampled street imagery, multilingual magazine OCR + VQA.
💻 Coding — frontend HTML / Tailwind / JS prompts.

The full Adaption-remastered versions live in the Reubencf/proper-adaption collection.

Focus areas

Under-resourced languages: speech and text coverage beyond English-only datasets.
Rich supervision: structured metadata per row (genre/mood/BPM/key for music; language/style/voice for speech; geographic and scene classification for imagery), rather than raw data with class labels alone.
Permissive licensing: Creative Commons (including CC0) and MIT where possible.
Reproducibility: generation pipelines and labeling scripts are open-sourced alongside the data.
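
To illustrate what per-row structured metadata makes possible, here is a minimal Python sketch of a music row and a tempo filter. The class name `MusicRow`, the field names, the helper `filter_by_tempo`, and the three sample rows are all hypothetical and made up for illustration; the actual column names and values in each dataset may differ, so check the dataset card before relying on them.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class MusicRow:
    """One hypothetical row of a music dataset (field names are assumptions)."""
    audio_path: str
    genre: str
    mood: str
    bpm: Optional[float]  # None when no tempo label is available
    key: Optional[str]    # None when no key label is available


def filter_by_tempo(rows: Iterable[MusicRow], low: float, high: float) -> List[MusicRow]:
    """Keep rows whose BPM is known and lies in the closed range [low, high]."""
    return [r for r in rows if r.bpm is not None and low <= r.bpm <= high]


# Made-up sample rows, not taken from any released dataset.
rows = [
    MusicRow("clips/0001.mp3", "folk", "calm", 92.0, "G major"),
    MusicRow("clips/0002.mp3", "electronic", "energetic", 128.0, "A minor"),
    MusicRow("clips/0003.mp3", "ambient", "calm", None, None),
]

selected = filter_by_tempo(rows, 120.0, 140.0)
print([r.audio_path for r in selected])  # prints ['clips/0002.mp3']
```

With only a class label per clip, a query like "tracks between 120 and 140 BPM" would require re-analyzing the audio; with the metadata stored per row it becomes a simple filter.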

Tooling

Labeling: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R with RAG grounding.
Speech synthesis: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with zero-shot voice cloning.
Remastering: Adaption's Adaptive Data platform.
Distribution: Hugging Face Hub, with HF Jobs for on-platform data processing.

Contact

Hugging Face: @Reubencf
Org home: ReubenDataLab