
Reubencf

Revert to plain markdown README; link out to standalone dataset-explorer Space (#5)

ac3dd75 5 days ago


	---
	title: README
	emoji: 🏢
	colorFrom: purple
	colorTo: yellow
	sdk: docker
	pinned: false
	---

	# Reuben Data Lab

	> 🏆 Work here was produced for the
	> [Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)
	> hosted by [Adaption Labs](https://www.adaptionlabs.ai) — credit to
	> Adaptive Data by Adaption for organizing the hackathon.

	Reuben Data Lab builds open datasets that fill underserved gaps in
	training and evaluating modern audio, speech, and multimodal models.
	Every release is open-sourced on Hugging Face with permissive licensing
	and rich metadata.

	## Interactive dashboard

	An interactive explorer with donut charts, a language Voronoi treemap,
	and a full dataset breakdown is published as a separate Space:

	👉 [huggingface.co/spaces/Reubencf/dataset-explorer](https://huggingface.co/spaces/Reubencf/dataset-explorer)

	## Collections

	Datasets are grouped into four collections under this org.

	- [🎧 Audio](https://huggingface.co/collections/ReubenDataLab/audio) —
	FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
	- [📝 Text](https://huggingface.co/collections/ReubenDataLab/text) —
	PolyglotText and current-affairs 2023-2026.
	- [🖼️ Images](https://huggingface.co/collections/ReubenDataLab/images) —
	globally sampled street imagery, multilingual magazine OCR + VQA.
	- [💻 Coding](https://huggingface.co/collections/ReubenDataLab/coding) —
	frontend HTML / Tailwind / JS prompts.

	The full Adaption-remastered versions live in the
	[Reubencf/proper-adaption](https://huggingface.co/collections/Reubencf/proper-adaption)
	collection.

	## Focus areas

	- Under-resourced languages — speech and text coverage beyond
	English-only datasets.
	- Rich supervision — structured metadata per row (genre/mood/BPM/key
	for music; language/style/voice for speech; geographic and scene
	classification for imagery), not just raw data + class labels.
	- Permissive licensing — Creative Commons, CC-0, MIT where possible.
	- Reproducibility — generation pipelines and labeling scripts are
	open-sourced alongside the data.
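
To make the "rich supervision" idea concrete, here is a minimal Python sketch of what per-row metadata could look like for the music and speech datasets, and how it lets you select rows by language. The class names, field names, file paths, and sample values below are illustrative assumptions; the actual column names in each dataset may differ, so check the dataset card before relying on them.

```python
from dataclasses import dataclass


# Hypothetical schemas: field names mirror the metadata listed above
# (genre/mood/BPM/key for music; language/style/voice for speech),
# but they are assumptions, not the datasets' real column names.
@dataclass
class MusicRow:
    audio_path: str
    genre: str
    mood: str
    bpm: float
    key: str
    license: str


@dataclass
class SpeechRow:
    audio_path: str
    language: str  # language code, e.g. ISO 639-3
    style: str
    voice: str
    license: str


def filter_by_language(rows, languages):
    """Return the speech rows whose language code is in `languages`.

    Matching is case-insensitive.
    """
    wanted = {lang.lower() for lang in languages}
    return [row for row in rows if row.language.lower() in wanted]


# Invented sample rows, for illustration only.
track = MusicRow("tracks/0001.mp3", "folk", "calm", 96.0, "D minor", "CC-BY-4.0")
rows = [
    SpeechRow("clips/0001.wav", "kok", "narration", "voice_a", "CC0-1.0"),
    SpeechRow("clips/0002.wav", "en", "conversational", "voice_b", "CC0-1.0"),
]

selected = filter_by_language(rows, ["kok"])
print([row.audio_path for row in selected])
```

Because every row carries structured fields rather than a single class label, selecting a subset such as one language, one license, or a BPM range is a plain filter over the metadata and never requires touching the audio itself.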

	## Tooling

	- Labeling: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R
	with RAG grounding.
	- Speech synthesis: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with
	zero-shot voice cloning.
	- Remastering: Adaption's Adaptive Data platform.
	- Distribution: Hugging Face Hub, with HF Jobs for on-platform
	data processing.

	## Contact

	- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
	- Org home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)