---
title: README
emoji: 🏒
colorFrom: purple
colorTo: yellow
sdk: docker
pinned: false
---

# Reuben Data Lab

> πŸ† Work here was produced for the
> **[Uncharted Data Challenge](https://www.adaptionlabs.ai/blog/the-uncharted-data-challenge)**
> hosted by **[Adaption Labs](https://www.adaptionlabs.ai)** β€” credit to
> **Adaptive Data by Adaption** for organizing the hackathon.

Building **open, underserved datasets** for training and evaluating modern
audio, speech, and multimodal models. Every release is open-sourced on
Hugging Face with permissive licensing and rich metadata.

## Interactive dashboard

An interactive explorer with donut charts, a language Voronoi treemap,
and a full dataset breakdown is published as a separate Space:

**👉 [huggingface.co/spaces/Reubencf/dataset-explorer](https://huggingface.co/spaces/Reubencf/dataset-explorer)**

## Collections

Datasets are grouped into four collections under this org.

- [🎧 Audio](https://huggingface.co/collections/ReubenDataLab/audio):
  FMA-labeled music, multilingual synthetic TTS, PolyglotAudio.
- [πŸ“ Text](https://huggingface.co/collections/ReubenDataLab/text) β€”
  PolyglotText and current-affairs 2023-2026.
- [πŸ–ΌοΈ Images](https://huggingface.co/collections/ReubenDataLab/images) β€”
  globally-sampled street imagery, multilingual magazine OCR + VQA.
- [πŸ’» Coding](https://huggingface.co/collections/ReubenDataLab/coding) β€”
  frontend HTML / Tailwind / JS prompts.

The full Adaption-remastered versions live in the
[Reubencf/proper-adaption](https://huggingface.co/collections/Reubencf/proper-adaption)
collection.

## Focus areas

- **Under-resourced languages**: speech and text coverage beyond
  English-only datasets.
- **Rich supervision**: structured metadata per row (genre/mood/BPM/key
  for music; language/style/voice for speech; geographic and scene
  classification for imagery), not just raw data + class labels.
- **Permissive licensing**: Creative Commons, CC-0, MIT where possible.
- **Reproducibility**: generation pipelines and labeling scripts are
  open-sourced alongside the data.
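
To make "structured metadata per row" concrete, here is a minimal Python sketch of what a music row might look like. The class name, field names, example values, and the 20-300 BPM range are all illustrative assumptions; they are not the actual schema of any dataset in this org, so check each dataset card for the real columns.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class MusicRow:
    """Hypothetical per-row record for a labeled music clip (assumed fields)."""
    audio_path: str  # assumed: relative path to the audio file
    genre: str
    mood: str
    bpm: float
    key: str
    license: str     # assumed: a license identifier string


def validate_music_row(row: MusicRow) -> List[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    if not row.genre:
        problems.append("genre is empty")
    # The 20-300 BPM window is an assumption made for this sketch only.
    if not (20 <= row.bpm <= 300):
        problems.append(f"bpm {row.bpm} outside the assumed 20-300 range")
    return problems


# Example row with made-up values.
example = MusicRow(
    audio_path="clips/0001.mp3",
    genre="jazz",
    mood="calm",
    bpm=92.0,
    key="D minor",
    license="CC-BY-4.0",
)
print(asdict(example))
print(validate_music_row(example))
```

A flat record like this keeps every label queryable as its own column, which is what distinguishes it from a single class label per clip.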

## Tooling

- **Labeling**: Google Gemini, Gemma 4 31B via vLLM, Cohere Command R
  with RAG grounding.
- **Speech synthesis**: Qwen3-TTS-12Hz-1.7B-Base on 2× H100 with
  zero-shot voice cloning.
- **Remastering**: Adaption's Adaptive Data platform.
- **Distribution**: Hugging Face Hub, with HF Jobs for on-platform
  data processing.

## Contact

- Hugging Face: [@Reubencf](https://huggingface.co/Reubencf)
- Org home: [ReubenDataLab](https://huggingface.co/ReubenDataLab)