Spaces:
Sleeping
Sleeping
| title: Sahel-Voice-Lab — Minimal | |
| emoji: 🌍 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: gradio | |
| sdk_version: "5.25.0" | |
| app_file: app_minimal.py | |
| hardware: cpu-basic | |
| pinned: false | |
| license: mit | |
| tags: | |
| - bambara | |
| - fula | |
| - speech-recognition | |
| - text-to-speech | |
| - agriculture | |
| - iot | |
| - language-learning | |
| - west-africa | |
| - low-resource-nlp | |
| - memory | |
| # 🌍 Sahel-Voice-Lab | |
| **A voice-first AI assistant for Bambara (Mali) and Fula/Pular (Guinea, Senegal).** | |
| Two intertwined jobs: | |
| 1. **Memory loop** — users *teach* the assistant new words; it persists them to a HuggingFace dataset and uses them as the source of truth in future answers. | |
| 2. **Agricultural IoT voice interface** — Sahelian farmers query soil, weather, irrigation, and pest data in their own language, short answers, ≤ 6 words per sentence for clean TTS. | |
| The core stack is explicitly **100% non-Meta** (Whisper / Aya-Expanse / F5-TTS / VITS); MMS-TTS is only used as a baseline fallback. | |
| --- | |
| ## What this Space currently runs — the `ground-zero` minimal baseline | |
| The deployed Space (`app_file: app_minimal.py`) is the **Month 1–3 rebuild** | |
| baseline — a stripped-down Whisper → LLM → MMS-TTS pipeline used for field | |
| testing and to build a real-user eval set. No LoRA adapters, no memory loop, | |
| no speaker ID, no voice cloning, no IoT, no phrase matcher. Everything in | |
| `app.py` still exists for the full production stack; it is just not what the | |
| Space serves today. | |
| Three stacked changes land dialect fidelity without any training: | |
| 1. **Stage 1 — dialect-pinned system prompt** (`src/llm/minimal_client.py`). | |
| Replaces the `GemmaClient` JSON/teacher flow with a plain-text client whose | |
| system prompt pins the target dialect explicitly — *Bambara as spoken in | |
| Bamako, Mali* and *Pular of Fuuta Jallon, as spoken in Guinea* — names the | |
| languages the model must **not** drift into (Wolof, Hausa, Pulaar of | |
| Senegal, Fulfulde of Nigeria, Jula of Côte d'Ivoire), and injects a 30-pair | |
| bilingual gold list as few-shot anchoring | |
| (`configs/dialect_anchors/{bambara_mali,pular_guinea}.json`). | |
| 2. **Stage 2 — curated phrasebook short-circuit** (`src/llm/phrasebook.py`). | |
| Before calling the LLM, the user's input is normalised and fuzzy-matched | |
| (threshold 0.88) against a curated English-keyed phrasebook | |
| (`configs/dialect_anchors/{bambara,pular}_phrasebook.json` — 100 Bambara / | |
| 110 Pular entries across greetings, family, food, farming, health, | |
| shopping, travel, clarity, time, parting). A hit returns the gold | |
| translation directly — zero LLM risk, zero latency. | |
| 3. **Stage 3 — better multilingual base LLM.** | |
| Default `LLM_MODEL_ID` is now **`CohereLabs/aya-expanse-32b`**, a 23-language | |
| multilingual model with much stronger West African coverage than Qwen | |
| 2.5-7B. Can be overridden via the `LLM_MODEL_ID` env var (e.g. to | |
| `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not | |
| available on your HF account. | |
| 4. **Stage 4 — split translate / reply UI + per-turn telemetry + RAG few-shot.** | |
| Both Voice and Text tabs use a 4-box layout: phrasebook translation (text | |
| + audio) is automatic on submit (no LLM), and a separate **Generate reply** | |
| button calls the dialect-anchored LLM for a conversational response. On a | |
| phrasebook miss the LLM is RAG-injected with the top-3 nearest curated | |
| pairs as additional style anchoring. Every turn is appended to | |
| `data/field_turns.jsonl` (`src/engine/turn_logger.py`) with phase, latency | |
| breakdown, phrasebook hit, and reply — the substrate for hit-rate | |
| measurement, A/B comparisons, and eventual Stage-5 LoRA training-data | |
| curation. The system prompt now also explicitly tells the LLM to **reply, | |
| not translate** — the few-shot pairs are framed as style/orthography | |
| references only, fixing the "the LLM just echoes the phrasebook target" | |
| regression. | |
| See `docs/baseline_rebuild.md` for the broader minimal-track plan. | |
| --- | |
| ## Status | |
| | Phase | Feature | State | | |
| |------:|---------|-------| | |
| | 1 | Memory loop (JSONL + HF Hub) | ✅ shipped | | |
| | 2 | Waxal VITS TTS — Bambara | ✅ shipped | | |
| | 2 | Waxal VITS TTS — Fula | ⏳ placeholder until `ous-sow/fula-tts` is trained | | |
| | 3 | Voice-to-voice S2S (F5-TTS + CER) | 🚧 merged, stabilizing | | |
| | — | Adlam ↔ Latin round-trip, per-language prompts | ✅ landed | | |
| See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` for the parallel minimal-track strategy. | |
| --- | |
| ## Stack | |
| | Layer | Tool | | |
| |-------|------| | |
| | STT | `openai/whisper-large-v3-turbo` + PEFT LoRA hot-swap (~50 MB adapter per language, ~50 ms switch) | | |
| | LLM | `CohereLabs/aya-expanse-32b` (minimal-baseline default, strong African-language coverage) via HF Serverless InferenceClient — overridable to `Qwen/Qwen2.5-72B-Instruct`, `Qwen2.5-7B-Instruct`, Mistral, Zephyr | | |
| | Dialect anchoring (minimal) | `src/llm/minimal_client.py` — pinned Bambara-Mali / Pular-Guinea system prompt with 30-pair bilingual few-shot + forbidden-drift guardrails | | |
| | Phrasebook short-circuit (minimal) | `src/llm/phrasebook.py` — 100 Bambara + 110 Pular curated gold pairs, fuzzy-matched (0.88 threshold) before any LLM call | | |
| | TTS (baseline) | `facebook/mms-tts-bam`, `facebook/mms-tts-ful` | | |
| | TTS (Bambara) | `ynnov/ekodi-bambara-tts-female` (Waxal VITS) | | |
| | TTS (Fula) | placeholder → `ous-sow/fula-tts` when published | | |
| | Voice cloning | F5-TTS + OpenVoice V2 (Phase 3, GPU-only) | | |
| | Speaker ID | SpeechBrain ECAPA-TDNN, 192-d embeddings, cosine ≥ 0.75 | | |
| | Fast path | RapidFuzz over `data/phrases/{lang}.json` for greetings / thanks / farewells | | |
| | Persistence | JSONL on disk + HF Hub datasets (no ORM) | | |
| | Training | PEFT LoRA + `Seq2SeqTrainer` on FLEURS, Jeli-ASR, SLR 105/106 | | |
| --- | |
| ## Three entry points (do not conflate) | |
| | File | Purpose | Lifecycle | | |
| |------|---------|-----------| | |
| | `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit + RAG few-shot on miss + per-turn JSONL telemetry. Tabs: Voice / Text, each with split translation (phrasebook, automatic) and reply (LLM, on demand). | `python app_minimal.py` | | |
| | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` | | |
| | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` | | |
| | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` | | |
| --- | |
| ## Repository layout | |
| ``` | |
| app.py # Gradio (production, HF Spaces) | |
| app_lab.py # Gradio (experimental) | |
| requirements.txt # Spaces runtime — do NOT pin torch/torchaudio | |
| packages.txt # apt deps (ffmpeg) | |
| configs/ | |
| base_config.yaml # shared settings | |
| api_config.yaml # FastAPI-specific | |
| lora_bambara.yaml # Bambara LoRA hyperparams | |
| lora_fula.yaml # Fula LoRA hyperparams | |
| data/ | |
| phrases/ # RapidFuzz shortcut phrase JSONs per language | |
| vocabulary.jsonl # local mirror of the HF Hub memory dataset | |
| docs/ | |
| roadmap_2026-04.md # full architectural walkthrough + action plan | |
| baseline_rebuild.md # parallel minimal-track plan (non-destructive) | |
| notebook_collaboration.md # Kaggle push/pull workflow for contributors | |
| kaggle_mcp_setup.md # optional Kaggle MCP for Claude Desktop | |
| notebooks/ | |
| kaggle_master_trainer/ # -> oussow/kaggle-master-trainer (LoRA fine-tune) | |
| train_fula_tts/ # -> oussow/sahel-voice-fula-tts-trainer (TBD) | |
| bootstrap_repos.ipynb | |
| train_colab.ipynb # legacy Colab trainer | |
| scripts/ | |
| train_bambara.py # LoRA fine-tune entrypoint (Kaggle/RunPod) | |
| train_fula.py # LoRA fine-tune entrypoint (Kaggle/RunPod) | |
| export_onnx.py # merge LoRA -> ONNX -> TFLite | |
| verify_baseline.py # eval harness | |
| run_server.py # FastAPI launcher | |
| run_data_pipeline.py # dataset prep | |
| push_to_hf.sh # deploy helpers | |
| push_to_kaggle.sh # deploy helpers | |
| runpod_setup.sh | |
| src/ | |
| api/ # FastAPI app, schemas, routes, middleware | |
| conversation/ # memory_manager, gemma_client, phrase_matcher, intent_parser | |
| data/ # dataset loading + normalization (Adlam, Bambara) | |
| engine/ # adapter_manager, transcriber, stt_processor, curiosity | |
| iot/ # intent_parser, voice_responder, sensor_bridge | |
| llm/ # LLM client wrappers | |
| memory/ # vocabulary persistence | |
| optimization/ # ONNX / quantization helpers | |
| training/ # trainer, callbacks, augmenters | |
| tts/ # mms_tts, waxal_tts, f5_tts, voice_cloner | |
| voice/ # speaker_profiles (ECAPA-TDNN + OpenVoice SE) | |
| tests/ # pytest — api, data pipeline, engine, iot | |
| ``` | |
| --- | |
| ## How the memory loop works | |
| 1. Press **Push-to-Talk** → speak in Bambara, Fula, French, or English. | |
| 2. **Whisper** transcribes. If the language has a LoRA adapter loaded, `AdapterManager` hot-swaps to it (~50 ms). | |
| 3. **Qwen** reads the vocabulary it has learned so far (`MemoryManager.get_vocabulary_context()`), then returns a structured JSON reply with `intent ∈ {teaching, question, conversation, error}`. | |
| 4. If `teaching`: the word pair is appended to `data/vocabulary.jsonl` and async-pushed to `ous-sow/sahel-agri-feedback → vocabulary.jsonl`. | |
| 5. If `question`: Qwen answers using the remembered vocabulary as source of truth. | |
| 6. If `conversation`: Qwen replies naturally. | |
| 7. TTS speaks the reply (Waxal VITS for Bambara, MMS-TTS fallback elsewhere). | |
| The last 5 learned words are always visible in the UI. | |
| --- | |
| ## How the agricultural voice interface works | |
| 1. User asks, e.g., *"A bɛ di wa?"* ("Is it OK?") referring to their field. | |
| 2. `intent_parser.py` (keyword-based) classifies the request: `check_soil` / `check_weather` / `irrigation_status` / `pest_alert` / etc. | |
| 3. `SensorBridge` calls the configured `SENSOR_API_URL` and returns a typed `SensorData`. | |
| 4. `voice_responder.py` maps `(Intent, SensorData)` → a short (≤ 6 words/sentence) Bambara or Fula reply + English translation. Alert thresholds are encoded here (`SOIL_MOISTURE_LOW=30`, `TEMP_HIGH=38`, pH bounds). | |
| 5. TTS speaks the reply. | |
| --- | |
| ## Environment variables | |
| All variables have sensible defaults, so you can boot the Space without any of them — but without `HF_TOKEN` the memory loop cannot push. | |
| ### Core | |
| | Key | Default | Purpose | | |
| |-----|---------|---------| | |
| | `HF_TOKEN` | — | HF write token. Required for Hub push and gated models. | | |
| | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | Memory-loop target dataset. | | |
| | `ADAPTER_REPO_ID` | `ous-sow/sahel-agri-adapters` | Published LoRA adapters. | | |
| | `WHISPER_MODEL_ID` | `openai/whisper-large-v3-turbo` | STT base model. | | |
| | `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` | LLM via HF Serverless. Override to any HF Serverless-supported model. | | |
| | `LOG_LEVEL` | `INFO` | Standard Python logging level. | | |
| | `DEVICE` | `cuda` (FastAPI) | Torch device for inference. | | |
| ### Adapters & TTS | |
| | Key | Default | | |
| |-----|---------| | |
| | `BAMBARA_ADAPTER_PATH` | `./adapters/bambara` | | |
| | `FULA_ADAPTER_PATH` | `./adapters/fula` | | |
| | `BAMBARA_TTS_REPO` | `ynnov/ekodi-bambara-tts-female` | | |
| | `FULA_TTS_REPO` | `ous-sow/fula-tts` | | |
| ### IoT | |
| | Key | Default | | |
| |-----|---------| | |
| | `SENSOR_API_URL` | *(unset → mock sensor)* | | |
| ### Self-Teaching tab (triggers Kaggle training runs) | |
| | Key | Default | | |
| |-----|---------| | |
| | `KAGGLE_USERNAME` | — | | |
| | `KAGGLE_KEY` | — | | |
| | `KAGGLE_KERNEL_SLUG` | `ous-sow/sahel-voice-master-trainer` *(override in prod to `oussow/kaggle-master-trainer` — the actual Kaggle owner slug)* | | |
| | `AUTO_TRAIN_THRESHOLD` | `50` | | |
| --- | |
| ## Run locally | |
| ```bash | |
| # Minimal baseline (what the Space runs) | |
| pip install -r requirements.txt | |
| python app_minimal.py | |
| # Full production UI (not currently on the Space) | |
| python app.py | |
| # FastAPI service | |
| python scripts/run_server.py | |
| # Experimental lab UI | |
| python app_lab.py | |
| ``` | |
| System-level dependency: **ffmpeg** (see `packages.txt`). | |
| --- | |
| ## Training | |
| LoRA fine-tuning runs on **Kaggle T4** or **RunPod** — not locally. Pick one entrypoint: | |
| | Target | Script | Notebook | | |
| |--------|--------|----------| | |
| | Bambara LoRA | `scripts/train_bambara.py` | `notebooks/kaggle_master_trainer/` | | |
| | Fula LoRA | `scripts/train_fula.py` | `notebooks/kaggle_master_trainer/` | | |
| | Fula TTS | — | `notebooks/train_fula_tts/` *(planned)* | | |
| **Contributor workflow:** edit notebooks locally in `notebooks/<slug>/`, commit with `nbstripout` keeping diffs clean, then `cd notebooks/<slug> && kaggle kernels push` to run on Kaggle GPU. Full walkthrough in `docs/notebook_collaboration.md`. | |
| `docs/kaggle_mcp_setup.md` documents the optional Kaggle MCP for Claude Desktop if you'd rather drive Kaggle from an LLM. | |
| --- | |
| ## Export for edge | |
| ```bash | |
| python scripts/export_onnx.py # merges LoRA into the backbone, exports ONNX | |
| # then onnx-tf → TFLite for Android | |
| ``` | |
| ONNX does not support LoRA hot-swap, so export one file per language. `bitsandbytes` NF4 / 8-bit quantization is available for GPU-constrained deploys but is a training-only dep (not in `requirements.txt`). | |
| --- | |
| ## Tests | |
| ```bash | |
| pytest tests/ | |
| ``` | |
| Covers: FastAPI routes, data pipeline, engine (adapter manager + transcriber), IoT (intent parser + voice responder). | |
| --- | |
| ## Space secrets (HF UI → Settings → Secrets) | |
| At minimum: | |
| | Key | Value | | |
| |-----|-------| | |
| | `HF_TOKEN` | write-scope token | | |
| | `FEEDBACK_REPO_ID` | `ous-sow/sahel-agri-feedback` | | |
| | `LLM_MODEL_ID` | `CohereLabs/aya-expanse-32b` (or any HF Serverless-supported model) | | |
| --- | |
| ## Design constraints (deliberate — do not change without discussion) | |
| - **Adapter hot-swap** via PEFT's multi-adapter API — one backbone in VRAM, ~50 MB adapters per language, `set_adapter` ≈ 50 ms. | |
| - **Qwen "adult-child" JSON contract** — structured `intent`/`reply`/`english`/`teaching_pair` output, parsed out of optional markdown fences. | |
| - **JSONL + Hub push memory** — no ORM, thread-safe `MemoryManager`, async push so UI never blocks. | |
| - **≤ 6 words per sentence** in `voice_responder.py` for clean MMS-TTS. | |
| - **Adlam ↔ Latin dual-script** handling in `adlam.py` + `bam_normalize.py`. | |
| - **Single-file `app.py`** — intentional for now; do not split without a plan. | |
| --- | |
| ## License | |
| MIT. | |