Space: MataStrategy/ground-zero

jefffffff9 Claude Opus 4.7 committed 23 days ago

Commit 9e99c2c (1 parent: 064d08b)

Stage 4: split translate/reply UI + CPU-safe TTS + reply-not-translate prompt

- 4-box layout on both Voice and Text tabs: phrasebook translation (text +
audio) is automatic on submit; "Generate reply" runs the dialect-anchored
LLM only when clicked.
- Shared gr.State carries the canonical input (typed text or Whisper
transcript) into the reply button so we never re-transcribe.
- Robust device resolution: probe cuda.device_count(), and have _synthesize
retry once on CPU when the CUDA path raises an AssertionError (fixes
"Torch not compiled with CUDA" on CPU-only laptops).
- System prompt now explicitly tells the LLM to REPLY conversationally and
reframes the curated few-shot pairs as style/orthography references only,
fixing the regression where the model would echo the phrasebook target
verbatim instead of replying.
- README: add Stage 4 entry + update entry-points table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
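
The device-resolution bullet above describes a retry-on-CPU pattern. A minimal, torch-free sketch of that control flow could look like the following. `synthesize_with_fallback` and `fake_tts` are hypothetical names used only for illustration; the Space's own `_synthesize` appears in the `app_minimal.py` diff.

```python
from typing import Callable, Optional, Tuple


def synthesize_with_fallback(
    synth: Callable[[str, str], bytes],
    text: str,
    device: str,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Call synth(text, device); on AssertionError retry once on CPU.

    Returns (audio_or_None, error_or_None).
    """
    try:
        return synth(text, device), None
    except AssertionError as exc:
        # Already on CPU: there is nothing left to fall back to.
        if device == "cpu":
            return None, f"tts: {exc}"
        try:
            return synth(text, "cpu"), None
        except Exception as exc2:
            return None, f"tts: {exc2}"
    except Exception as exc:
        # Any other failure is reported without a retry.
        return None, f"tts: {exc}"


def fake_tts(text: str, device: str) -> bytes:
    """Hypothetical stand-in for a TTS backend on a CPU-only machine."""
    if device == "cuda":
        raise AssertionError("Torch not compiled with CUDA enabled")
    return f"{device}:{text}".encode()


audio, error = synthesize_with_fallback(fake_tts, "i ni ce", "cuda")
```

Returning the error as a string instead of raising keeps the caller's UI handler free of try/except blocks, which matches the "graceful degradation" style the handlers in this commit use.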

Files changed (3)

README.md +15 -1
app_minimal.py +274 -165
src/llm/minimal_client.py +15 -4

README.md CHANGED

@@ -70,6 +70,20 @@ Three stacked changes land dialect fidelity without any training:
    `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
    available on your HF account.
 See `docs/baseline_rebuild.md` for the broader minimal-track plan.
 ---
@@ -111,7 +125,7 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
 | File | Purpose | Lifecycle |
 |------|---------|-----------|
-| `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit. Tabs: Voice / Text. | `python app_minimal.py` |
 | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
 | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
 | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |

    `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
    available on your HF account.
+4. **Stage 4 — split translate / reply UI + per-turn telemetry + RAG few-shot.**
+   Both Voice and Text tabs use a 4-box layout: phrasebook translation (text
+   + audio) is automatic on submit (no LLM), and a separate **Generate reply**
+   button calls the dialect-anchored LLM for a conversational response. On a
+   phrasebook miss the LLM is RAG-injected with the top-3 nearest curated
+   pairs as additional style anchoring. Every turn is appended to
+   `data/field_turns.jsonl` (`src/engine/turn_logger.py`) with phase, latency
+   breakdown, phrasebook hit, and reply — the substrate for hit-rate
+   measurement, A/B comparisons, and eventual Stage-5 LoRA training-data
+   curation. The system prompt now also explicitly tells the LLM to **reply,
+   not translate** — the few-shot pairs are framed as style/orthography
+   references only, fixing the "the LLM just echoes the phrasebook target"
+   regression.
 See `docs/baseline_rebuild.md` for the broader minimal-track plan.
 ---
 | File | Purpose | Lifecycle |
 |------|---------|-----------|
+| `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit + RAG few-shot on miss + per-turn JSONL telemetry. Tabs: Voice / Text, each with split translation (phrasebook, automatic) and reply (LLM, on demand). | `python app_minimal.py` |
 | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
 | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
 | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
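
The Stage 4 README entry names `data/field_turns.jsonl` as the substrate for hit-rate measurement. As a rough sketch of what such a measurement might look like, the snippet below assumes each line is a JSON object whose keys mirror the keyword arguments passed to `_turn_logger.log` in this commit (`phase`, `phrasebook`, `error`). The real on-disk schema is defined in `src/engine/turn_logger.py`, which is not shown here, so treat the key names as assumptions. `phrasebook_hit_rate` is a hypothetical helper.

```python
import json
from typing import Iterable


def phrasebook_hit_rate(lines: Iterable[str]) -> float:
    """Fraction of error-free translate-phase turns that hit the phrasebook."""
    hits = total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Count only translate-phase turns that completed without an error.
        if row.get("phase") != "translate" or row.get("error"):
            continue
        total += 1
        if row.get("phrasebook"):
            hits += 1
    return hits / total if total else 0.0


# Synthetic rows for illustration only; not real telemetry.
sample = [
    json.dumps({"phase": "translate", "phrasebook": {"score": 1.0}, "error": None}),
    json.dumps({"phase": "translate", "phrasebook": None, "error": None}),
    json.dumps({"phase": "translate", "phrasebook": None, "error": "no_speech"}),
    json.dumps({"phase": "reply", "phrasebook": None, "error": None}),
]
rate = phrasebook_hit_rate(sample)
```

To run it against the logged file, pass an open file handle, for example `phrasebook_hit_rate(open("data/field_turns.jsonl", encoding="utf-8"))`.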

app_minimal.py CHANGED

@@ -79,11 +79,21 @@ _turn_logger: TurnLogger                = TurnLogger()
 def _resolve_device() -> str:
-    """Pick 'cuda' if torch sees a GPU, else 'cpu'. DEVICE env overrides."""
     import torch  # lazy
     if _REQUESTED_DEVICE:
         return _REQUESTED_DEVICE
-    return "cuda" if torch.cuda.is_available() else "cpu"
 def get_backbone() -> WhisperBackbone:
@@ -167,202 +177,246 @@ def transcribe(audio_np: np.ndarray, sample_rate: int, input_lang: str) -> str:
     return transcript
-def run_pipeline(
     audio: Optional[Tuple[int, np.ndarray]],
     input_lang: str,
     output_lang: str,
-) -> Tuple[str, str, Optional[Tuple[int, np.ndarray]]]:
-    """Gradio handler for the Voice tab.
-    Args:
-        audio:       (sample_rate, audio_np) from gr.Audio.
-        input_lang:  language of the spoken input (drives Whisper hint + bam_normalize).
-        output_lang: language the LLM should reply in and the TTS should speak.
-    Returns (transcript, reply_text, reply_audio). Graceful degradation: any
-    stage failure yields a readable string and None audio instead of raising.
     """
     import time
     t0 = time.perf_counter()
     if audio is None:
-        return "", "(no audio received)", None
     sample_rate, audio_np = audio
     if audio_np.size == 0:
-        return "", "(empty audio)", None
-    # ── 1. Transcribe ─────────────────────────────────────────────────────
     t_stt = time.perf_counter()
     try:
         transcript = transcribe(audio_np, sample_rate, input_lang)
-    except Exception as exc:  # pragma: no cover — field-safety
         logger.exception("Transcription failed")
         _turn_logger.log(
-            tab="voice", input_lang=input_lang, output_lang=output_lang,
             user_text=None, transcript=None, transcribe_ms=None,
             phrasebook=None, llm_model=None, llm_ms=None,
             reply_text=None, tts_ms=None,
             total_ms=int((time.perf_counter() - t0) * 1000),
             error=f"stt: {exc}",
         )
-        return "", f"(STT error: {exc})", None
     transcribe_ms = int((time.perf_counter() - t_stt) * 1000)
     if not transcript:
         _turn_logger.log(
-            tab="voice", input_lang=input_lang, output_lang=output_lang,
             user_text=None, transcript="", transcribe_ms=transcribe_ms,
             phrasebook=None, llm_model=None, llm_ms=None,
             reply_text=None, tts_ms=None,
             total_ms=int((time.perf_counter() - t0) * 1000),
             error="no_speech",
         )
-        return "", "(no speech detected)", None
-    # ── 2. Phrasebook → LLM (with RAG few-shot on miss) → reply ──────────
-    reply_text, hit, llm_ms = _resolve_reply(transcript, output_lang)
-    if reply_text is None:
-        _turn_logger.log(
-            tab="voice", input_lang=input_lang, output_lang=output_lang,
-            user_text=transcript, transcript=transcript,
-            transcribe_ms=transcribe_ms,
-            phrasebook=hit, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
-            reply_text=None, tts_ms=None,
-            total_ms=int((time.perf_counter() - t0) * 1000),
-            error="llm_failed",
-        )
-        return transcript, "(LLM error)", None
-    # ── 3. TTS ────────────────────────────────────────────────────────────
-    t_tts = time.perf_counter()
-    tts_ms: Optional[int] = None
-    audio_out: Optional[Tuple[int, np.ndarray]] = None
-    tts_error: Optional[str] = None
-    try:
-        wav, sr = get_tts().synthesize(
-            reply_text, language=output_lang, device=_resolve_device()
-        )
-        audio_out = (sr, wav)
-        tts_ms = int((time.perf_counter() - t_tts) * 1000)
-    except Exception as exc:
-        logger.exception("TTS failed")
-        tts_error = f"tts: {exc}"
     _turn_logger.log(
-        tab="voice", input_lang=input_lang, output_lang=output_lang,
         user_text=transcript, transcript=transcript,
         transcribe_ms=transcribe_ms,
-        phrasebook=hit,
-        llm_model=None if hit else LLM_MODEL_ID,
-        llm_ms=llm_ms,
-        reply_text=reply_text, tts_ms=tts_ms,
         total_ms=int((time.perf_counter() - t0) * 1000),
-        error=tts_error,
     )
-    return transcript, reply_text, audio_out
-def run_text_pipeline(
-    text: str,
     output_lang: str,
 ) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
-    """Gradio handler for the Text tab.
-    Args:
-        text:        typed user input.
-        output_lang: language the LLM should reply in and the TTS should speak.
-    No input-language param — typed input is whatever the user types; the LLM
-    reads it as-is and replies in `output_lang`. Skips Whisper entirely; this
-    is the fast dev-loop path.
-    """
     import time
     t0 = time.perf_counter()
-    text = (text or "").strip()
-    if not text:
-        return "(no text entered)", None
-    reply_text, hit, llm_ms = _resolve_reply(text, output_lang)
-    if reply_text is None:
-        _turn_logger.log(
-            tab="text", input_lang=None, output_lang=output_lang,
-            user_text=text, transcript=None, transcribe_ms=None,
-            phrasebook=hit, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
-            reply_text=None, tts_ms=None,
-            total_ms=int((time.perf_counter() - t0) * 1000),
-            error="llm_failed",
-        )
-        return "(LLM error)", None
-    t_tts = time.perf_counter()
-    tts_ms: Optional[int] = None
-    audio_out: Optional[Tuple[int, np.ndarray]] = None
-    tts_error: Optional[str] = None
-    try:
-        wav, sr = get_tts().synthesize(
-            reply_text, language=output_lang, device=_resolve_device()
-        )
-        audio_out = (sr, wav)
-        tts_ms = int((time.perf_counter() - t_tts) * 1000)
-    except Exception as exc:
-        logger.exception("TTS failed")
-        tts_error = f"tts: {exc}"
     _turn_logger.log(
-        tab="text", input_lang=None, output_lang=output_lang,
-        user_text=text, transcript=None, transcribe_ms=None,
-        phrasebook=hit,
-        llm_model=None if hit else LLM_MODEL_ID,
-        llm_ms=llm_ms,
-        reply_text=reply_text, tts_ms=tts_ms,
         total_ms=int((time.perf_counter() - t0) * 1000),
-        error=tts_error,
     )
-    return reply_text, audio_out
-def _resolve_reply(
-    user_text: str,
-    output_lang: str,
-) -> Tuple[Optional[str], Optional[dict], Optional[int]]:
-    """Shared phrasebook → LLM resolver for both voice and text tabs.
-    Returns (reply_text, phrasebook_hit_or_None, llm_ms_or_None).
-    `reply_text` is None only if the LLM itself failed; in every other case
-    the caller is given a usable string (possibly an "(empty reply)" sentinel).
-    On phrasebook miss for bam/ful targets, the top-3 nearest gold pairs are
-    injected into the LLM system prompt as additional dynamic few-shot
-    (RAG-style anchoring). Misses on en/fr targets call the LLM with no
-    extras since the curated phrasebooks only cover bam/ful.
-    """
-    import time
-    hit = phrasebook_lookup(user_text, output_lang)
-    if hit:
-        logger.info(
-            "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
-            hit["match"], hit["score"], user_text, hit["target"], hit["category"],
-        )
-        reply = hit["target"] or "(empty reply)"
-        return reply, hit, None
-    extras = phrasebook_top_k(user_text, output_lang, k=3) or None
-    if extras:
-        logger.info(
-            "Phrasebook miss; RAG-injecting top-%d nearest (top score=%.2f)",
-            len(extras), extras[0]["score"],
-        )
-    t_llm = time.perf_counter()
-    try:
-        reply = get_llm().chat(
-            user_text, target_lang=output_lang, extra_examples=extras,
-        )
-    except Exception as exc:  # pragma: no cover
-        logger.exception("LLM call failed")
-        return None, None, int((time.perf_counter() - t_llm) * 1000)
-    llm_ms = int((time.perf_counter() - t_llm) * 1000)
-    return (reply or "(empty reply)"), None, llm_ms
 # ── Gradio UI ────────────────────────────────────────────────────────────────
@@ -393,9 +447,14 @@ def build_ui():
                 info="Language the LLM should reply in. Also picks the TTS voice.",
             )
         with gr.Tabs():
             # ── Voice tab — the actual baseline the field test measures ─────
-            with gr.Tab("🎤 Voice (full STT → LLM → TTS)"):
                 with gr.Row():
                     with gr.Column():
                         audio_in = gr.Audio(
@@ -404,55 +463,100 @@ def build_ui():
                             label="Speak (or upload a .wav)",
                         )
                         voice_submit = gr.Button(
-                            "Transcribe + Reply", variant="primary"
                         )
-                    with gr.Column():
-                        transcript_out = gr.Textbox(
                             label="Transcript (zero-shot Whisper)",
                             lines=2, interactive=False,
                         )
                         voice_reply_out = gr.Textbox(
                             label="LLM reply", lines=4, interactive=False,
                         )
-                        voice_audio_out = gr.Audio(
                             label="Reply audio", type="numpy", autoplay=False,
                         )
                 voice_submit.click(
-                    fn=run_pipeline,
                     inputs=[audio_in, input_lang, output_lang],
-                    outputs=[transcript_out, voice_reply_out, voice_audio_out],
                 )
             # ── Text tab — dev loop, skips Whisper ──────────────────────────
-            with gr.Tab("⌨️ Text (LLM → TTS, dev loop)"):
                 with gr.Row():
                     with gr.Column():
                         text_in = gr.Textbox(
                             label="Type your message",
                             lines=3,
-                            placeholder="e.g. I ni ce — how do I say hello in Bambara?",
                         )
                         text_submit = gr.Button("Send", variant="primary")
                     with gr.Column():
                         text_reply_out = gr.Textbox(
                             label="LLM reply", lines=4, interactive=False,
                         )
-                        text_audio_out = gr.Audio(
                             label="Reply audio", type="numpy", autoplay=False,
                         )
                 # Text tab only uses output_lang — input_lang is a no-op here.
                 text_submit.click(
-                    fn=run_text_pipeline,
                     inputs=[text_in, output_lang],
-                    outputs=[text_reply_out, text_audio_out],
                 )
                 # Pressing Enter in the textbox also submits.
                 text_in.submit(
-                    fn=run_text_pipeline,
                     inputs=[text_in, output_lang],
-                    outputs=[text_reply_out, text_audio_out],
                 )
         gr.Markdown(
@@ -463,7 +567,12 @@ def build_ui():
             "stripped-down baseline used to measure what Whisper zero-shot does on "
             "real Bambara/Fula recordings and to collect a real-user eval set.\n\n"
             "The **Text** tab skips Whisper — it's for fast iteration on the "
-            "LLM + TTS path, not for field-test measurement."
         )
     return demo

 def _resolve_device() -> str:
+    """Pick 'cuda' if torch sees a GPU, else 'cpu'. DEVICE env overrides.
+    Some torch builds (CPU-only wheels) report `cuda.is_available() == True`
+    in error states; we additionally probe device_count and fall back to cpu
+    on any exception to keep the app usable on CPU-only laptops.
+    """
     import torch  # lazy
     if _REQUESTED_DEVICE:
         return _REQUESTED_DEVICE
+    try:
+        if torch.cuda.is_available() and torch.cuda.device_count() > 0:
+            return "cuda"
+    except Exception:
+        pass
+    return "cpu"
 def get_backbone() -> WhisperBackbone:
     return transcript
+NO_TRANSLATION = "(no curated translation — try Generate reply)"
+def _synthesize(text: str, output_lang: str
+                ) -> Tuple[Optional[Tuple[int, np.ndarray]], Optional[int], Optional[str]]:
+    """Run TTS on `text` in `output_lang`. Returns (audio_or_None, tts_ms, error)."""
+    import time
+    if not text:
+        return None, None, None
+    t = time.perf_counter()
+    device = _resolve_device()
+    try:
+        wav, sr = get_tts().synthesize(text, language=output_lang, device=device)
+        return (sr, wav), int((time.perf_counter() - t) * 1000), None
+    except AssertionError as exc:
+        # Most common: "Torch not compiled with CUDA enabled" on CPU-only boxes
+        # where is_available() lied. Retry once on CPU.
+        if device != "cpu":
+            logger.warning("TTS failed on %s (%s) — retrying on cpu", device, exc)
+            try:
+                wav, sr = get_tts().synthesize(text, language=output_lang, device="cpu")
+                return (sr, wav), int((time.perf_counter() - t) * 1000), None
+            except Exception as exc2:  # pragma: no cover
+                logger.exception("TTS failed on cpu fallback")
+                return None, None, f"tts: {exc2}"
+        logger.exception("TTS failed")
+        return None, None, f"tts: {exc}"
+    except Exception as exc:  # pragma: no cover
+        logger.exception("TTS failed")
+        return None, None, f"tts: {exc}"
+def _translate_only(user_text: str, output_lang: str
+                    ) -> Tuple[str, Optional[Tuple[int, np.ndarray]], Optional[dict], Optional[int]]:
+    """Phrasebook-only translation — never calls the LLM.
+    Returns (translation_text, translation_audio, hit_or_None, tts_ms).
+    On miss for bam/ful, returns NO_TRANSLATION and no audio.
+    For en/fr targets (no curated phrasebook), echoes the input as the
+    translation since the user likely wants to hear it spoken — TTS in that
+    language is still the right thing to play.
+    """
+    text = (user_text or "").strip()
+    if not text:
+        return "", None, None, None
+    hit = phrasebook_lookup(text, output_lang)
+    if hit:
+        logger.info(
+            "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
+            hit["match"], hit["score"], text, hit["target"], hit["category"],
+        )
+        target = hit["target"] or ""
+        audio, tts_ms, _ = _synthesize(target, output_lang)
+        return target, audio, hit, tts_ms
+    # No curated translation. For en/fr we still synthesize the input itself
+    # (the user can use the app as a TTS box). For bam/ful we surface the
+    # honest "no curated translation" sentinel — the user can then click
+    # "Generate reply" if they want the LLM to handle it.
+    if output_lang in ("en", "fr"):
+        audio, tts_ms, _ = _synthesize(text, output_lang)
+        return text, audio, None, tts_ms
+    return NO_TRANSLATION, None, None, None
+def _generate_reply(user_text: str, output_lang: str
+                    ) -> Tuple[str, Optional[Tuple[int, np.ndarray]], Optional[int], Optional[int], Optional[str]]:
+    """Dialect-anchored LLM reply (with RAG top-3 few-shot) + TTS.
+    Returns (reply_text, reply_audio, llm_ms, tts_ms, error).
+    Always returns a usable text string — even on LLM failure it returns a
+    short parenthetical so the UI never goes blank.
+    """
+    import time
+    text = (user_text or "").strip()
+    if not text:
+        return "(nothing to reply to)", None, None, None, None
+    extras = phrasebook_top_k(text, output_lang, k=3) or None
+    if extras:
+        logger.info(
+            "RAG-injecting top-%d nearest phrasebook entries (top score=%.2f)",
+            len(extras), extras[0]["score"],
+        )
+    t_llm = time.perf_counter()
+    try:
+        reply = get_llm().chat(
+            text, target_lang=output_lang, extra_examples=extras,
+        )
+    except Exception as exc:  # pragma: no cover
+        logger.exception("LLM call failed")
+        llm_ms = int((time.perf_counter() - t_llm) * 1000)
+        return f"(LLM error: {exc})", None, llm_ms, None, f"llm: {exc}"
+    llm_ms = int((time.perf_counter() - t_llm) * 1000)
+    reply = (reply or "").strip() or "(empty reply)"
+    audio, tts_ms, tts_error = _synthesize(reply, output_lang)
+    return reply, audio, llm_ms, tts_ms, tts_error
+# ── Tab handlers ─────────────────────────────────────────────────────────────
+def run_text_translate(
+    text: str,
+    output_lang: str,
+) -> Tuple[str, Optional[Tuple[int, np.ndarray]], str]:
+    """Text tab → Send: phrasebook-only translation. Always-on, no LLM.
+    Returns (translation_text, translation_audio, transcript_state).
+    `transcript_state` is the canonicalised input passed to the Generate-reply
+    button so it doesn't need to re-read the textbox.
+    """
+    import time
+    t0 = time.perf_counter()
+    text = (text or "").strip()
+    if not text:
+        return "(no text entered)", None, ""
+    translation, audio, hit, tts_ms = _translate_only(text, output_lang)
+    _turn_logger.log(
+        phase="translate", tab="text",
+        input_lang=None, output_lang=output_lang,
+        user_text=text, transcript=None, transcribe_ms=None,
+        phrasebook=hit, llm_model=None, llm_ms=None,
+        reply_text=translation, tts_ms=tts_ms,
+        total_ms=int((time.perf_counter() - t0) * 1000),
+        error=None,
+    )
+    return translation, audio, text
+def run_text_reply(
+    transcript_state: str,
+    output_lang: str,
+) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
+    """Text tab → Generate reply: dialect-anchored LLM + TTS."""
+    import time
+    t0 = time.perf_counter()
+    if not (transcript_state or "").strip():
+        return "(send a message first)", None
+    reply, audio, llm_ms, tts_ms, error = _generate_reply(
+        transcript_state, output_lang
+    )
+    _turn_logger.log(
+        phase="reply", tab="text",
+        input_lang=None, output_lang=output_lang,
+        user_text=transcript_state, transcript=None, transcribe_ms=None,
+        phrasebook=None, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
+        reply_text=reply, tts_ms=tts_ms,
+        total_ms=int((time.perf_counter() - t0) * 1000),
+        error=error,
+    )
+    return reply, audio
+def run_voice_translate(
     audio: Optional[Tuple[int, np.ndarray]],
     input_lang: str,
     output_lang: str,
+) -> Tuple[str, str, Optional[Tuple[int, np.ndarray]], str]:
+    """Voice tab → Submit: Whisper transcribe + phrasebook-only translation.
+    Returns (transcript, translation_text, translation_audio, transcript_state).
     """
     import time
     t0 = time.perf_counter()
     if audio is None:
+        return "", "(no audio received)", None, ""
     sample_rate, audio_np = audio
     if audio_np.size == 0:
+        return "", "(empty audio)", None, ""
     t_stt = time.perf_counter()
     try:
         transcript = transcribe(audio_np, sample_rate, input_lang)
+    except Exception as exc:  # pragma: no cover
         logger.exception("Transcription failed")
         _turn_logger.log(
+            phase="translate", tab="voice",
+            input_lang=input_lang, output_lang=output_lang,
             user_text=None, transcript=None, transcribe_ms=None,
             phrasebook=None, llm_model=None, llm_ms=None,
             reply_text=None, tts_ms=None,
             total_ms=int((time.perf_counter() - t0) * 1000),
             error=f"stt: {exc}",
         )
+        return "", f"(STT error: {exc})", None, ""
     transcribe_ms = int((time.perf_counter() - t_stt) * 1000)
     if not transcript:
         _turn_logger.log(
+            phase="translate", tab="voice",
+            input_lang=input_lang, output_lang=output_lang,
             user_text=None, transcript="", transcribe_ms=transcribe_ms,
             phrasebook=None, llm_model=None, llm_ms=None,
             reply_text=None, tts_ms=None,
             total_ms=int((time.perf_counter() - t0) * 1000),
             error="no_speech",
         )
+        return "", "(no speech detected)", None, ""
+    translation, t_audio, hit, tts_ms = _translate_only(transcript, output_lang)
     _turn_logger.log(
+        phase="translate", tab="voice",
+        input_lang=input_lang, output_lang=output_lang,
         user_text=transcript, transcript=transcript,
         transcribe_ms=transcribe_ms,
+        phrasebook=hit, llm_model=None, llm_ms=None,
+        reply_text=translation, tts_ms=tts_ms,
         total_ms=int((time.perf_counter() - t0) * 1000),
+        error=None,
     )
+    return transcript, translation, t_audio, transcript
+def run_voice_reply(
+    transcript_state: str,
     output_lang: str,
 ) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
+    """Voice tab → Generate reply: uses the stored transcript, no re-Whisper."""
     import time
     t0 = time.perf_counter()
+    if not (transcript_state or "").strip():
+        return "(record audio and submit first)", None
+    reply, audio, llm_ms, tts_ms, error = _generate_reply(
+        transcript_state, output_lang
+    )
     _turn_logger.log(
+        phase="reply", tab="voice",
+        input_lang=None, output_lang=output_lang,
+        user_text=transcript_state, transcript=transcript_state,
+        transcribe_ms=None,
+        phrasebook=None, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
+        reply_text=reply, tts_ms=tts_ms,
         total_ms=int((time.perf_counter() - t0) * 1000),
+        error=error,
     )
+    return reply, audio
 # ── Gradio UI ────────────────────────────────────────────────────────────────
                 info="Language the LLM should reply in. Also picks the TTS voice.",
             )
+        # Carries the canonical input (typed text, or Whisper transcript) from
+        # Submit/Send into the Generate-reply button so we don't re-transcribe
+        # or re-read the textbox.
+        transcript_state = gr.State("")
         with gr.Tabs():
             # ── Voice tab — the actual baseline the field test measures ─────
+            with gr.Tab("🎤 Voice (full STT → translation + optional reply)"):
                 with gr.Row():
                     with gr.Column():
                         audio_in = gr.Audio(
                             label="Speak (or upload a .wav)",
                         )
                         voice_submit = gr.Button(
+                            "Transcribe + translate", variant="primary"
                         )
+                        voice_transcript_out = gr.Textbox(
                             label="Transcript (zero-shot Whisper)",
                             lines=2, interactive=False,
                         )
+                    with gr.Column():
+                        voice_translation_out = gr.Textbox(
+                            label="Phrasebook translation",
+                            lines=3, interactive=False,
+                        )
+                        voice_translation_audio = gr.Audio(
+                            label="Translation audio",
+                            type="numpy", autoplay=False,
+                        )
+                        voice_reply_btn = gr.Button(
+                            "Generate reply (LLM)", variant="secondary"
+                        )
                         voice_reply_out = gr.Textbox(
                             label="LLM reply", lines=4, interactive=False,
                         )
+                        voice_reply_audio = gr.Audio(
                             label="Reply audio", type="numpy", autoplay=False,
                         )
                 voice_submit.click(
+                    fn=run_voice_translate,
                     inputs=[audio_in, input_lang, output_lang],
+                    outputs=[
+                        voice_transcript_out,
+                        voice_translation_out,
+                        voice_translation_audio,
+                        transcript_state,
+                    ],
+                )
+                voice_reply_btn.click(
+                    fn=run_voice_reply,
+                    inputs=[transcript_state, output_lang],
+                    outputs=[voice_reply_out, voice_reply_audio],
                 )
             # ── Text tab — dev loop, skips Whisper ──────────────────────────
+            with gr.Tab("⌨️ Text (translation + optional reply, dev loop)"):
                 with gr.Row():
                     with gr.Column():
                         text_in = gr.Textbox(
                             label="Type your message",
                             lines=3,
+                            placeholder="e.g. Good morning, how are you?",
                         )
                         text_submit = gr.Button("Send", variant="primary")
                     with gr.Column():
+                        text_translation_out = gr.Textbox(
+                            label="Phrasebook translation",
+                            lines=3, interactive=False,
+                        )
+                        text_translation_audio = gr.Audio(
+                            label="Translation audio",
+                            type="numpy", autoplay=False,
+                        )
+                        text_reply_btn = gr.Button(
+                            "Generate reply (LLM)", variant="secondary"
+                        )
                         text_reply_out = gr.Textbox(
                             label="LLM reply", lines=4, interactive=False,
                         )
+                        text_reply_audio = gr.Audio(
                             label="Reply audio", type="numpy", autoplay=False,
                         )
                 # Text tab only uses output_lang — input_lang is a no-op here.
                 text_submit.click(
+                    fn=run_text_translate,
                     inputs=[text_in, output_lang],
+                    outputs=[
+                        text_translation_out,
+                        text_translation_audio,
+                        transcript_state,
+                    ],
                 )
                 # Pressing Enter in the textbox also submits.
                 text_in.submit(
+                    fn=run_text_translate,
                     inputs=[text_in, output_lang],
+                    outputs=[
+                        text_translation_out,
+                        text_translation_audio,
+                        transcript_state,
+                    ],
+                )
+                text_reply_btn.click(
+                    fn=run_text_reply,
+                    inputs=[transcript_state, output_lang],
+                    outputs=[text_reply_out, text_reply_audio],
                 )
         gr.Markdown(
             "stripped-down baseline used to measure what Whisper zero-shot does on "
             "real Bambara/Fula recordings and to collect a real-user eval set.\n\n"
             "The **Text** tab skips Whisper — it's for fast iteration on the "
+            "LLM + TTS path, not for field-test measurement.\n\n"
+            "**How the two boxes differ:** the top pair is a phrasebook lookup "
+            "(no LLM, instant, gold-curated translation). If your input isn't "
+            "in the curated list you'll see *(no curated translation)* — click "
+            "**Generate reply** to get a dialect-anchored LLM response in the "
+            "bottom pair."
         )
     return demo
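
`_generate_reply` relies on `phrasebook_top_k(text, output_lang, k=3)` to pick the nearest curated pairs, but the diff does not show how similarity is scored. The sketch below is one plausible stand-in that uses `difflib.SequenceMatcher` from the standard library. The function name `top_k_pairs`, the scoring metric, and the placeholder targets are all assumptions, not the Space's implementation.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def top_k_pairs(query: str, pairs: List[Dict[str, str]], k: int = 3) -> List[Dict]:
    """Return the k pairs whose source text is most similar to query, best first."""
    q = query.strip().lower()
    scored = [
        {**p, "score": SequenceMatcher(None, q, p["source"].lower()).ratio()}
        for p in pairs
    ]
    scored.sort(key=lambda p: p["score"], reverse=True)
    return scored[:k]


# Placeholder targets: real entries would hold curated Bambara/Fula text.
PAIRS = [
    {"source": "good morning", "target": "<curated target 1>"},
    {"source": "good night", "target": "<curated target 2>"},
    {"source": "thank you", "target": "<curated target 3>"},
    {"source": "where is the market", "target": "<curated target 4>"},
]

extras = top_k_pairs("Good morning, how are you?", PAIRS, k=3)
```

Each returned dict carries `source`, `target`, and `score`, which is the shape the logging call in `_generate_reply` reads via `extras[0]["score"]`.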

src/llm/minimal_client.py CHANGED

@@ -94,6 +94,12 @@ def _build_system_prompt(
     lines: list[str] = [
         f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
         "",
         "Output format: plain natural text only. No JSON, no code fences, no "
         "markdown, no translations, no romanisation, no explanations. Reply in "
         "1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
@@ -114,8 +120,12 @@ def _build_system_prompt(
     if anchors:
         lines += [
             "",
-            f"Reference phrases in {full} — use this exact orthography, spelling, "
-            "and dialectal style as your model for every reply:",
         ]
         for item in anchors:
             src = item.get("source", "").strip()
@@ -127,8 +137,9 @@ def _build_system_prompt(
         lines += [
             "",
             "Additional reference phrases relevant to the current user input "
-            f"(curated gold {full} translations — use the same orthography and "
-            "style):",
         ]
         for item in extra_examples:
             src = (item.get("source") or "").strip()

     lines: list[str] = [
         f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
         "",
+        "Your task is to REPLY to the user's message as a person would in "
+        "conversation — NOT to translate it. If the user greets you, greet them "
+        "back and ask how they are. If they ask a question, answer it. If they "
+        "make a statement, respond appropriately. Never simply repeat or "
+        "translate what they said back to them.",
+        "",
         "Output format: plain natural text only. No JSON, no code fences, no "
         "markdown, no translations, no romanisation, no explanations. Reply in "
         "1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
     if anchors:
         lines += [
             "",
+            f"Reference phrases in {full} — these pairs are STYLE/ORTHOGRAPHY "
+            "examples ONLY (showing how English/French maps to the correct "
+            "dialect). Do NOT treat them as a translation task: when the user "
+            "writes one of these source phrases, do not just output its target "
+            "verbatim — instead REPLY conversationally in the same dialectal "
+            "style:",
         ]
         for item in anchors:
             src = item.get("source", "").strip()
         lines += [
             "",
             "Additional reference phrases relevant to the current user input "
+            f"(curated gold {full} translations — STYLE references only, not a "
+            "translation task; reply conversationally, do not echo the target "
+            "verbatim):",
         ]
         for item in extra_examples:
             src = (item.get("source") or "").strip()
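
The hunks above show only fragments of `_build_system_prompt`. To make the "reply, not translate" framing easier to see end to end, here is a condensed, self-contained sketch of how such a prompt could be assembled. `build_reply_prompt`, its signature, and the shortened wording are hypothetical; only the overall structure (role line, reply instruction, style-only reference list) follows the diff.

```python
from typing import Dict, List, Optional


def build_reply_prompt(
    language: str,
    examples: Optional[List[Dict[str, str]]] = None,
) -> str:
    """Assemble a system prompt that asks for a reply, with style-only examples."""
    lines: List[str] = [
        f"You are a warm, concise conversational assistant that replies ONLY in {language}.",
        "",
        "Your task is to REPLY to the user's message, NOT to translate it.",
    ]
    if examples:
        lines += [
            "",
            f"Reference phrases in {language} (STYLE references only; "
            "do not echo the target verbatim):",
        ]
        for item in examples:
            src = (item.get("source") or "").strip()
            tgt = (item.get("target") or "").strip()
            # Skip incomplete pairs so the model never sees a dangling source.
            if src and tgt:
                lines.append(f"- {src} -> {tgt}")
    return "\n".join(lines)


prompt = build_reply_prompt(
    "Bambara",
    [
        {"source": "hello", "target": "<curated target>"},
        {"source": "", "target": "ignored"},
    ],
)
```

Placing the reply instruction before the reference list means the model reads the task framing first and the examples second, which is the ordering the commit's hunks suggest.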