jefffffff9 Claude Opus 4.7 committed on
Commit 9e99c2c · 1 Parent(s): 064d08b

Stage 4: split translate/reply UI + CPU-safe TTS + reply-not-translate prompt

- 4-box layout on both Voice and Text tabs: phrasebook translation (text +
audio) is automatic on submit; "Generate reply" runs the dialect-anchored
LLM only when clicked.
- Shared gr.State carries the canonical input (typed text or Whisper
transcript) into the reply button so we never re-transcribe.
- Robust device resolution: probe cuda.device_count(), and have _synthesize
retry on CPU when the CUDA path raises (fixes "Torch not compiled with CUDA"
on CPU-only laptops).
- System prompt now explicitly tells the LLM to REPLY conversationally and
reframes the curated few-shot pairs as style/orthography references only,
fixing the regression where the model would echo the phrasebook target
verbatim instead of replying.
- README: add Stage 4 entry + update entry-points table.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
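
The CPU retry described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from the commit: `synthesize_with_cpu_fallback` and the `SynthFn` callable are hypothetical names, and the TTS engine is passed in as a plain function so the example runs without torch. Like `_synthesize` in the diff, it retries once on CPU only when the first attempt raises `AssertionError` on a non-CPU device.

```python
import logging
import time
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger(__name__)

# Hypothetical signature for illustration: (text, device) -> (waveform, sample_rate)
SynthFn = Callable[[str, str], Tuple[List[float], int]]


def synthesize_with_cpu_fallback(
    synth: SynthFn, text: str, device: str
) -> Tuple[Optional[Tuple[int, List[float]]], Optional[int], Optional[str]]:
    """Return (audio_or_None, elapsed_ms_or_None, error_or_None)."""
    if not text:
        return None, None, None
    start = time.perf_counter()
    try:
        wav, sr = synth(text, device)
    except AssertionError as exc:
        # Typical trigger: "Torch not compiled with CUDA enabled".
        if device == "cpu":
            return None, None, f"tts: {exc}"
        logger.warning("TTS failed on %s (%s), retrying on cpu", device, exc)
        try:
            wav, sr = synth(text, "cpu")
        except Exception as exc2:
            return None, None, f"tts: {exc2}"
    except Exception as exc:
        # Any other failure is reported, not retried.
        return None, None, f"tts: {exc}"
    elapsed_ms = int((time.perf_counter() - start) * 1000)
    return (sr, wav), elapsed_ms, None
```

Returning an error string instead of raising keeps the UI handlers simple: they can log the error and still show the text result when audio is missing.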

Files changed (3)
  1. README.md +15 -1
  2. app_minimal.py +274 -165
  3. src/llm/minimal_client.py +15 -4
README.md CHANGED
@@ -70,6 +70,20 @@ Three stacked changes land dialect fidelity without any training:
  `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
  available on your HF account.

  See `docs/baseline_rebuild.md` for the broader minimal-track plan.

  ---
@@ -111,7 +125,7 @@ See `docs/roadmap_2026-04.md` for the full plan and `docs/baseline_rebuild.md` f
111
 
112
  | File | Purpose | Lifecycle |
113
  |------|---------|-----------|
114
- | `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit. Tabs: Voice / Text. | `python app_minimal.py` |
115
  | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
116
  | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
117
  | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
 
70
  `Qwen/Qwen2.5-72B-Instruct`) if Cohere's inference provider is not
71
  available on your HF account.
72
 
73
+ 4. **Stage 4 — split translate / reply UI + per-turn telemetry + RAG few-shot.**
74
+ Both Voice and Text tabs use a 4-box layout: phrasebook translation (text
75
+ + audio) is automatic on submit (no LLM), and a separate **Generate reply**
76
+ button calls the dialect-anchored LLM for a conversational response. On a
77
+ phrasebook miss the LLM is RAG-injected with the top-3 nearest curated
78
+ pairs as additional style anchoring. Every turn is appended to
79
+ `data/field_turns.jsonl` (`src/engine/turn_logger.py`) with phase, latency
80
+ breakdown, phrasebook hit, and reply — the substrate for hit-rate
81
+ measurement, A/B comparisons, and eventual Stage-5 LoRA training-data
82
+ curation. The system prompt now also explicitly tells the LLM to **reply,
83
+ not translate** — the few-shot pairs are framed as style/orthography
84
+ references only, fixing the "the LLM just echoes the phrasebook target"
85
+ regression.
86
+
87
  See `docs/baseline_rebuild.md` for the broader minimal-track plan.
88
 
89
  ---
 
125
 
126
  | File | Purpose | Lifecycle |
127
  |------|---------|-----------|
128
+ | `app_minimal.py` | **Minimal baseline Gradio UI** — what the HF Space currently serves. Whisper → LLM → MMS-TTS with dialect-pinned prompts + curated phrasebook short-circuit + RAG few-shot on miss + per-turn JSONL telemetry. Tabs: Voice / Text, each with split translation (phrasebook, automatic) and reply (LLM, on demand). | `python app_minimal.py` |
129
  | `app.py` | **Full production Gradio UI** (not currently served on the Space). Single-file (~99 KB) by design. Tabs: Conversation / Teaching / Knowledge Base / Self-Teaching. | `python app.py` |
130
  | `app_lab.py` | **Experimental Gradio UI** for prototyping (e.g. `CuriosityEngine`) before folding into `app.py`. | `python app_lab.py` |
131
  | `src/api/app.py` | **FastAPI service** — loads Whisper once, registers `bam`/`ful` adapters via `AdapterManager`, preloads `bam`, attaches `Transcriber` + `SensorBridge` to `app.state`. | `python scripts/run_server.py` |
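
The per-turn JSONL telemetry mentioned in the Stage 4 README entry could look something like the sketch below. It is an assumption-laden illustration: the real `TurnLogger` in `src/engine/turn_logger.py` is not shown in this diff, so `append_turn` and `phrasebook_hit_rate` are hypothetical helpers, and only the field names (`phase`, `tab`, `phrasebook`) are taken from the logging calls in `app_minimal.py`.

```python
import json
import tempfile
import time
from pathlib import Path
from typing import Any, Optional


def append_turn(path: Path, *, phase: str, tab: str, **fields: Any) -> dict:
    """Append one turn as a single JSON line and return the record."""
    record = {"ts": time.time(), "phase": phase, "tab": tab, **fields}
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def phrasebook_hit_rate(path: Path) -> Optional[float]:
    """Share of translate-phase turns whose `phrasebook` field is non-null."""
    hits = total = 0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            row = json.loads(line)
            if row.get("phase") != "translate":
                continue  # reply-phase turns never consult the phrasebook
            total += 1
            hits += row.get("phrasebook") is not None
    return hits / total if total else None


# Usage: write three turns to a throwaway file and measure the hit rate.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "field_turns.jsonl"
    append_turn(log_path, phase="translate", tab="text", phrasebook={"score": 1.0})
    append_turn(log_path, phase="translate", tab="text", phrasebook=None)
    append_turn(log_path, phase="reply", tab="text", phrasebook=None)
    demo_rate = phrasebook_hit_rate(log_path)
```

One record per line keeps appends cheap and lets hit-rate or latency analysis stream the file without loading it whole.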
app_minimal.py CHANGED
@@ -79,11 +79,21 @@ _turn_logger: TurnLogger = TurnLogger()
79
 
80
 
81
  def _resolve_device() -> str:
82
- """Pick 'cuda' if torch sees a GPU, else 'cpu'. DEVICE env overrides."""
83
  import torch # lazy
84
  if _REQUESTED_DEVICE:
85
  return _REQUESTED_DEVICE
86
- return "cuda" if torch.cuda.is_available() else "cpu"
87
 
88
 
89
  def get_backbone() -> WhisperBackbone:
@@ -167,202 +177,246 @@ def transcribe(audio_np: np.ndarray, sample_rate: int, input_lang: str) -> str:
167
  return transcript
168
 
169
 
170
- def run_pipeline(
171
  audio: Optional[Tuple[int, np.ndarray]],
172
  input_lang: str,
173
  output_lang: str,
174
- ) -> Tuple[str, str, Optional[Tuple[int, np.ndarray]]]:
175
- """Gradio handler for the Voice tab.
176
 
177
- Args:
178
- audio: (sample_rate, audio_np) from gr.Audio.
179
- input_lang: language of the spoken input (drives Whisper hint + bam_normalize).
180
- output_lang: language the LLM should reply in and the TTS should speak.
181
-
182
- Returns (transcript, reply_text, reply_audio). Graceful degradation: any
183
- stage failure yields a readable string and None audio instead of raising.
184
  """
185
  import time
186
  t0 = time.perf_counter()
187
  if audio is None:
188
- return "", "(no audio received)", None
189
-
190
  sample_rate, audio_np = audio
191
  if audio_np.size == 0:
192
- return "", "(empty audio)", None
193
 
194
- # ── 1. Transcribe ─────────────────────────────────────────────────────
195
  t_stt = time.perf_counter()
196
  try:
197
  transcript = transcribe(audio_np, sample_rate, input_lang)
198
- except Exception as exc: # pragma: no cover — field-safety
199
  logger.exception("Transcription failed")
200
  _turn_logger.log(
201
- tab="voice", input_lang=input_lang, output_lang=output_lang,
 
202
  user_text=None, transcript=None, transcribe_ms=None,
203
  phrasebook=None, llm_model=None, llm_ms=None,
204
  reply_text=None, tts_ms=None,
205
  total_ms=int((time.perf_counter() - t0) * 1000),
206
  error=f"stt: {exc}",
207
  )
208
- return "", f"(STT error: {exc})", None
209
  transcribe_ms = int((time.perf_counter() - t_stt) * 1000)
210
 
211
  if not transcript:
212
  _turn_logger.log(
213
- tab="voice", input_lang=input_lang, output_lang=output_lang,
 
214
  user_text=None, transcript="", transcribe_ms=transcribe_ms,
215
  phrasebook=None, llm_model=None, llm_ms=None,
216
  reply_text=None, tts_ms=None,
217
  total_ms=int((time.perf_counter() - t0) * 1000),
218
  error="no_speech",
219
  )
220
- return "", "(no speech detected)", None
221
-
222
- # ── 2. Phrasebook → LLM (with RAG few-shot on miss) → reply ──────────
223
- reply_text, hit, llm_ms = _resolve_reply(transcript, output_lang)
224
- if reply_text is None:
225
- _turn_logger.log(
226
- tab="voice", input_lang=input_lang, output_lang=output_lang,
227
- user_text=transcript, transcript=transcript,
228
- transcribe_ms=transcribe_ms,
229
- phrasebook=hit, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
230
- reply_text=None, tts_ms=None,
231
- total_ms=int((time.perf_counter() - t0) * 1000),
232
- error="llm_failed",
233
- )
234
- return transcript, "(LLM error)", None
235
-
236
- # ── 3. TTS ────────────────────────────────────────────────────────────
237
- t_tts = time.perf_counter()
238
- tts_ms: Optional[int] = None
239
- audio_out: Optional[Tuple[int, np.ndarray]] = None
240
- tts_error: Optional[str] = None
241
- try:
242
- wav, sr = get_tts().synthesize(
243
- reply_text, language=output_lang, device=_resolve_device()
244
- )
245
- audio_out = (sr, wav)
246
- tts_ms = int((time.perf_counter() - t_tts) * 1000)
247
- except Exception as exc:
248
- logger.exception("TTS failed")
249
- tts_error = f"tts: {exc}"
250
 
 
251
  _turn_logger.log(
252
- tab="voice", input_lang=input_lang, output_lang=output_lang,
 
253
  user_text=transcript, transcript=transcript,
254
  transcribe_ms=transcribe_ms,
255
- phrasebook=hit,
256
- llm_model=None if hit else LLM_MODEL_ID,
257
- llm_ms=llm_ms,
258
- reply_text=reply_text, tts_ms=tts_ms,
259
  total_ms=int((time.perf_counter() - t0) * 1000),
260
- error=tts_error,
261
  )
262
- return transcript, reply_text, audio_out
263
 
264
 
265
- def run_text_pipeline(
266
- text: str,
267
  output_lang: str,
268
  ) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
269
- """Gradio handler for the Text tab.
270
-
271
- Args:
272
- text: typed user input.
273
- output_lang: language the LLM should reply in and the TTS should speak.
274
-
275
- No input-language param — typed input is whatever the user types; the LLM
276
- reads it as-is and replies in `output_lang`. Skips Whisper entirely; this
277
- is the fast dev-loop path.
278
- """
279
  import time
280
  t0 = time.perf_counter()
281
- text = (text or "").strip()
282
- if not text:
283
- return "(no text entered)", None
284
-
285
- reply_text, hit, llm_ms = _resolve_reply(text, output_lang)
286
- if reply_text is None:
287
- _turn_logger.log(
288
- tab="text", input_lang=None, output_lang=output_lang,
289
- user_text=text, transcript=None, transcribe_ms=None,
290
- phrasebook=hit, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
291
- reply_text=None, tts_ms=None,
292
- total_ms=int((time.perf_counter() - t0) * 1000),
293
- error="llm_failed",
294
- )
295
- return "(LLM error)", None
296
-
297
- t_tts = time.perf_counter()
298
- tts_ms: Optional[int] = None
299
- audio_out: Optional[Tuple[int, np.ndarray]] = None
300
- tts_error: Optional[str] = None
301
- try:
302
- wav, sr = get_tts().synthesize(
303
- reply_text, language=output_lang, device=_resolve_device()
304
- )
305
- audio_out = (sr, wav)
306
- tts_ms = int((time.perf_counter() - t_tts) * 1000)
307
- except Exception as exc:
308
- logger.exception("TTS failed")
309
- tts_error = f"tts: {exc}"
310
 
 
 
 
311
  _turn_logger.log(
312
- tab="text", input_lang=None, output_lang=output_lang,
313
- user_text=text, transcript=None, transcribe_ms=None,
314
- phrasebook=hit,
315
- llm_model=None if hit else LLM_MODEL_ID,
316
- llm_ms=llm_ms,
317
- reply_text=reply_text, tts_ms=tts_ms,
318
  total_ms=int((time.perf_counter() - t0) * 1000),
319
- error=tts_error,
320
  )
321
- return reply_text, audio_out
322
-
323
-
324
- def _resolve_reply(
325
- user_text: str,
326
- output_lang: str,
327
- ) -> Tuple[Optional[str], Optional[dict], Optional[int]]:
328
- """Shared phrasebook → LLM resolver for both voice and text tabs.
329
-
330
- Returns (reply_text, phrasebook_hit_or_None, llm_ms_or_None).
331
- `reply_text` is None only if the LLM itself failed; in every other case
332
- the caller is given a usable string (possibly an "(empty reply)" sentinel).
333
-
334
- On phrasebook miss for bam/ful targets, the top-3 nearest gold pairs are
335
- injected into the LLM system prompt as additional dynamic few-shot
336
- (RAG-style anchoring). Misses on en/fr targets call the LLM with no
337
- extras since the curated phrasebooks only cover bam/ful.
338
- """
339
- import time
340
- hit = phrasebook_lookup(user_text, output_lang)
341
- if hit:
342
- logger.info(
343
- "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
344
- hit["match"], hit["score"], user_text, hit["target"], hit["category"],
345
- )
346
- reply = hit["target"] or "(empty reply)"
347
- return reply, hit, None
348
-
349
- extras = phrasebook_top_k(user_text, output_lang, k=3) or None
350
- if extras:
351
- logger.info(
352
- "Phrasebook miss; RAG-injecting top-%d nearest (top score=%.2f)",
353
- len(extras), extras[0]["score"],
354
- )
355
-
356
- t_llm = time.perf_counter()
357
- try:
358
- reply = get_llm().chat(
359
- user_text, target_lang=output_lang, extra_examples=extras,
360
- )
361
- except Exception as exc: # pragma: no cover
362
- logger.exception("LLM call failed")
363
- return None, None, int((time.perf_counter() - t_llm) * 1000)
364
- llm_ms = int((time.perf_counter() - t_llm) * 1000)
365
- return (reply or "(empty reply)"), None, llm_ms
366
 
367
 
368
  # ── Gradio UI ────────────────────────────────────────────────────────────────
@@ -393,9 +447,14 @@ def build_ui():
393
  info="Language the LLM should reply in. Also picks the TTS voice.",
394
  )
395
 
 
 
 
 
 
396
  with gr.Tabs():
397
  # ── Voice tab — the actual baseline the field test measures ─────
398
- with gr.Tab("🎤 Voice (full STT → LLM TTS)"):
399
  with gr.Row():
400
  with gr.Column():
401
  audio_in = gr.Audio(
@@ -404,55 +463,100 @@ def build_ui():
404
  label="Speak (or upload a .wav)",
405
  )
406
  voice_submit = gr.Button(
407
- "Transcribe + Reply", variant="primary"
408
  )
409
- with gr.Column():
410
- transcript_out = gr.Textbox(
411
  label="Transcript (zero-shot Whisper)",
412
  lines=2, interactive=False,
413
  )
 
 
 
 
 
 
 
 
 
 
 
 
414
  voice_reply_out = gr.Textbox(
415
  label="LLM reply", lines=4, interactive=False,
416
  )
417
- voice_audio_out = gr.Audio(
418
  label="Reply audio", type="numpy", autoplay=False,
419
  )
420
 
421
  voice_submit.click(
422
- fn=run_pipeline,
423
  inputs=[audio_in, input_lang, output_lang],
424
- outputs=[transcript_out, voice_reply_out, voice_audio_out],
 
 
 
 
 
 
 
 
 
 
425
  )
426
 
427
  # ── Text tab — dev loop, skips Whisper ──────────────────────────
428
- with gr.Tab("⌨️ Text (LLM TTS, dev loop)"):
429
  with gr.Row():
430
  with gr.Column():
431
  text_in = gr.Textbox(
432
  label="Type your message",
433
  lines=3,
434
- placeholder="e.g. I ni ce — how do I say hello in Bambara?",
435
  )
436
  text_submit = gr.Button("Send", variant="primary")
437
  with gr.Column():
 
 
 
 
 
 
 
 
 
 
 
438
  text_reply_out = gr.Textbox(
439
  label="LLM reply", lines=4, interactive=False,
440
  )
441
- text_audio_out = gr.Audio(
442
  label="Reply audio", type="numpy", autoplay=False,
443
  )
444
 
445
  # Text tab only uses output_lang — input_lang is a no-op here.
446
  text_submit.click(
447
- fn=run_text_pipeline,
448
  inputs=[text_in, output_lang],
449
- outputs=[text_reply_out, text_audio_out],
 
 
 
 
450
  )
451
  # Pressing Enter in the textbox also submits.
452
  text_in.submit(
453
- fn=run_text_pipeline,
454
  inputs=[text_in, output_lang],
455
- outputs=[text_reply_out, text_audio_out],
 
 
 
 
 
 
 
 
 
456
  )
457
 
458
  gr.Markdown(
@@ -463,7 +567,12 @@ def build_ui():
463
  "stripped-down baseline used to measure what Whisper zero-shot does on "
464
  "real Bambara/Fula recordings and to collect a real-user eval set.\n\n"
465
  "The **Text** tab skips Whisper — it's for fast iteration on the "
466
- "LLM + TTS path, not for field-test measurement."
 
 
 
 
 
467
  )
468
 
469
  return demo
 


 def _resolve_device() -> str:
+    """Pick 'cuda' if torch sees a GPU, else 'cpu'. DEVICE env overrides.
+
+    Some torch builds (CPU-only wheels) report `cuda.is_available() == True`
+    in error states; we additionally probe device_count and fall back to cpu
+    on any exception to keep the app usable on CPU-only laptops.
+    """
     import torch  # lazy
     if _REQUESTED_DEVICE:
         return _REQUESTED_DEVICE
+    try:
+        if torch.cuda.is_available() and torch.cuda.device_count() > 0:
+            return "cuda"
+    except Exception:
+        pass
+    return "cpu"


 def get_backbone() -> WhisperBackbone:
 
177
  return transcript
178
 
179
 
180
+ NO_TRANSLATION = "(no curated translation — try Generate reply)"
181
+
182
+
183
+ def _synthesize(text: str, output_lang: str
184
+ ) -> Tuple[Optional[Tuple[int, np.ndarray]], Optional[int], Optional[str]]:
185
+ """Run TTS on `text` in `output_lang`. Returns (audio_or_None, tts_ms, error)."""
186
+ import time
187
+ if not text:
188
+ return None, None, None
189
+ t = time.perf_counter()
190
+ device = _resolve_device()
191
+ try:
192
+ wav, sr = get_tts().synthesize(text, language=output_lang, device=device)
193
+ return (sr, wav), int((time.perf_counter() - t) * 1000), None
194
+ except AssertionError as exc:
195
+ # Most common: "Torch not compiled with CUDA enabled" on CPU-only boxes
196
+ # where is_available() lied. Retry once on CPU.
197
+ if device != "cpu":
198
+ logger.warning("TTS failed on %s (%s) — retrying on cpu", device, exc)
199
+ try:
200
+ wav, sr = get_tts().synthesize(text, language=output_lang, device="cpu")
201
+ return (sr, wav), int((time.perf_counter() - t) * 1000), None
202
+ except Exception as exc2: # pragma: no cover
203
+ logger.exception("TTS failed on cpu fallback")
204
+ return None, None, f"tts: {exc2}"
205
+ logger.exception("TTS failed")
206
+ return None, None, f"tts: {exc}"
207
+ except Exception as exc: # pragma: no cover
208
+ logger.exception("TTS failed")
209
+ return None, None, f"tts: {exc}"
210
+
211
+
212
+ def _translate_only(user_text: str, output_lang: str
213
+ ) -> Tuple[str, Optional[Tuple[int, np.ndarray]], Optional[dict], Optional[int]]:
214
+ """Phrasebook-only translation — never calls the LLM.
215
+
216
+ Returns (translation_text, translation_audio, hit_or_None, tts_ms).
217
+ On miss for bam/ful, returns NO_TRANSLATION and no audio.
218
+ For en/fr targets (no curated phrasebook), echoes the input as the
219
+ translation since the user likely wants to hear it spoken — TTS in that
220
+ language is still the right thing to play.
221
+ """
222
+ text = (user_text or "").strip()
223
+ if not text:
224
+ return "", None, None, None
225
+
226
+ hit = phrasebook_lookup(text, output_lang)
227
+ if hit:
228
+ logger.info(
229
+ "Phrasebook hit (%s, score=%.2f): %r → %r [cat=%s]",
230
+ hit["match"], hit["score"], text, hit["target"], hit["category"],
231
+ )
232
+ target = hit["target"] or ""
233
+ audio, tts_ms, _ = _synthesize(target, output_lang)
234
+ return target, audio, hit, tts_ms
235
+
236
+ # No curated translation. For en/fr we still synthesize the input itself
237
+ # (the user can use the app as a TTS box). For bam/ful we surface the
238
+ # honest "no curated translation" sentinel — the user can then click
239
+ # "Generate reply" if they want the LLM to handle it.
240
+ if output_lang in ("en", "fr"):
241
+ audio, tts_ms, _ = _synthesize(text, output_lang)
242
+ return text, audio, None, tts_ms
243
+ return NO_TRANSLATION, None, None, None
244
+
245
+
246
+ def _generate_reply(user_text: str, output_lang: str
247
+ ) -> Tuple[str, Optional[Tuple[int, np.ndarray]], Optional[int], Optional[int], Optional[str]]:
248
+ """Dialect-anchored LLM reply (with RAG top-3 few-shot) + TTS.
249
+
250
+ Returns (reply_text, reply_audio, llm_ms, tts_ms, error).
251
+ Always returns a usable text string — even on LLM failure it returns a
252
+ short parenthetical so the UI never goes blank.
253
+ """
254
+ import time
255
+ text = (user_text or "").strip()
256
+ if not text:
257
+ return "(nothing to reply to)", None, None, None, None
258
+
259
+ extras = phrasebook_top_k(text, output_lang, k=3) or None
260
+ if extras:
261
+ logger.info(
262
+ "RAG-injecting top-%d nearest phrasebook entries (top score=%.2f)",
263
+ len(extras), extras[0]["score"],
264
+ )
265
+
266
+ t_llm = time.perf_counter()
267
+ try:
268
+ reply = get_llm().chat(
269
+ text, target_lang=output_lang, extra_examples=extras,
270
+ )
271
+ except Exception as exc: # pragma: no cover
272
+ logger.exception("LLM call failed")
273
+ llm_ms = int((time.perf_counter() - t_llm) * 1000)
274
+ return f"(LLM error: {exc})", None, llm_ms, None, f"llm: {exc}"
275
+ llm_ms = int((time.perf_counter() - t_llm) * 1000)
276
+ reply = (reply or "").strip() or "(empty reply)"
277
+ audio, tts_ms, tts_error = _synthesize(reply, output_lang)
278
+ return reply, audio, llm_ms, tts_ms, tts_error
279
+
280
+
281
+ # ── Tab handlers ─────────────────────────────────────────────────────────────
282
+ def run_text_translate(
283
+ text: str,
284
+ output_lang: str,
285
+ ) -> Tuple[str, Optional[Tuple[int, np.ndarray]], str]:
286
+ """Text tab → Send: phrasebook-only translation. Always-on, no LLM.
287
+
288
+ Returns (translation_text, translation_audio, transcript_state).
289
+ `transcript_state` is the canonicalised input passed to the Generate-reply
290
+ button so it doesn't need to re-read the textbox.
291
+ """
292
+ import time
293
+ t0 = time.perf_counter()
294
+ text = (text or "").strip()
295
+ if not text:
296
+ return "(no text entered)", None, ""
297
+
298
+ translation, audio, hit, tts_ms = _translate_only(text, output_lang)
299
+ _turn_logger.log(
300
+ phase="translate", tab="text",
301
+ input_lang=None, output_lang=output_lang,
302
+ user_text=text, transcript=None, transcribe_ms=None,
303
+ phrasebook=hit, llm_model=None, llm_ms=None,
304
+ reply_text=translation, tts_ms=tts_ms,
305
+ total_ms=int((time.perf_counter() - t0) * 1000),
306
+ error=None,
307
+ )
308
+ return translation, audio, text
309
+
310
+
311
+ def run_text_reply(
312
+ transcript_state: str,
313
+ output_lang: str,
314
+ ) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
315
+ """Text tab → Generate reply: dialect-anchored LLM + TTS."""
316
+ import time
317
+ t0 = time.perf_counter()
318
+ if not (transcript_state or "").strip():
319
+ return "(send a message first)", None
320
+
321
+ reply, audio, llm_ms, tts_ms, error = _generate_reply(
322
+ transcript_state, output_lang
323
+ )
324
+ _turn_logger.log(
325
+ phase="reply", tab="text",
326
+ input_lang=None, output_lang=output_lang,
327
+ user_text=transcript_state, transcript=None, transcribe_ms=None,
328
+ phrasebook=None, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
329
+ reply_text=reply, tts_ms=tts_ms,
330
+ total_ms=int((time.perf_counter() - t0) * 1000),
331
+ error=error,
332
+ )
333
+ return reply, audio
334
+
335
+
336
+ def run_voice_translate(
337
  audio: Optional[Tuple[int, np.ndarray]],
338
  input_lang: str,
339
  output_lang: str,
340
+ ) -> Tuple[str, str, Optional[Tuple[int, np.ndarray]], str]:
341
+ """Voice tab Submit: Whisper transcribe + phrasebook-only translation.
342
 
343
+ Returns (transcript, translation_text, translation_audio, transcript_state).
344
  """
345
  import time
346
  t0 = time.perf_counter()
347
  if audio is None:
348
+ return "", "(no audio received)", None, ""
 
349
  sample_rate, audio_np = audio
350
  if audio_np.size == 0:
351
+ return "", "(empty audio)", None, ""
352
 
 
353
  t_stt = time.perf_counter()
354
  try:
355
  transcript = transcribe(audio_np, sample_rate, input_lang)
356
+ except Exception as exc: # pragma: no cover
357
  logger.exception("Transcription failed")
358
  _turn_logger.log(
359
+ phase="translate", tab="voice",
360
+ input_lang=input_lang, output_lang=output_lang,
361
  user_text=None, transcript=None, transcribe_ms=None,
362
  phrasebook=None, llm_model=None, llm_ms=None,
363
  reply_text=None, tts_ms=None,
364
  total_ms=int((time.perf_counter() - t0) * 1000),
365
  error=f"stt: {exc}",
366
  )
367
+ return "", f"(STT error: {exc})", None, ""
368
  transcribe_ms = int((time.perf_counter() - t_stt) * 1000)
369
 
370
  if not transcript:
371
  _turn_logger.log(
372
+ phase="translate", tab="voice",
373
+ input_lang=input_lang, output_lang=output_lang,
374
  user_text=None, transcript="", transcribe_ms=transcribe_ms,
375
  phrasebook=None, llm_model=None, llm_ms=None,
376
  reply_text=None, tts_ms=None,
377
  total_ms=int((time.perf_counter() - t0) * 1000),
378
  error="no_speech",
379
  )
380
+ return "", "(no speech detected)", None, ""
381
 
382
+ translation, t_audio, hit, tts_ms = _translate_only(transcript, output_lang)
383
  _turn_logger.log(
384
+ phase="translate", tab="voice",
385
+ input_lang=input_lang, output_lang=output_lang,
386
  user_text=transcript, transcript=transcript,
387
  transcribe_ms=transcribe_ms,
388
+ phrasebook=hit, llm_model=None, llm_ms=None,
389
+ reply_text=translation, tts_ms=tts_ms,
 
 
390
  total_ms=int((time.perf_counter() - t0) * 1000),
391
+ error=None,
392
  )
393
+ return transcript, translation, t_audio, transcript
394
 
395
 
396
+ def run_voice_reply(
397
+ transcript_state: str,
398
  output_lang: str,
399
  ) -> Tuple[str, Optional[Tuple[int, np.ndarray]]]:
400
+ """Voice tab Generate reply: uses the stored transcript, no re-Whisper."""
401
  import time
402
  t0 = time.perf_counter()
403
+ if not (transcript_state or "").strip():
404
+ return "(record audio and submit first)", None
405
 
406
+ reply, audio, llm_ms, tts_ms, error = _generate_reply(
407
+ transcript_state, output_lang
408
+ )
409
  _turn_logger.log(
410
+ phase="reply", tab="voice",
411
+ input_lang=None, output_lang=output_lang,
412
+ user_text=transcript_state, transcript=transcript_state,
413
+ transcribe_ms=None,
414
+ phrasebook=None, llm_model=LLM_MODEL_ID, llm_ms=llm_ms,
415
+ reply_text=reply, tts_ms=tts_ms,
416
  total_ms=int((time.perf_counter() - t0) * 1000),
417
+ error=error,
418
  )
419
+ return reply, audio
420
 
421
 
422
  # ── Gradio UI ────────────────────────────────────────────────────────────────
 
447
  info="Language the LLM should reply in. Also picks the TTS voice.",
448
  )
449
 
450
+ # Carries the canonical input (typed text, or Whisper transcript) from
451
+ # Submit/Send into the Generate-reply button so we don't re-transcribe
452
+ # or re-read the textbox.
453
+ transcript_state = gr.State("")
454
+
455
  with gr.Tabs():
456
  # ── Voice tab — the actual baseline the field test measures ─────
457
+ with gr.Tab("🎤 Voice (full STT → translation + optional reply)"):
458
  with gr.Row():
459
  with gr.Column():
460
  audio_in = gr.Audio(
 
463
  label="Speak (or upload a .wav)",
464
  )
465
  voice_submit = gr.Button(
466
+ "Transcribe + translate", variant="primary"
467
  )
468
+ voice_transcript_out = gr.Textbox(
 
469
  label="Transcript (zero-shot Whisper)",
470
  lines=2, interactive=False,
471
  )
472
+ with gr.Column():
473
+ voice_translation_out = gr.Textbox(
474
+ label="Phrasebook translation",
475
+ lines=3, interactive=False,
476
+ )
477
+ voice_translation_audio = gr.Audio(
478
+ label="Translation audio",
479
+ type="numpy", autoplay=False,
480
+ )
481
+ voice_reply_btn = gr.Button(
482
+ "Generate reply (LLM)", variant="secondary"
483
+ )
484
  voice_reply_out = gr.Textbox(
485
  label="LLM reply", lines=4, interactive=False,
486
  )
487
+ voice_reply_audio = gr.Audio(
488
  label="Reply audio", type="numpy", autoplay=False,
489
  )
490
 
491
  voice_submit.click(
492
+ fn=run_voice_translate,
493
  inputs=[audio_in, input_lang, output_lang],
494
+ outputs=[
495
+ voice_transcript_out,
496
+ voice_translation_out,
497
+ voice_translation_audio,
498
+ transcript_state,
499
+ ],
500
+ )
501
+ voice_reply_btn.click(
502
+ fn=run_voice_reply,
503
+ inputs=[transcript_state, output_lang],
504
+ outputs=[voice_reply_out, voice_reply_audio],
505
  )
506
 
507
  # ── Text tab — dev loop, skips Whisper ──────────────────────────
508
+ with gr.Tab("⌨️ Text (translation + optional reply, dev loop)"):
509
  with gr.Row():
510
  with gr.Column():
511
  text_in = gr.Textbox(
512
  label="Type your message",
513
  lines=3,
514
+ placeholder="e.g. Good morning, how are you?",
515
  )
516
  text_submit = gr.Button("Send", variant="primary")
517
  with gr.Column():
518
+ text_translation_out = gr.Textbox(
519
+ label="Phrasebook translation",
520
+ lines=3, interactive=False,
521
+ )
522
+ text_translation_audio = gr.Audio(
523
+ label="Translation audio",
524
+ type="numpy", autoplay=False,
525
+ )
526
+ text_reply_btn = gr.Button(
527
+ "Generate reply (LLM)", variant="secondary"
528
+ )
529
  text_reply_out = gr.Textbox(
530
  label="LLM reply", lines=4, interactive=False,
531
  )
532
+ text_reply_audio = gr.Audio(
533
  label="Reply audio", type="numpy", autoplay=False,
534
  )
535
 
536
  # Text tab only uses output_lang — input_lang is a no-op here.
537
  text_submit.click(
538
+ fn=run_text_translate,
539
  inputs=[text_in, output_lang],
540
+ outputs=[
541
+ text_translation_out,
542
+ text_translation_audio,
543
+ transcript_state,
544
+ ],
545
  )
546
  # Pressing Enter in the textbox also submits.
547
  text_in.submit(
548
+ fn=run_text_translate,
549
  inputs=[text_in, output_lang],
550
+ outputs=[
551
+ text_translation_out,
552
+ text_translation_audio,
553
+ transcript_state,
554
+ ],
555
+ )
556
+ text_reply_btn.click(
557
+ fn=run_text_reply,
558
+ inputs=[transcript_state, output_lang],
559
+ outputs=[text_reply_out, text_reply_audio],
560
  )
561
 
562
  gr.Markdown(
 
567
  "stripped-down baseline used to measure what Whisper zero-shot does on "
568
  "real Bambara/Fula recordings and to collect a real-user eval set.\n\n"
569
  "The **Text** tab skips Whisper — it's for fast iteration on the "
570
+ "LLM + TTS path, not for field-test measurement.\n\n"
571
+ "**How the two boxes differ:** the top pair is a phrasebook lookup "
572
+ "(no LLM, instant, gold-curated translation). If your input isn't "
573
+ "in the curated list you'll see *(no curated translation)* — click "
574
+ "**Generate reply** to get a dialect-anchored LLM response in the "
575
+ "bottom pair."
576
  )
577
 
578
  return demo
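
The three-way branching of the phrasebook-only path (hit, en/fr echo, bam/ful miss) can be illustrated with a small stand-alone sketch. It makes several assumptions: `translate_only` here is a simplified stand-in for `_translate_only`, the dictionary with exact-match keys replaces the scored `phrasebook_lookup` used in the app, the sentinel text is shortened, and the sample entry uses a placeholder target instead of a real curated phrase.

```python
from typing import Mapping, Optional, Tuple

NO_TRANSLATION = "(no curated translation)"  # shortened sentinel for illustration


def translate_only(
    text: str,
    output_lang: str,
    phrasebook: Mapping[Tuple[str, str], str],
) -> Tuple[str, Optional[str]]:
    """Return (display_text, text_to_speak_or_None) without calling any LLM."""
    cleaned = (text or "").strip()
    if not cleaned:
        return "", None
    # Exact-match lookup; the real app uses a scored phrasebook match instead.
    target = phrasebook.get((cleaned.lower(), output_lang))
    if target:
        return target, target
    if output_lang in ("en", "fr"):
        # No curated phrasebook for these targets: echo the input for TTS.
        return cleaned, cleaned
    return NO_TRANSLATION, None


# Hypothetical one-entry phrasebook with a placeholder target string.
demo_book = {("good morning", "bam"): "CURATED_TARGET"}
hit = translate_only("Good morning ", "bam", demo_book)
miss = translate_only("Where is the market?", "bam", demo_book)
echo = translate_only("Where is the market?", "fr", demo_book)
```

Keeping this path free of LLM calls is what lets the UI run it automatically on every submit, while the slower reply path stays behind its own button.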
src/llm/minimal_client.py CHANGED
@@ -94,6 +94,12 @@ def _build_system_prompt(
94
  lines: list[str] = [
95
  f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
96
  "",
 
 
 
 
 
 
97
  "Output format: plain natural text only. No JSON, no code fences, no "
98
  "markdown, no translations, no romanisation, no explanations. Reply in "
99
  "1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
@@ -114,8 +120,12 @@ def _build_system_prompt(
114
  if anchors:
115
  lines += [
116
  "",
117
- f"Reference phrases in {full} — use this exact orthography, spelling, "
118
- "and dialectal style as your model for every reply:",
 
 
 
 
119
  ]
120
  for item in anchors:
121
  src = item.get("source", "").strip()
@@ -127,8 +137,9 @@ def _build_system_prompt(
127
  lines += [
128
  "",
129
  "Additional reference phrases relevant to the current user input "
130
- f"(curated gold {full} translations — use the same orthography and "
131
- "style):",
 
132
  ]
133
  for item in extra_examples:
134
  src = (item.get("source") or "").strip()
 
94
  lines: list[str] = [
95
  f"You are a warm, concise conversational assistant that replies ONLY in {full}.",
96
  "",
97
+ "Your task is to REPLY to the user's message as a person would in "
98
+ "conversation — NOT to translate it. If the user greets you, greet them "
99
+ "back and ask how they are. If they ask a question, answer it. If they "
100
+ "make a statement, respond appropriately. Never simply repeat or "
101
+ "translate what they said back to them.",
102
+ "",
103
  "Output format: plain natural text only. No JSON, no code fences, no "
104
  "markdown, no translations, no romanisation, no explanations. Reply in "
105
  "1–3 short sentences suitable to be read aloud by a text-to-speech voice.",
 
120
  if anchors:
121
  lines += [
122
  "",
123
+ f"Reference phrases in {full} — these pairs are STYLE/ORTHOGRAPHY "
124
+ "examples ONLY (showing how English/French maps to the correct "
125
+ "dialect). Do NOT treat them as a translation task: when the user "
126
+ "writes one of these source phrases, do not just output its target "
127
+ "verbatim — instead REPLY conversationally in the same dialectal "
128
+ "style:",
129
  ]
130
  for item in anchors:
131
  src = item.get("source", "").strip()
 
137
  lines += [
138
  "",
139
  "Additional reference phrases relevant to the current user input "
140
+ f"(curated gold {full} translations — STYLE references only, not a "
141
+ "translation task; reply conversationally, do not echo the target "
142
+ "verbatim):",
143
  ]
144
  for item in extra_examples:
145
  src = (item.get("source") or "").strip()
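
The reply-not-translate framing in the prompt changes above can be sketched as a self-contained builder. This is only an approximation under stated assumptions: `build_system_prompt` is a simplified, hypothetical version of `_build_system_prompt`, the wording is abbreviated, the `target` key and the `- source => target` rendering are guesses (the hunk ends before the per-pair formatting is visible), and the sample pair uses a placeholder target.

```python
from typing import Iterable, List, Mapping, Tuple


def build_system_prompt(
    language: str, examples: Iterable[Mapping[str, str]] = ()
) -> str:
    """Assemble a reply-not-translate system prompt (illustrative wording)."""
    lines: List[str] = [
        f"You are a warm, concise conversational assistant that replies ONLY in {language}.",
        "",
        "Your task is to REPLY to the user's message as a person would in "
        "conversation, NOT to translate it.",
    ]
    pairs: List[Tuple[str, str]] = []
    for item in examples:
        src = (item.get("source") or "").strip()
        tgt = (item.get("target") or "").strip()
        if src and tgt:  # skip incomplete pairs
            pairs.append((src, tgt))
    if pairs:
        lines += [
            "",
            f"Reference phrases in {language}: STYLE/ORTHOGRAPHY examples ONLY, "
            "not a translation task. Do not echo a target verbatim.",
        ]
        # Assumed rendering; the real per-pair format is not shown in the diff.
        lines += [f"- {src} => {tgt}" for src, tgt in pairs]
    return "\n".join(lines)


prompt = build_system_prompt(
    "Bambara",
    [
        {"source": "Good morning", "target": "CURATED_TARGET"},
        {"source": "", "target": "dropped"},
    ],
)
```

Stating the task before the examples, and labelling the examples as style references, is the design choice the commit relies on to stop the model from echoing a phrasebook target.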