# Video Guide Maker — User Guide

Turn a recorded lecture into an accessible, navigable study guide. Hand it a video and a transcript; it picks the meaningful moments (slide changes, board work, demos), aligns narration to each one, and produces a polished HTML document you can edit before publishing.

---

## Table of contents

1. [Quickstart](#quickstart)
2. [The two paths](#the-two-paths)
3. [Path A: AI topic-based guide (recommended)](#path-a-ai-topic-based-guide-recommended)
4. [Path B: Slide-by-slide scene picker (no API key)](#path-b-slide-by-slide-scene-picker-no-api-key)
5. [The Review & Edit screen](#the-review--edit-screen)
6. [Output formats](#output-formats)
7. [AI assist features](#ai-assist-features)
8. [Advanced settings](#advanced-settings)
9. [Tips for the best results](#tips-for-the-best-results)
10. [Troubleshooting](#troubleshooting)
11. [Glossary](#glossary)

---

## Quickstart

The five-minute version. Everything else in this guide is detail.

1. **Open the site.**
2. **Upload your video** (drag it onto the drop zone, or click to browse).
3. **Upload your transcript** (`.srt` or `.vtt`).
4. **Paste your Anthropic API key** in the field below the transcript. (Don't have one? See [Path B](#path-b-slide-by-slide-scene-picker-no-api-key).)
5. **Click "Preview topic scenes (AI)"** — wait ~30 seconds per minute of video. You'll see Claude's proposed topic breakdown with a thumbnail for each.
6. **Uncheck any topics you don't want.** Then click **Generate study guide →**.
7. The editor opens with one section per topic. Edit anything that needs polish. Click **Download final HTML** when done.

That's it. The rest of this guide explains each step in depth.

---

## The two paths

There are two ways to produce a guide, and the upload page nudges you toward whichever is appropriate based on whether you provide an Anthropic API key.

**With an API key — Path A (AI topic-based):**

The recommended path. Claude reads your transcript, identifies natural topic boundaries (Introduction, Background, Method, Examples, Conclusion…), and produces a guide with one section per topic instead of one section per slide. The result is fewer, more meaningful sections, and the narration flows naturally instead of being chopped at every visual change.

**Without an API key — Path B (slide-by-slide):**

The classic path. The app uses computer vision to detect every slide change, runs Tesseract OCR locally on each slide, and produces a guide with one section per slide. Narration is bucketed to whichever section's time range it falls into. This path makes no Anthropic calls, so it needs no API key and incurs no API cost, but it produces a long guide with many short sections.

Both paths share the same preview-and-curate UI: you see what will be in your guide before you pay the time and API cost of the final extraction.

---

## Path A: AI topic-based guide (recommended)

### What you'll need

| Item | Notes |
|---|---|
| **Lecture video** | `.mp4`, `.mov`, `.mkv`, `.webm`, `.m4v`. Up to 500 MB. |
| **Transcript** | `.srt` or `.vtt` (the captions file from your video). Up to 5 MB. |
| **Anthropic API key** | Starts with `sk-ant-`. Get one at [console.anthropic.com](https://console.anthropic.com). Cost is roughly $0.05–0.30 per video at default settings. |
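
If you prepare files in bulk, a quick local check against the limits in the table can save a failed upload. The sketch below is only an illustration: the `check_upload` helper is hypothetical, and it assumes "MB" means 1024 × 1024 bytes, which the table does not specify.

```python
from pathlib import Path

# Limits copied from the table above. The helper name is hypothetical.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm", ".m4v"}
TRANSCRIPT_EXTENSIONS = {".srt", ".vtt"}
MAX_VIDEO_BYTES = 500 * 1024 * 1024       # assumption: MB means MiB
MAX_TRANSCRIPT_BYTES = 5 * 1024 * 1024


def check_upload(filename: str, size_bytes: int, kind: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    if kind == "video":
        allowed, limit = VIDEO_EXTENSIONS, MAX_VIDEO_BYTES
    elif kind == "transcript":
        allowed, limit = TRANSCRIPT_EXTENSIONS, MAX_TRANSCRIPT_BYTES
    else:
        raise ValueError(f"unknown kind: {kind!r}")

    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in allowed:
        problems.append(f"unsupported extension {ext or '(none)'}")
    if size_bytes > limit:
        problems.append(f"file is larger than {limit // (1024 * 1024)} MB")
    return problems
```

For example, `check_upload("lecture.MP4", 10, "video")` returns an empty list, because the extension check is case-insensitive.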

### Step-by-step

**1. Fill in the basics.**

- **Study guide title** — pre-filled with "Lecture Study Guide"; change it to your lecture's title.
- **Video file** — drag the file onto the drop zone or click to browse.
- **Transcript** — same. SRT or VTT.
- **Anthropic API key** — paste it here. The key is held in memory for this one run only, sent to Anthropic over HTTPS, and never logged or persisted to disk.

The moment you paste a key, the **Topic-based segmentation** toggle in the AI assist section automatically checks itself (with a small "Auto-enabled" note explaining what's happening). You can uncheck it if you'd rather get the per-slide breakdown.

**2. (Optional) Pick a video type.**

Open the "Topic-based segmentation" toggle if it isn't already visible. Below it, a small "Video type" dropdown lets you tell Claude what kind of video this is:

- **General** — no genre hint. Claude infers structure from the transcript alone.
- **Classroom lecture** — biases boundaries toward intro → background → concept(s) → examples → wrap-up.
- **Lab walkthrough / experiment** — biases toward intro → theory → materials → procedure → results.
- **How-to tutorial / screencast** — biases toward intro → setup → steps → verification → troubleshooting.
- **Interview, panel, or Q&A** — boundaries at new questions or major subject pivots.
- **Talk or presentation** — biases toward hook → thesis → supporting points → closer.

The hint adds one paragraph to Claude's system prompt. Pick the closest match; it noticeably improves the boundary placement on structured content.

**3. (Optional) Also send slide images.**

Underneath the topic toggle there's an indented sub-toggle: **"Also send slide images"**. When checked, Claude additionally receives a thumbnail of every detected slide alongside the transcript. This is slower (5–10× the tokens, ~10–30 seconds instead of ~5–10 seconds) but meaningfully better on lectures with visual chapter markers (splash slides, "Module 3" title cards). The call is capped at 60 slides.

Leave it off if your transcript is well-punctuated and your lectures are narration-driven. Turn it on if your slides have strong title cards or section dividers.

**4. (Optional) Other AI toggles.**

Two clusters of additional AI features:

*Per-segment enrichments* — each runs Claude once per kept slide:
- **Section titles** — replaces "Segment 3 — 4:25" with a meaningful title for each section.
- **Alt-text drafts** — accessible image descriptions (purpose, not just transcription).
- **Key terms** — term + concise definition extracted per segment.
- **Math equations** — LaTeX captured from on-screen formulas with screen-reader labels.
- **On-screen text (Claude)** — replaces Tesseract OCR with Claude's reading of the slide. Better on coloured callouts, decorative fonts, mathematical notation.

*Whole-document features* — one call per video:
- **Topic-based segmentation** — what we covered above.
- **Also send slide images** — sub-toggle for topic segmentation.

Cost scales linearly with the number of per-segment toggles you enable and the number of segments you keep. The whole-document toggles have a flat cost.
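
To get a feel for that scaling, you can count the Claude calls a run would make. This is a back-of-the-envelope sketch, not the app's billing logic: the function name is hypothetical, and it assumes each enabled per-segment toggle makes exactly one call per kept segment and that topic segmentation (with or without slide images) is a single call.

```python
def estimate_llm_calls(kept_segments: int,
                       per_segment_toggles: int,
                       topic_segmentation: bool) -> int:
    """Rough count of Claude calls for one run (hypothetical helper)."""
    # One call per enabled per-segment toggle, for every kept segment.
    per_segment_calls = kept_segments * per_segment_toggles
    # Topic segmentation is one whole-document call, however many segments exist.
    whole_document_calls = 1 if topic_segmentation else 0
    return per_segment_calls + whole_document_calls


# 12 kept segments with three per-segment toggles plus topic segmentation.
print(estimate_llm_calls(12, 3, True))
```

Unchecking topics in the preview lowers `kept_segments`, which is why curating before you generate saves cost.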

**5. Click "Preview topic scenes (AI)".**

The button is the prominent maroon CTA with a sparkles icon and "AI" badge. The status line updates: *"Detecting scenes, OCR, and asking Claude to segment topics…"* This is the long step:
- Scene detection: 10–30 seconds for a typical 5-minute video, scales with length.
- Tesseract OCR on each kept slide: ~200 ms per slide.
- Claude's topic call: ~5–10 seconds (transcript-only) or ~10–30 seconds (with slide images).

When complete, the picker shows one card per topic:
- A thumbnail of the picked primary slide for that topic.
- The topic's title (Claude's).
- The time range and slide count ("0:00–2:35 · 4 slides").
- A one-line summary (Claude's).

All cards are selected by default.

**6. Curate.**

Click any card to deselect (toggles its `aria-pressed` state). Use **Select all** / **Deselect all** to bulk-edit. The hint above reflects your current selection count.

Anything you uncheck is dropped from the final guide — its slides won't appear, and the per-segment LLM extraction call for it won't run (saving cost).

**7. Click "Generate study guide →".**

The form posts to the server with your selected topics. The pipeline picks up from where the preview left off and runs:
- LLM extraction on each kept topic's primary slide (alt-text, key terms, etc., depending on what you toggled).
- Audio slicing if you turned on "Per-segment audio" in the Media extras section.
- Render to your chosen output format.

The progress bar walks through the stages — the topic LLM call is skipped this time (we already did it during the preview).

**8. The editor opens.**

See [The Review & Edit screen](#the-review--edit-screen) below.

---

## Path B: Slide-by-slide scene picker (no API key)

If you don't have or don't want to use an Anthropic API key, the **Preview scenes** button is the primary CTA (the AI button is hidden when no key is set). The flow is the same as Path A, with these differences:

- No topic LLM call. The picker just shows every kept slide.
- One section per slide in the final guide.
- Tesseract handles all OCR.
- Section titles default to "Segment N — mm:ss".
- No section grouping; narration is bucketed by midpoint.

You can still hand-edit everything in the Review & Edit screen, so this path is fine for shorter lectures or when you just want the slides extracted and don't need topical structure.
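
"Bucketed by midpoint" means each transcript cue is assigned to the section whose time range contains the middle of the cue. A minimal sketch of that idea follows. The function name and the tuple layout are assumptions made for illustration, and it assumes each section runs from its start time to the start of the next one.

```python
from bisect import bisect_right


def bucket_cues_by_midpoint(cues, segment_starts):
    """Assign cue text to segments by cue midpoint (hypothetical helper).

    cues: list of (start_s, end_s, text) tuples.
    segment_starts: sorted list of segment start times in seconds.
    Returns one list of cue texts per segment.
    """
    if not segment_starts:
        return []
    buckets = [[] for _ in segment_starts]
    for start_s, end_s, text in cues:
        midpoint = (start_s + end_s) / 2
        # Index of the last segment that starts at or before the midpoint.
        index = bisect_right(segment_starts, midpoint) - 1
        # Cues that fall before the first segment go into the first bucket.
        buckets[max(index, 0)].append(text)
    return buckets


cues = [(0, 4, "Welcome."), (4, 8, "Today: gradients."), (9, 13, "First, a recap.")]
print(bucket_cues_by_midpoint(cues, [0, 10]))
```

A cue that straddles a slide change lands in whichever section holds its midpoint, which is why narration can occasionally break mid-sentence in this mode.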

---

## The Review & Edit screen

After **Generate** runs, your browser opens an in-page editor with:

- A toolbar at the top: **Save edits (JSON)**, **Load edits (JSON)**, **Preview**, **Download final HTML**, **Download ZIP** (if available).
- The rendered guide below, with every editable element outlined.

### What you can edit

- **Page title, subtitle, eyebrow text** — click to edit, type, click away.
- **Section title** — same. Replace Claude's title with anything you want.
- **Alt text** — the WCAG-grade description of the slide. If Claude generated a draft that you haven't reviewed yet, you'll see a "DRAFT" badge until you confirm it.
- **On-screen text from video** — the OCR text. Click to edit.
- **Narration** — the spoken text for the section. Click any paragraph to edit.
- **Key terms** — add/remove rows; each is a term + definition.
- **Math equations** — add/remove rows; each is LaTeX + an aria-label.
- **Include checkbox** — uncheck a segment to dim it (it stays in the document but is visually marked as excluded).

### Per-slide controls

If a segment has alternate frames available (the scene detector saved them), a **Frame picker** appears under the primary slide. Click any alternate to swap it in. The OCR text auto-refreshes for the new pick.

### Save / load your edits

- **Save edits (JSON)** downloads a `.edits.json` file containing every field's current value. Use this to checkpoint a long edit session.
- **Load edits (JSON)** applies a previously-saved file back to the editor. Useful if you re-ran the pipeline and want to restore your manual edits on top.

### Preview

The **Preview** button opens a clean read-only render in a new tab. Useful for catching layout issues before downloading.

---

## Output formats

Pick from the **Output format** dropdown next to the Generate button:

| Format | Best for | Contents |
|---|---|---|
| **Review & edit** (default) | Reviewing before publishing | Opens the in-browser editor. From there you download HTML or ZIP. |
| **Single HTML file** | Email, archives | One self-contained `.html` file with all images and (optionally) audio inlined as data URIs. Largest file size but ultra-portable. |
| **Zip bundle** | Hosting on a website | A `.zip` containing `study-guide.html` and a `static/` folder with images and audio. Lighter HTML; you upload the folder structure. |

You can switch formats after generating: the editor's toolbar offers both **Download final HTML** and **Download ZIP**.
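
To see why the single-file format is the largest, it helps to look at what "inlined as data URIs" means. The sketch below is illustrative only (the `to_data_uri` helper is hypothetical, not the app's code): it base64-encodes a file's bytes into a string that can sit directly in an `<img src="…">` attribute.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw file bytes as a data URI (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback for unknown extensions
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


print(to_data_uri(b"hi", "slide.png"))
```

Base64 emits four characters for every three input bytes, so each inlined image or audio clip grows by roughly a third. The ZIP bundle avoids that overhead by keeping assets as separate files.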

---

## AI assist features

Detailed reference for the AI toggle section. All require an Anthropic API key.

### Per-segment enrichments

These each add one LLM call per kept slide. Cost scales with segment count.

| Toggle | What it does |
|---|---|
| **Section titles** | Replaces auto-generated "Segment N" labels with concise topical titles ("Combining AI models" instead of "Segment 3 — 4:25"). When topic segmentation is also on, the topic-level titles take precedence. |
| **Alt-text drafts** | Generates WCAG-purpose descriptions of each frame — captures what the image is *for* in the lecture's argument, not a transcription of slide text. One or two sentences, ≤200 characters. Marked as DRAFT in the editor until you review them. |
| **Key terms** | Extracts the terms a student should remember from each segment, with concise definitions (≤25 words each). Filters generic vocabulary and section labels. |
| **Math equations** | Captures equations *visibly displayed* on the slide as clean LaTeX (no `$` delimiters, no `\begin{equation}`) along with screen-reader-friendly aria-labels. Skips spoken-only equations. |
| **On-screen text (Claude)** | Replaces Tesseract OCR with Claude's reading of the slide. Better on coloured callouts, decorative fonts, mathematical notation. When this is on without topic segmentation, Tesseract is skipped entirely to save time. |

### Whole-document features

These run once per video, regardless of segment count.

| Toggle | What it does |
|---|---|
| **Topic-based segmentation** | Replaces one-segment-per-slide with one-segment-per-topic. Claude reads the full transcript and returns 3–15 sequential topic boundaries with titles and summaries. The longest-on-screen slide in each topic becomes the primary; the others become switchable alternates. |
| **Also send slide images** *(sub-toggle)* | Only available when the parent is checked. Sends downscaled thumbnails of every slide alongside the transcript so Claude can use visual chapter markers (splash slides, title cards, layout shifts) when picking boundaries. ~5–10× the token cost of the text-only call. |

### Video type (dropdown)

Below the topic toggles. Biases Claude toward the conventional structure of the genre — see [Path A step 2](#path-a-ai-topic-based-guide-recommended) for the full list.

---

## Advanced settings

These settings live in the right-hand sidebar. They shape what the pipeline detects as a "scene" before any LLM step runs.

| Setting | What it controls |
|---|---|
| **Scene-change sensitivity** | How visually different two consecutive frames must be before the second counts as a new scene. Lower (5–15) catches every animation step; higher (35+) only major slide changes. Default 27. |
| **Minimum gap between scenes** | Drops scenes that arrive less than N seconds after the previous kept one. Set ~5 s for slides that build up step-by-step — keeps only the final populated state. Default 0 (off). |
| **Instructor-frame threshold** | Face-detection threshold: if the largest detected face takes up more than this fraction of the image, the frame is dropped as "instructor talking head" rather than slide content. Raise to keep slides with inset webcam; lower to be more aggressive about cutaways. Default 0.12. |
| **Max frames** | Hard cap on total kept frames. Off by default. Useful for very long lectures (1 hr+) to bound LLM extraction cost. Frames are evenly distributed across the video when capped. |
| **Skip OCR** | Skip Tesseract entirely. On-screen-text panels will be empty unless Claude's "On-screen text (Claude)" toggle is on too. Doesn't affect LLM extraction's ability to read frames. **Note:** during the topic preview, Tesseract still runs to power the primary-frame picker — this toggle only affects the final segments. |
| **Skip inverted OCR pass** | Tesseract runs two passes by default (normal binarization + inverted, for white-on-coloured callouts). Skipping the inverted pass roughly halves OCR time but loses callout-text recovery, so it is safest on slides without coloured highlights. |
| **Document language** | BCP-47 code (`en`, `en-US`, `fr`…). Sets the `lang` attribute on the rendered HTML (matters for screen readers) and picks the Tesseract language pack for OCR. Default `en`. |

### How advanced settings interact with AI mode

All seven settings still apply when AI features are on, even when every LLM toggle is checked. The CV settings determine the candidate pool that Claude sees in image-augmented topic mode; the OCR settings drive the primary-frame picker that decides which slide represents each topic. See [Tips for the best results](#tips-for-the-best-results) for typical tuning recipes.
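
The two scene settings can be pictured as two filters applied in order. The sketch below is a simplified model, not the detector the app runs: the function name is hypothetical, the per-frame difference scores are assumed to be precomputed, and a score at or above the threshold is assumed to count as a cut.

```python
def pick_scene_cuts(scores, threshold=27.0, min_gap_s=0.0):
    """Filter candidate cuts by sensitivity, then by minimum gap (hypothetical helper).

    scores: list of (time_s, difference_score) pairs for consecutive frames.
    Returns the times kept as scene changes.
    """
    kept = []
    for time_s, score in scores:
        if score < threshold:
            continue  # not different enough from the previous frame
        if kept and time_s - kept[-1] < min_gap_s:
            continue  # arrives too soon after the previous kept scene
        kept.append(time_s)
    return kept


scores = [(1.0, 30), (2.0, 10), (3.0, 40), (9.0, 28)]
print(pick_scene_cuts(scores))                 # defaults: sensitivity 27, gap off
print(pick_scene_cuts(scores, min_gap_s=5.0))  # gap on: closely spaced cuts dropped
print(pick_scene_cuts(scores, threshold=35))   # higher threshold: major changes only
```

Whatever survives these filters is the candidate pool that the instructor filter, OCR, and any AI steps work from.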

---

## Tips for the best results

**Standard recorded lecture, mostly slides:**

- Defaults are tuned for this. Just paste your API key and click **Preview topic scenes (AI)**.

**Inset webcam in the corner of every slide:**

- Raise **Instructor-frame threshold** to 0.25–0.35 so the picker doesn't drop those slides as "talking head".
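
The threshold compares the largest detected face against the whole frame. A rough sketch of that comparison follows. The helper name and the `(x, y, width, height)` box format are assumptions, and the face detector itself is out of scope here.

```python
def is_instructor_frame(face_boxes, frame_width, frame_height, threshold=0.12):
    """True if the largest face covers more than `threshold` of the frame.

    face_boxes: list of (x, y, width, height) boxes from some face detector.
    Hypothetical helper; the app's actual filter may differ.
    """
    if not face_boxes:
        return False  # no face found, so treat the frame as slide content
    largest_area = max(w * h for _, _, w, h in face_boxes)
    return largest_area / (frame_width * frame_height) > threshold


# A 640x480 face box on a 1920x1080 frame covers about 15% of the image.
webcam_face = [(1280, 600, 640, 480)]
print(is_instructor_frame(webcam_face, 1920, 1080))                  # default 0.12
print(is_instructor_frame(webcam_face, 1920, 1080, threshold=0.25))  # raised
```

With the default the frame is dropped as a talking head; with the threshold raised to 0.25 the same frame is kept as slide content.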

**Lecture with lots of animation builds (slides that fade in line by line):**

- Set **Minimum gap between scenes** to ~5 s. The picker keeps only the final populated state of each slide.

**Whiteboard / chalk talks with sparse slides:**

- Lower **Scene-change sensitivity** to 12–15. Catches subtle frame-to-frame board changes.

**Very long lecture (60+ min):**

- Set **Max frames** to 30–50 to bound the LLM cost. Frames are evenly sampled.
- Turn on topic segmentation — you'll get ~8–15 meaningful sections instead of dozens of slide-by-slide ones.
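
"Evenly sampled" can be read as picking frames at regular intervals across the kept list. The sketch below shows one way to do that. It is an assumption about the behaviour (sampling by position in the list rather than by timestamp), and the function name is hypothetical.

```python
def cap_frames_evenly(frames, max_frames):
    """Keep at most `max_frames` items, spread evenly by position (hypothetical helper)."""
    if max_frames is None or max_frames <= 0 or len(frames) <= max_frames:
        return list(frames)  # cap is off or not reached
    if max_frames == 1:
        return [frames[0]]
    # Spread indices from the first frame to the last, inclusive.
    step = (len(frames) - 1) / (max_frames - 1)
    return [frames[round(i * step)] for i in range(max_frames)]


print(cap_frames_evenly(list(range(10)), 4))
```

The first and last frames are always kept, so the opening and closing slides survive even with a tight cap.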

**Lab / experiment walkthrough:**

- Pick **Lab walkthrough** in the Video type dropdown.
- Lab demonstrations often have continuous procedure footage, which the scene detector tends to over-segment. Try a **Minimum gap between scenes** of 5 s.

**Math-heavy slides:**

- Turn on **Math equations** under AI assist. Tesseract garbles math; Claude reads it as LaTeX.
- If your math uses unusual notation, also turn on **On-screen text (Claude)** for cleaner OCR.

**Non-English video:**

- Set **Document language** to the correct BCP-47 code (`fr`, `es`, `de`…). This sets the HTML `lang` attribute and picks Tesseract's language pack.
- Note: Claude's prompts are always in English, but it reads non-English content fine. The topic titles will be in the transcript's language.
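
BCP-47 tags and Tesseract language packs use different codes (`fr` versus `fra`), so a lookup has to happen somewhere. The sketch below shows the general idea. The table is deliberately tiny, the function name is hypothetical, and falling back to English for unknown tags is an assumption rather than documented behaviour.

```python
# Tesseract names its language packs with three-letter codes.
TESSERACT_PACKS = {"en": "eng", "fr": "fra", "es": "spa", "de": "deu"}


def tesseract_pack_for(bcp47_tag: str, default: str = "eng") -> str:
    """Map a BCP-47 tag to a Tesseract pack name (hypothetical helper)."""
    # Only the primary language subtag matters: "es-MX" becomes "es".
    primary = bcp47_tag.strip().split("-")[0].lower()
    return TESSERACT_PACKS.get(primary, default)


print(tesseract_pack_for("en-US"))
print(tesseract_pack_for("es-MX"))
```

Region subtags such as `-US` or `-MX` still matter for the HTML `lang` attribute, even though they do not change which OCR pack is picked in this sketch.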

**Slides without text (photos, diagrams):**

- The primary-frame picker prefers slides with OCR text. For purely visual decks you may need to override picks manually in the editor — every alternate slide is available via the per-segment frame picker.

---

## Troubleshooting

**"Topic preview failed: …"**

Claude returned an unusable response — usually rate limiting or a malformed payload. The job error message names the underlying issue. Try again; if it persists, switch to **Preview scenes** (the non-AI button) which doesn't make the LLM call.

**"Transcript parsed to zero cues"**

The SRT/VTT file is empty or malformed. Open it in a text editor and confirm it contains timestamped cues. If it's a YouTube auto-caption export, make sure you grabbed the SRT, not the `.txt` version.
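
If you'd rather check the file with a script than by eye, counting the timing lines is usually enough. This is a rough sketch (the `count_cues` helper is hypothetical and is not the app's parser); it recognises the `-->` timing lines used by both SRT and VTT.

```python
import re

# SRT uses "00:00:01,000 --> 00:00:04,000"; VTT uses "." and may omit the hours.
TIMING_LINE = re.compile(
    r"^\s*(?:\d+:)?\d{2}:\d{2}[.,]\d{3}\s+-->\s+(?:\d+:)?\d{2}:\d{2}[.,]\d{3}"
)


def count_cues(transcript_text: str) -> int:
    """Count lines that look like cue timings (hypothetical helper)."""
    return sum(1 for line in transcript_text.splitlines() if TIMING_LINE.match(line))


sample = "1\n00:00:01,000 --> 00:00:04,000\nHello\n\n2\n00:00:04,500 --> 00:00:06,000\nWorld\n"
print(count_cues(sample))
```

A result of zero on your own file points to a plain-text export or a broken conversion.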

**"No usable visual segments found"**

The scene detector found nothing, or every scene was filtered as instructor-only. Try lowering **Scene-change sensitivity** (to ~15) and/or raising **Instructor-frame threshold** (to ~0.25).

**Narration breaks mid-sentence between segments**

The transcript probably lacks terminal punctuation (common with auto-generated captions). The sentence stitcher uses casing as a fallback when there's no `.` to align on, but for very ambiguous transcripts you may need to hand-edit segment boundaries in the editor.

**Topic preview returned a talking-head shot as primary for a topic**

The topic's time range covers a stretch with no real slide content (intro, Q&A, etc.). Swap to an alternate in the editor's frame picker, or uncheck that topic if you'd rather drop it.

**The guide is huge / takes forever to download**

Switch the output format to **Zip bundle** — the HTML stays small and assets live in a `static/` folder.

**Preview was successful but the editor never opens after Generate**

Check the browser console for the request — if `/jobs` returned an error, the message appears in the status line just below the form. Common cause: the preview cache expired (30-minute TTL). Re-run the preview.

---

## Glossary

**Alt text.** A short text description of an image that screen readers announce. WCAG-compliant alt text describes the image's *purpose* in context (what role it plays in the argument), not just what's pictured. Claude's "Alt-text drafts" toggle generates these.

**API key (Anthropic).** A credential that authorizes the app to call Claude on your behalf. Starts with `sk-ant-`. Get one from [console.anthropic.com](https://console.anthropic.com). The app holds it in memory for a single run and never persists it.

**BCP-47.** The IETF standard for language tags used in `lang` attributes. `en` = English, `en-US` = US English, `fr` = French, `es-MX` = Mexican Spanish, etc.

**Cue.** A single timestamped chunk of text in a transcript file — typically 2–5 seconds of speech. SRT/VTT files are sequences of cues. The pipeline groups consecutive cues into segment narration based on the segment's time range.

**Editor / Review & Edit screen.** The in-browser interface that opens after Generate completes. Lets you tweak titles, narration, alt text, key terms, math equations, and frame picks before downloading the final guide.

**Frame.** A still image extracted from the video. The pipeline keeps one "primary" frame per scene plus up to two "alternate" frames as candidate switches in the editor.

**Instructor filter / face threshold.** A face detector runs on each candidate frame; if the largest detected face takes up more than the threshold fraction of the image, the frame is dropped as a talking-head shot rather than slide content.

**LLM (large language model).** In this app, Claude. The "AI assist" toggles invoke Claude to enrich or restructure the output.

**OCR (optical character recognition).** Reading text from an image. The app uses Tesseract by default (local, free) and optionally Claude (via the "On-screen text (Claude)" toggle, more accurate but costs API tokens).

**Primary frame.** The slide image that represents a segment at the top of its section in the rendered guide. In topic mode, the primary is picked from the slides whose start time falls inside the topic's range, preferring slides with substantial OCR text over talking-head shots.
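
One way to picture the topic-mode pick is as a ranking over the slides inside the topic's time range. The sketch below is a guess at the shape of that logic, not the app's code: the `Slide` fields, the `min_chars` cutoff of 20 characters, and the use of on-screen duration as a tie-breaker are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Slide:
    start_s: float     # when the slide first appears
    duration_s: float  # how long it stays on screen
    ocr_text: str      # on-screen text read from the slide


def pick_primary(slides, topic_start_s, topic_end_s, min_chars=20):
    """Pick a representative slide for a topic (hypothetical helper)."""
    candidates = [s for s in slides if topic_start_s <= s.start_s < topic_end_s]
    if not candidates:
        return None

    def rank(slide):
        has_text = len(slide.ocr_text.strip()) >= min_chars
        # Slides with substantial text win; longer on-screen time breaks ties.
        return (has_text, slide.duration_s)

    return max(candidates, key=rank)


slides = [
    Slide(0, 60, ""),  # talking-head shot with no text
    Slide(60, 30, "Gradient descent: update rule and step size"),
]
print(pick_primary(slides, 0, 120))
```

Here the shorter slide wins because it has text, which also shows why decks without text may need a manual pick in the editor.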

**Scene.** A stretch of video where the slide doesn't change much. PySceneDetect identifies scene boundaries by frame-to-frame visual difference; each scene contributes one primary frame to the candidate pool.

**Scene detection.** The CV (computer-vision) step that finds scene boundaries. Doesn't involve any AI / LLM call. Configured by the **Scene-change sensitivity** and **Minimum gap between scenes** settings in the sidebar.

**Segment / section.** One unit in the rendered guide: a slide image, narration, on-screen text, plus optional AI enrichments. Topic mode produces one segment per topic; slide mode produces one per scene.

**SRT / VTT.** The two transcript-file formats the app accepts: SubRip (`.srt`) and WebVTT (`.vtt`). Both store timestamped subtitle cues. Most caption-export tools produce one of these.

**Stitcher.** The component that prevents narration from breaking mid-sentence at segment boundaries. When the last cue of segment A ends mid-sentence, the stitcher pulls the next cue(s) from segment B until the sentence is complete (or, for transcripts without periods, until the next cue starts with a capital letter).
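
The rule described in the Stitcher entry can be sketched in a few lines. This is an illustration under stated assumptions, not the app's implementation: the function name is hypothetical, cues are plain strings, and a transcript counts as "punctuated" if any cue in the two segments ends with `.`, `!`, or `?`.

```python
SENTENCE_END = (".", "!", "?")


def stitch(segment_a, segment_b):
    """Move cues from the start of B onto the end of A until A ends cleanly.

    Each segment is a list of cue strings. Returns new (a, b) lists.
    Hypothetical helper illustrating the rule in the glossary entry.
    """
    a, b = list(segment_a), list(segment_b)
    punctuated = any(c.rstrip().endswith(SENTENCE_END) for c in a + b)
    while a and b:
        if punctuated:
            if a[-1].rstrip().endswith(SENTENCE_END):
                break  # the sentence is complete
        elif b[0].lstrip()[:1].isupper():
            break  # casing fallback: the next cue looks like a new sentence
        a.append(b.pop(0))
    return a, b


a, b = stitch(["The gradient points"],
              ["uphill, so we step downhill.", "Next, the learning rate."])
print(a)
print(b)
```

With unpunctuated, all-lowercase captions this sketch has nothing to stop on and would pull every cue across, which is the kind of ambiguous transcript that still needs hand editing.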

**Tesseract.** The open-source OCR engine the app uses for on-screen text by default. Runs locally — no network call, free. Worse than Claude on coloured backgrounds, decorative fonts, and math notation.

**Topic / Topic segmentation.** Optional AI feature where Claude reads the transcript and identifies natural topic boundaries (Introduction, Background, Method…) instead of using one section per slide. Produces fewer, more meaningful sections; narration flows continuously within each topic.

**Topic preview / Preview topic scenes (AI).** The button you click before generating. Runs scene detection, OCR, and the topic-LLM call so you can see Claude's proposed breakdown and uncheck topics you don't want before paying for the per-segment LLM extraction.