import os
import json
import tempfile
import subprocess

import gradio as gr
import numpy as np
import torch
from funasr import AutoModel

# Load the ASR model once at import time; inference runs on CPU.
model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch",
    hub="hf",
    model_hub="hf",
    device="cpu",
)


def extract_audio(video_path):
    """Extract the audio track as 16 kHz mono 16-bit PCM WAV.

    Returns the output path; the file does not exist if ffmpeg failed.
    """
    # mkstemp creates the file safely, unlike the deprecated tempfile.mktemp.
    fd, audio_path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    cmd = [
        "ffmpeg", "-i", video_path,
        "-vn", "-acodec", "pcm_s16le",
        "-ar", "16000", "-ac", "1",
        "-y", audio_path,
    ]
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0 and os.path.exists(audio_path):
        # Remove the placeholder so the caller's existence check detects the failure.
        os.unlink(audio_path)
    return audio_path


def transcribe_video(video_path, progress=gr.Progress()):
    """Transcribe a video; return (display text, sentence list, sentences as JSON)."""
    if video_path is None:
        return "Please upload a video file.", [], None

    progress(0.1, desc="Extracting audio...")
    audio_path = extract_audio(video_path)
    if not os.path.exists(audio_path):
        return "Failed to extract audio from video. Make sure it contains an audio track.", [], None

    progress(0.3, desc="Transcribing speech...")
    try:
        res = model.generate(input=audio_path, batch_size_s=300)
    except Exception as e:
        return f"Transcription error: {e}", [], None
    finally:
        # Always clean up the temporary WAV file.
        if os.path.exists(audio_path):
            os.unlink(audio_path)

    # Without sentence-level timestamps, fall back to the plain text.
    if not res or not res[0].get("sentence_info"):
        text = res[0].get("text", "") if res else ""
        return text, [], None

    progress(0.8, desc="Processing timestamps...")
    sentences = []
    for sent in res[0]["sentence_info"]:
        # Timestamps arrive in milliseconds; store them in seconds.
        sentences.append({
            "start": sent["start"] / 1000.0,
            "end": sent["end"] / 1000.0,
            "text": sent["text"],
        })

    full_text = "\n".join(
        f"[{s['start']:.1f}s - {s['end']:.1f}s] {s['text']}" for s in sentences
    )
    progress(1.0, desc="Done!")
    return full_text, sentences, json.dumps(sentences, ensure_ascii=False)


def clip_video(video_path, sentences_json, selected_indices):
    """Cut the selected sentences out of the video and concatenate them into one clip."""
    if not video_path or not sentences_json or not selected_indices:
        return None, "Please transcribe a video first, then select segments to clip."

    sentences = json.loads(sentences_json)
    indices = [int(i) for i in selected_indices]
    if not indices:
        return None, "No segments selected."

    clips = []
    for idx in sorted(indices):
        if 0 <= idx < len(sentences):
            clips.append((sentences[idx]["start"], sentences[idx]["end"]))
    if not clips:
        return None, "Invalid selection."

    # Merge segments separated by less than 0.5 s to avoid choppy cuts.
    merged = [clips[0]]
    for start, end in clips[1:]:
        if start - merged[-1][1] < 0.5:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    fd, output_path = tempfile.mkstemp(suffix=".mp4")
    os.close(fd)

    # Trim each segment from the video and audio streams and reset its timestamps.
    filter_parts = []
    for i, (start, end) in enumerate(merged):
        filter_parts.append(
            f"[0:v]trim=start={start:.3f}:end={end:.3f},setpts=PTS-STARTPTS[v{i}];"
            f"[0:a]atrim=start={start:.3f}:end={end:.3f},asetpts=PTS-STARTPTS[a{i}];"
        )
    # The concat filter expects its inputs interleaved per segment: [v0][a0][v1][a1]...
    concat_inputs = "".join(f"[v{i}][a{i}]" for i in range(len(merged)))
    filter_parts.append(f"{concat_inputs}concat=n={len(merged)}:v=1:a=1[outv][outa]")
    filter_complex = "".join(filter_parts)

    cmd = [
        "ffmpeg", "-i", video_path,
        "-filter_complex", filter_complex,
        "-map", "[outv]", "-map", "[outa]",
        "-y", output_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Do not leave a partial output file behind on failure.
        if os.path.exists(output_path):
            os.unlink(output_path)
        return None, f"FFmpeg error: {result.stderr[-500:]}"

    total_duration = sum(end - start for start, end in merged)
    return output_path, f"Clipped {len(merged)} segment(s), total {total_duration:.1f}s"


description_html = """
AI Video Clipping — Speak to Clip
Upload a video → Auto-transcribe with timestamps → Select text segments → Export precise clips