Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation
Paper • 2606.03672 • Published
GitHub Code | arXiv | Demo
This repository packages the public inference checkpoint set for Foley-Omni. The release focuses on Video-to-Soundtrack (V2ST) generation, where the model jointly generates synchronized speech, sound effects, and music from a video and optional text prompt.
5.5B
ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth
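Before running inference, it can help to confirm that the download matches the layout above. The following is a minimal sketch, not part of the Foley-Omni code base: the helper name `missing_checkpoints` and the default root `ckpts` are assumptions made for illustration, and only the relative file paths come from the listing.

```python
from pathlib import Path

# Relative paths copied from the directory layout shown above.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def missing_checkpoints(ckpt_root="ckpts"):
    """Hypothetical helper: list expected files that are absent under ckpt_root."""
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

The check only tests for file presence; it does not validate file sizes or hashes.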
What each part is used for:
- ckpts/Foley-Omni/v2st.pth: released inference-only Foley-Omni weights
- ckpts/Wan2.2-TI2V-5B/*: text encoder and tokenizer for text conditioning
- ckpts/mmaudio/ext_weights/v1-16.pth: audio VAE for the 16 kHz inference path
- ckpts/mmaudio/ext_weights/best_netG.pt: vocoder for waveform decoding
- ckpts/mmaudio/ext_weights/synchformer_state_dict.pth: online visual feature extraction

This release supports both:
- online feature extraction from the input video
- precomputed features passed through clip_feature_path and sync_feature_path

Notes:
- synchformer_state_dict.pth is included in this repository because it is required for online Sync feature extraction.
- CLIP weights are loaded through open_clip from apple/DFN5B-CLIP-ViT-H-14-384 on first use. The current code path does not use a separate local CLIP checkpoint file.

This repository redistributes a small subset of files from the following upstream releases for convenience:
- Wan2.2-TI2V-5B (files under ckpts/Wan2.2-TI2V-5B/)
- MMAudio (files under ckpts/mmaudio/ext_weights/)
Please refer to the original upstream repositories for their licenses, usage terms, and project details.
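The notes above mention online feature extraction as well as the clip_feature_path and sync_feature_path arguments. One way a caller might choose between the two is sketched below. This is an illustration under stated assumptions, not the repository's actual logic: the function name `select_feature_mode`, the mode labels, and the rule that both paths must be supplied together are all hypothetical.

```python
def select_feature_mode(clip_feature_path=None, sync_feature_path=None):
    """Hypothetical helper: decide how visual features would be obtained.

    Assumption: supplying both paths means precomputed features are used;
    supplying neither means features are extracted online from the video.
    """
    has_clip = clip_feature_path is not None
    has_sync = sync_feature_path is not None
    if has_clip and has_sync:
        return "precomputed"
    if not has_clip and not has_sync:
        return "online"
    # Assumption: a single path is treated as a configuration error.
    raise ValueError(
        "Provide both clip_feature_path and sync_feature_path, or neither."
    )


if __name__ == "__main__":
    print(select_feature_mode())
    print(select_feature_mode("clip.npy", "sync.npy"))
```

The file names `clip.npy` and `sync.npy` are placeholders; the actual feature file format is not specified here.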