Foley-Omni

Overview

This repository packages the public inference checkpoint set for Foley-Omni. The release focuses on Video-to-Soundtrack (V2ST) generation, where the model jointly generates synchronized speech, sound effects, and music from a video and an optional text prompt.

Model Size

5.5B parameters

Repository Contents

ckpts/
├── Foley-Omni/
│   └── v2st.pth
├── Wan2.2-TI2V-5B/
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   └── google/
│       └── umt5-xxl/
│           ├── special_tokens_map.json
│           ├── spiece.model
│           ├── tokenizer.json
│           └── tokenizer_config.json
└── mmaudio/
    └── ext_weights/
        ├── v1-16.pth
        ├── best_netG.pt
        └── synchformer_state_dict.pth

What each part is used for:

ckpts/Foley-Omni/v2st.pth: released inference-only Foley-Omni weights
ckpts/Wan2.2-TI2V-5B/*: text encoder and tokenizer for text conditioning
ckpts/mmaudio/ext_weights/v1-16.pth: audio VAE for the 16 kHz inference path
ckpts/mmaudio/ext_weights/best_netG.pt: vocoder for waveform decoding
ckpts/mmaudio/ext_weights/synchformer_state_dict.pth: Synchformer weights for online Sync (visual) feature extraction
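
Before running inference, it can help to confirm that a download is complete. The following is a minimal sketch, not part of the Foley-Omni code: it assumes the checkpoints live under a local ckpts/ directory laid out as in the tree above, and the helper name find_missing_files is hypothetical.

```python
from pathlib import Path

# Relative paths copied from the repository tree above.
EXPECTED_FILES = [
    "Foley-Omni/v2st.pth",
    "Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.2-TI2V-5B/google/umt5-xxl/special_tokens_map.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json",
    "Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer_config.json",
    "mmaudio/ext_weights/v1-16.pth",
    "mmaudio/ext_weights/best_netG.pt",
    "mmaudio/ext_weights/synchformer_state_dict.pth",
]


def find_missing_files(ckpt_root):
    """Return the expected files that are absent under ckpt_root.

    Hypothetical helper for illustration; it only checks that each
    file exists and does not validate the file contents.
    """
    root = Path(ckpt_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "ckpts" is an assumed local download location.
    missing = find_missing_files("ckpts")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoint files are present.")
```

The check is deliberately limited to file presence, so it runs without loading any weights.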

Online Feature Extraction

This release supports both:

direct V2ST inference with pre-extracted clip_feature_path and sync_feature_path
V2ST inference without pre-extracted features, using online visual feature extraction
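
The choice between the two modes can be sketched as below. This is an illustration under stated assumptions, not the repository's actual logic: the function name choose_feature_mode and its return values are hypothetical, and falling back to online extraction when only one of the two feature files is present is an assumption of this sketch.

```python
import os


def choose_feature_mode(clip_feature_path=None, sync_feature_path=None):
    """Return "precomputed" only when both feature files exist, else "online".

    Hypothetical helper: the argument names mirror clip_feature_path and
    sync_feature_path from the text above; the return strings are
    illustrative labels, not values used by the Foley-Omni code.
    """
    paths = (clip_feature_path, sync_feature_path)
    if all(p is not None and os.path.isfile(p) for p in paths):
        return "precomputed"
    # Assumption: a missing or partial feature set triggers online extraction.
    return "online"
```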

Notes:

synchformer_state_dict.pth is included in this repository because it is required for online Sync feature extraction.
The CLIP image encoder is loaded by open_clip from apple/DFN5B-CLIP-ViT-H-14-384 on first use. The current code path does not use a separate local CLIP checkpoint file.
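
For reference, loading that CLIP model through open_clip typically looks like the sketch below. It uses open_clip's create_model_from_pretrained with the hf-hub: prefix; whether the Foley-Omni code makes exactly this call is an assumption, and the wrapper name load_clip_image_encoder is hypothetical. The first call downloads the weights from the Hugging Face Hub.

```python
# Hub identifier for the CLIP image encoder named in the notes above.
CLIP_HUB_ID = "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"


def load_clip_image_encoder(device="cpu"):
    """Load the CLIP model via open_clip and return (model, preprocess).

    Hypothetical wrapper for illustration. The first call downloads
    the weights, so it needs network access and several GB of disk.
    """
    # Imported lazily so this module can be imported without open_clip installed.
    import open_clip

    model, preprocess = open_clip.create_model_from_pretrained(CLIP_HUB_ID)
    return model.to(device).eval(), preprocess
```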

Source Attribution

This repository redistributes a small subset of files from the following upstream releases for convenience:

Wan2.2-TI2V-5B: text encoder and tokenizer files
MMAudio: audio VAE, vocoder, and Synchformer files

Please refer to the original upstream repositories for their licenses, usage terms, and project details.


Paper

Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

Paper ID: 2606.03672