---
license: apache-2.0
datasets:
- FunAudioLLM/CineDub-Example
language:
- zh
- en
tags:
- Dubbing-model
---
<p align="center">
<b>🎬 Fun-CineForge: A Unified Dataset Pipeline and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes</b>
</p>
<div align="center">

<a href=""><img src="https://img.shields.io/badge/OS-Linux-orange.svg"></a>
<a href=""><img src="https://img.shields.io/badge/Python->=3.8-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/Pytorch->=2.1-blue"></a>
</div>
<div align="center">
<h4><a href="#Open-Source">Open Source</a>
| <a href="#Environment">Environment</a>
| <a href="#Dataset-Pipeline">Dataset Pipeline</a>
| <a href="#Dubbing-Model">Dubbing Model</a>
| <a href="#Recent-Updates">Recent Updates</a>
| <a href="#Publication">Publication</a>
| <a href="#Comminicate">Communicate</a>
</h4>
</div>
**Fun-CineForge** provides an end-to-end dataset pipeline for producing large-scale dubbing datasets, together with an MLLM-based dubbing model designed for diverse cinematic scenes. Using this pipeline, we constructed CineDub-CN, the first large-scale Chinese television dubbing dataset, with rich annotations and diverse scenes. Across monologue, narration, dialogue, and multi-speaker scenes, our dubbing model consistently outperforms state-of-the-art methods in audio quality, lip sync, timbre transition, and instruction following.
<a name="Open-Source"></a>
## Open Source 🎬
- Project page (CineDub-CN dataset samples and demo samples): [https://funcineforge.github.io/](https://funcineforge.github.io/)
- GitHub: [https://github.com/FunAudioLLM/FunCineForge/](https://github.com/FunAudioLLM/FunCineForge/)
- ModelScope: [https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/](https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/)
- CineDub samples: [Hugging Face](https://huggingface.co/datasets/FunAudioLLM/CineDub-Example/) | [ModelScope](https://www.modelscope.cn/datasets/FunAudioLLM/CineDub-Example)
<a name="Environment"></a>
## Environment Installation
Fun-CineForge relies on Conda and Python environments. Run **setup.py** to install the project environment and the open-source models automatically.
```shell
# Conda
git clone git@github.com:FunAudioLLM/FunCineForge.git
conda create -n FunCineForge python=3.10 -y && conda activate FunCineForge
sudo apt-get install ffmpeg
# Initial settings
python setup.py
```
<a name="Dataset-Pipeline"></a>
## Dataset Pipeline 🔨
### Data collection
If you want to produce your own data, we recommend collecting movies or television series that meet the following requirements.
1. Video source: TV dramas or movies (not documentaries) with plenty of monologue or dialogue scenes and clear, unobstructed faces (e.g., no masks or veils).
2. Speech requirements: standard pronunciation, clear articulation, and a prominent voice; avoid material with heavy dialects, excessive background noise, or strongly colloquial speech.
3. Image requirements: high resolution, clear facial detail, and sufficient lighting; avoid extremely dark or strongly backlit scenes.
### How to use
- [1] Standardize video formats and names; trim the beginning and end of long videos; extract audio from the trimmed videos. (By default, 10 seconds are trimmed from both the beginning and the end.)
```shell
python normalize_trim.py --root datasets/raw_zh --intro 10 --outro 10
```
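The trim arithmetic can be illustrated in plain shell. `get_clip_window` below is a hypothetical helper, not part of the toolkit; it only mirrors the default `--intro 10 --outro 10` behavior.

```shell
# Hypothetical helper (not part of the toolkit) mirroring --intro/--outro:
# given a video duration in seconds, print the start offset and the length
# of the segment that is kept after trimming.
get_clip_window() {
  local duration=$1 intro=${2:-10} outro=${3:-10}
  echo "$intro $(( duration - intro - outro ))"
}

get_clip_window 600   # a 10-minute video keeps seconds 10..590 -> "10 580"
```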
- [2] [Speech Separation](./speech_separation/README.md). Separate the vocals from the background music in the extracted audio.
```shell
cd speech_separation
python run.py --root datasets/clean/zh --gpus 0 1 2 3
```
- [3] [VideoClipper](./video_clip/README.md). For long videos, VideoClipper produces sentence-level subtitle files and clips the long video into segments based on their timestamps. Both Chinese and English are now supported; below is a Chinese example. GPU acceleration is recommended for English.
```shell
cd video_clip
bash run.sh --stage 1 --stop_stage 2 --input datasets/raw_zh --output datasets/clean/zh --lang zh --device cpu
```
- Check video durations and clean up out-of-range clips. (Without `--execute`, the files to be deleted are only printed; after reviewing the list, add `--execute` to actually delete them.)
```shell
python clean_video.py --root datasets/clean/zh
python clean_srt.py --root datasets/clean/zh --lang zh
```
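The dry-run convention above can be sketched in a few lines of shell. `maybe_delete` is an illustrative stand-in, not the actual `clean_video.py` logic; the `EXECUTE` variable plays the role of the `--execute` flag.

```shell
# Illustrative dry-run/execute pattern (stand-in, not clean_video.py itself):
# list deletion candidates by default, and delete only when EXECUTE=1 is set.
maybe_delete() {
  local f
  for f in "$@"; do
    if [ "${EXECUTE:-0}" = "1" ]; then
      rm -f -- "$f"
    else
      echo "would delete: $f"
    fi
  done
}

tmpdir=$(mktemp -d)
touch "$tmpdir/too_short.mp4"
maybe_delete "$tmpdir/too_short.mp4"            # dry run: only prints
EXECUTE=1 maybe_delete "$tmpdir/too_short.mp4"  # real run: file removed
```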
- [4] [Speaker Diarization](./speaker_diarization/README.md). Multimodal active-speaker recognition produces RTTM files, identifies the speaker's facial frames, and extracts frame-level face and lip data for each speaker.
```shell
cd speaker_diarization
bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"
```
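The RTTM files follow the standard NIST layout, one speaker turn per line. The snippet below (with made-up data, not real pipeline output) shows how speaker labels and turn boundaries can be read out of one with awk.

```shell
# Each RTTM line has the NIST layout:
#   SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker> <NA> <NA>
# Print "<speaker> <onset> <end>" per turn (illustrative data).
rttm=$(mktemp)
cat > "$rttm" <<'EOF'
SPEAKER ep01 1 3.20 2.50 <NA> <NA> spk01 <NA> <NA>
SPEAKER ep01 1 6.10 1.40 <NA> <NA> spk02 <NA> <NA>
EOF
awk '$1 == "SPEAKER" { printf "%s %.2f %.2f\n", $8, $4, $4 + $5 }' "$rttm"
# prints:
#   spk01 3.20 5.70
#   spk02 6.10 7.50
```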
- (For reference) Extract speech tokens with the CosyVoice3 tokenizer for LLM training.
```shell
python speech_tokenizer.py --root datasets/clean/zh
```
- [5] Multimodal CoT Correction. Built on general-purpose MLLMs, this stage takes audio, ASR text, and RTTM files as input, uses Chain-of-Thought (CoT) reasoning to extract clues, and corrects the outputs of the specialized models. It also annotates each character's age, gender, and vocal timbre. Experiments show that this strategy reduces the CER from 4.53% to 0.94% and the speaker diarization error rate from 8.38% to 1.20%, matching or surpassing manual transcription. Adding `--resume` enables checkpointed CoT inference, so restarted runs do not waste resources repeating finished inferences. Both Chinese and English are now supported.
```shell
python cot.py --root_dir datasets/clean/zh --lang zh --provider google --model gemini-3-pro-preview --api_key xxx --resume
python cot.py --root_dir datasets/clean/en --lang en --provider google --model gemini-3-pro-preview --api_key xxx --resume
```
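The effect of `--resume` can be sketched as follows. The file layout here is assumed for illustration (one hypothetical `<clip>.cot.json` written per finished `<clip>.wav`), not the toolkit's actual layout: a restarted run only has to process clips that still lack an output file.

```shell
# Illustrative resume logic (assumed layout, not the toolkit's): skip clips
# whose CoT output file already exists, so a restart only pays for the rest.
list_pending() {
  local root=$1 wav
  for wav in "$root"/*.wav; do
    [ -e "$wav" ] || continue                  # no .wav files at all
    [ -e "${wav%.wav}.cot.json" ] && continue  # already corrected: skip
    echo "$wav"
  done
}

root=$(mktemp -d)
touch "$root/a.wav" "$root/b.wav" "$root/a.cot.json"
list_pending "$root"   # only b.wav is still pending
```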
- Build the dataset retrieval file: read all produced data and run a bidirectional verification of the script content against the speaker diarization results.
```shell
python build_datasets.py --root_zh datasets/clean/zh --root_en datasets/clean/en --out_dir datasets/clean --save
```
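One way to picture the bidirectional verification: the speaker set named in the script annotations and the set found by diarization should match in both directions. The sketch below uses made-up speaker lists and `comm(1)` to report each one-sided difference; it is an illustration, not `build_datasets.py`'s implementation.

```shell
# Illustrative bidirectional check (made-up data): speakers named in the
# script vs. speakers found by diarization must agree in both directions;
# comm(1) reports each one-sided difference of the two sorted lists.
script_spk=$(mktemp); rttm_spk=$(mktemp)
printf 'spk01\nspk02\n' | sort > "$script_spk"
printf 'spk01\nspk03\n' | sort > "$rttm_spk"

comm -23 "$script_spk" "$rttm_spk"   # in the script but not diarized: spk02
comm -13 "$script_spk" "$rttm_spk"   # diarized but not in the script: spk03
```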
<a name="Dubbing-Model"></a>
## Dubbing Model ⚙️
We have open-sourced the inference code and the **infer.sh** script, and provide test cases in the data folder for you to try. Inference runs on a consumer-grade GPU:
```shell
cd exps
bash infer.sh
```
The API for multi-speaker dubbing from raw videos and SRT scripts is under development ...
<a name="Recent-Updates"></a>
## Recent Updates 📅
- 2025/12/18: Fun-CineForge dataset pipeline toolkit is online! 🔥
- 2026/01/19: Chinese demo samples and CineDub-CN dataset samples released. 🔥
- 2026/01/25: Fixed some environment and runtime issues.
- 2026/02/09: Optimized the data pipeline and added support for English videos.
- 2026/03/05: English demo samples and CineDub-EN dataset samples released. 🔥
- 2026/03/16: Open-sourced inference code and checkpoints. 🔥
<a name="Publication"></a>
## Publication 📄
If you use our dataset or code, please cite the following paper:
<pre>
@misc{liu2026funcineforgeunifieddatasettoolkit,
      title={FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes},
      author={Jiaxuan Liu and Yang Xiang and Han Zhao and Xiangang Li and Zhenhua Ling},
      year={2026},
      eprint={2601.14777},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}
</pre>
<a name="Comminicate"></a>
## Communicate 💬
The Fun-CineForge open-source project is developed and maintained by the Tongyi Lab Speech Team and a student from NERCSLIP, University of Science and Technology of China.
We welcome you to participate in discussions on Fun-CineForge [GitHub Issues](https://github.com/FunAudioLLM/FunCineForge/issues) or contact us for collaborative development.
For any questions, you can contact the [developer](mailto:jxliu@mail.ustc.edu.cn).
⭐ We hope you will support Fun-CineForge. Thank you.
### Disclaimer
This repository contains research artifacts:
- ⚠️ Fun-CineForge is currently not a commercial product of Tongyi Lab.
- ⚠️ It is released for academic research and cutting-edge exploration purposes.
- ⚠️ The CineDub dataset samples are subject to specific license terms.