## 🎬 Fun-CineForge: A Unified Dataset Pipeline and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes

<div align="center">

![license](https://img.shields.io/github/license/modelscope/modelscope.svg)
<a href=""><img src="https://img.shields.io/badge/OS-Linux-orange.svg"></a>
<a href=""><img src="https://img.shields.io/badge/Python->=3.8-aff.svg"></a>
<a href=""><img src="https://img.shields.io/badge/Pytorch->=2.1-blue"></a>
</div>

<div align="center">
<h4><a href="#Dataset&Demo&Code">Dataset & Demo</a>
|<a href="#Dataset-Pipeline">Dataset Pipeline</a>
|<a href="#Dubbing-Model">Dubbing Model</a>
|<a href="#Recent-Updates">Recent Updates</a>
|<a href="#Publication">Publication</a>
|<a href="#Comminicate">Communicate</a>
</h4>
</div>

**Fun-CineForge** contains an end-to-end dataset pipeline for producing large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes.
Using this pipeline, we constructed the first large-scale Chinese television dubbing dataset, CineDub-CN, which includes rich annotations and diverse scenes.
In monologue, narration, dialogue, and multi-speaker scenes, our dubbing model consistently outperforms state-of-the-art methods in audio quality, lip-sync, timbre transition, and instruction following.

<a name="Dataset&Demo&Code"></a>
## Dataset & Demo 🎬
Visit [https://funcineforge.github.io/](https://funcineforge.github.io/) for CineDub-CN dataset samples and demo samples.

GitHub link: [https://github.com/FunAudioLLM/FunCineForge/](https://github.com/FunAudioLLM/FunCineForge/)

Modelscope link: [https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/](https://www.modelscope.cn/models/FunAudioLLM/Fun-CineForge/)

<a name="Dataset-Pipeline"></a>
## Dataset Pipeline 🔨

### Environment Installation

The Fun-CineForge dataset pipeline toolkit requires only a Python environment to run.
```shell
# Conda
git clone git@github.com:FunAudioLLM/FunResearch.git
conda create -n FunCineForge python=3.10 -y && conda activate FunCineForge
sudo apt-get install ffmpeg
# Initial settings
cd FunCineForge
python setup.py
```

### Data collection
If you want to produce your own data, we recommend collecting movies or television series that meet the following requirements:

1. Video source: TV dramas or movies (not documentaries) with frequent monologue or dialogue scenes and clear, unobstructed faces (e.g., no masks or veils).
2. Speech requirements: standard pronunciation, clear articulation, and prominent vocals. Avoid material with strong dialects, heavy background noise, or highly colloquial speech.
3. Image requirements: high resolution, clear facial details, and sufficient lighting; avoid extremely dark or strongly backlit scenes.

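These requirements can be partially enforced with automated pre-filters. As a sketch of requirement 3, the helper below (a hypothetical illustration, not part of the toolkit; the 720-pixel threshold is an assumption) inspects ffprobe-style stream metadata and rejects low-resolution candidates:

```python
import json

MIN_HEIGHT = 720  # assumed threshold for "high resolution"

def is_high_resolution(ffprobe_json: str) -> bool:
    """Return True if the first video stream is at least MIN_HEIGHT pixels tall."""
    streams = json.loads(ffprobe_json)["streams"]
    video = next(s for s in streams if s.get("codec_type") == "video")
    return video["height"] >= MIN_HEIGHT

# Metadata in the shape produced by `ffprobe -print_format json -show_streams`
meta = json.dumps({"streams": [{"codec_type": "video", "width": 1920, "height": 1080}]})
```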
### How to use

- [1] Standardize video formats and names; trim the beginning and end of long videos; extract the audio from the trimmed video. (The default trims 10 seconds from both the beginning and the end.)
```shell
python normalize_trim.py --root datasets/raw_zh --intro 10 --outro 10
```

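Under the hood, trimming amounts to an ffmpeg cut. The sketch below is a hypothetical illustration of the trim logic only (the real `normalize_trim.py` may differ): it builds an ffmpeg command that drops the intro and outro from a video of known duration.

```python
def build_trim_cmd(src: str, dst: str, duration: float,
                   intro: float = 10.0, outro: float = 10.0) -> list[str]:
    """Build an ffmpeg command that drops `intro` seconds from the start
    and `outro` seconds from the end of a video of known `duration`."""
    end = duration - outro
    if end <= intro:
        raise ValueError("video too short to trim")
    # -c copy avoids re-encoding; keyframe alignment may shift cut points slightly
    return ["ffmpeg", "-y", "-ss", str(intro), "-to", str(end),
            "-i", src, "-c", "copy", dst]

cmd = build_trim_cmd("ep01.mp4", "ep01_trimmed.mp4", duration=2700.0)
```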
- [2] [Speech Separation](./speech_separation/README.md). Separates the vocals from the background music in the extracted audio.
```shell
cd speech_separation
python run.py --root datasets/clean/zh --gpus 0 1 2 3
```

- [3] [VideoClipper](./video_clip/README.md). For long videos, VideoClipper produces sentence-level subtitle files and clips the long video into segments based on their timestamps. It now supports both Chinese and English; below is a Chinese example. GPU acceleration is recommended for English.
```shell
cd video_clip
bash run.sh --stage 1 --stop_stage 2 --input datasets/raw_zh --output datasets/clean/zh --lang zh --device cpu
```

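Clipping by timestamp presumes the SRT times can be converted to seconds. A minimal sketch of that conversion (a hypothetical helper, not the toolkit's implementation):

```python
def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:01:23,450' to seconds."""
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000.0

start = srt_time_to_seconds("00:01:23,450")  # 83.45
```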
- Check video durations and clean up invalid clips. (Without --execute, files slated for deletion are only printed; after reviewing the list, add --execute to confirm the deletion.)
```shell
python clean_video.py --root datasets/clean/zh
python clean_srt.py --root datasets/clean/zh --lang zh
```

- [4] [Speaker Diarization](./speaker_diarization/README.md). Multimodal active speaker recognition produces RTTM files, identifies the speaker's facial frames, and extracts frame-level raw data for the speaker's face and lips.
```shell
cd speaker_diarization
bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"
```

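Each SPEAKER line of an RTTM file describes one speech segment (`SPEAKER <file> <channel> <start> <duration> ... <speaker-id> ...`). A minimal sketch of parsing such a line (hypothetical helper, not the toolkit's own parser):

```python
def parse_rttm_line(line: str) -> dict:
    """Parse one SPEAKER line of an RTTM file into file/start/end/speaker fields."""
    fields = line.split()
    assert fields[0] == "SPEAKER", "only SPEAKER records carry segments"
    start, duration = float(fields[3]), float(fields[4])
    return {"file": fields[1], "start": start,
            "end": start + duration, "speaker": fields[7]}

seg = parse_rttm_line("SPEAKER ep01 1 12.30 2.75 <NA> <NA> spk01 <NA> <NA>")
```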
- [5] Multimodal CoT Correction. Built on general-purpose MLLMs, this stage takes audio, ASR text, and RTTM files as input, uses Chain-of-Thought (CoT) reasoning to extract clues, and corrects the outputs of the specialized models. It also annotates character age, gender, and vocal timbre. In our experiments, this strategy reduces the CER from 4.53% to 0.94% and the speaker diarization error rate from 8.38% to 1.20%, achieving quality comparable to or better than manual transcription. Adding --resume enables resumable CoT inference, so interrupted runs do not waste resources repeating completed inferences. Both Chinese and English are now supported.
```shell
python cot.py --root_dir datasets/clean/zh --lang zh --provider google --model gemini-3-pro-preview --api_key xxx --resume
python cot.py --root_dir datasets/clean/en --lang en --provider google --model gemini-3-pro-preview --api_key xxx --resume
python build_datasets.py --root_zh datasets/clean/zh --root_en datasets/clean/en --out_dir datasets/clean --save
```

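The CER figures above follow the standard definition: character-level edit distance divided by reference length. A minimal sketch of that metric (illustrative, not the toolkit's evaluator):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance from hyp to ref, over len(ref)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # one row of the edit-distance table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            # deletion, insertion, substitution/match
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (ref[i - 1] != hyp[j - 1]))
            prev = cur
    return dp[n] / m if m else 0.0
```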
- (Reference) Extract speech tokens based on the CosyVoice3 tokenizer for LLM training.
```shell
python speech_tokenizer.py --root datasets/clean/zh
```

<a name="Dubbing-Model"></a>
## Dubbing Model ⚙️
We have open-sourced the inference code and the **infer.sh** script, and we provide test cases in the data folder for you to try. Inference requires only a consumer-grade GPU. Run the following command:

```shell
cd exps
bash infer.sh
```

The API for multi-speaker dubbing from raw videos and SRT scripts is under development.

<a name="Recent-Updates"></a>
## Recent Updates 🚀
- 2025/12/18: Fun-CineForge dataset pipeline toolkit is online! 🔥
- 2026/01/19: Chinese demo samples and CineDub-CN dataset samples released. 🔥
- 2026/01/25: Fixed some environment and runtime issues.
- 2026/02/09: Optimized the data pipeline and added support for English videos.
- 2026/03/05: English demo samples and CineDub-EN dataset samples released. 🔥
- 2026/03/16: Open-sourced inference code and checkpoints. 🔥

<a name="Publication"></a>
## Publication 📚
If you use our dataset or code, please cite the following paper:
<pre>
@misc{liu2026funcineforgeunifieddatasettoolkit,
      title={FunCineForge: A Unified Dataset Toolkit and Model for Zero-Shot Movie Dubbing in Diverse Cinematic Scenes},
      author={Jiaxuan Liu and Yang Xiang and Han Zhao and Xiangang Li and Zhenhua Ling},
      year={2026},
      eprint={2601.14777},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
}
</pre>

<a name="Comminicate"></a>
## Communicate 🍟
We welcome you to join the discussion on Fun-CineForge [GitHub Issues](https://github.com/FunAudioLLM/FunCineForge/issues) or to contact us for collaborative development.
For any questions, you can contact the [developer](mailto:jxliu@mail.ustc.edu.cn).

### Disclaimer

This repository contains research artifacts:

⚠️ Currently not a commercial product of Tongyi Lab.

⚠️ Released for academic research / cutting-edge exploration purposes.

⚠️ CineDub dataset samples are subject to specific license terms.