<div align="center">
<h4><a href="#Open-Source">Open Source</a>
|<a href="#Environment">Environment</a>
|<a href="#Dataset-Pipeline">Dataset Pipeline</a>
|<a href="#Dubbing-Model">Dubbing Model</a>
|<a href="#Recent-Updates">Recent Updates</a>
</h4>
</div>

**Fun-CineForge** contains an end-to-end dataset pipeline for producing large-scale dubbing datasets and an MLLM-based dubbing model designed for diverse cinematic scenes. Using this pipeline, we constructed the first large-scale Chinese television dubbing dataset, CineDub-CN, which includes rich annotations and diverse scenes. In monologue, narration, dialogue, and multi-speaker scenes, our dubbing model consistently outperforms state-of-the-art methods in terms of audio quality, lip sync, timbre transition, and instruction following.

<a name="Open-Source"></a>
## Open Source 🎬

You can access [https://funcineforge.github.io/](https://funcineforge.github.io/).

GitHub link: [https://github.com/FunAudioLLM/FunCineForge/](https://github.com/FunAudioLLM/FunCineForge/)

HuggingFace link: [https://huggingface.co/FunAudioLLM/Fun-CineForge/](https://huggingface.co/FunAudioLLM/Fun-CineForge/)

CineDub Samples:
[huggingface](https://huggingface.co/datasets/FunAudioLLM/CineDub-Example/)
[modelscope](https://www.modelscope.cn/datasets/FunAudioLLM/CineDub-Example)

<a name="Environment"></a>
## Environment Installation

Fun-CineForge relies on a Conda-based Python environment. Running **setup.py** automatically installs the project environment and the required open-source models.

```shell
# Conda
git clone git@github.com:FunAudioLLM/FunCineForge.git
...
sudo apt-get install ffmpeg
python setup.py
```

<a name="Dataset-Pipeline"></a>
## Dataset Pipeline 🔨

### Data collection
If you want to produce your own data, we recommend that you refer to the following requirements when collecting the corresponding movies or television series.

Run the speaker diarization stages (stages 1–4):
```shell
cd speaker_diarization
bash run.sh --stage 1 --stop_stage 4 --hf_access_token hf_xxx --root datasets/clean/zh --gpus "0 1 2 3"
```
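
The diarization output is stored as RTTM files, which the CoT-correction step later consumes. RTTM is a plain-text NIST format with one `SPEAKER` record per speaker turn. As a minimal illustration (this parser is not part of the toolkit), the relevant fields can be read like this:

```python
from typing import List, Tuple

def parse_rttm(lines: List[str]) -> List[Tuple[str, float, float, str]]:
    """Parse SPEAKER records into (file_id, start, end, speaker) tuples."""
    turns = []
    for line in lines:
        fields = line.split()
        # Only SPEAKER records describe speaker turns; skip everything else.
        if not fields or fields[0] != "SPEAKER":
            continue
        file_id = fields[1]
        onset, dur = float(fields[3]), float(fields[4])  # onset and duration in seconds
        speaker = fields[7]
        turns.append((file_id, onset, onset + dur, speaker))
    return turns

rttm = [
    "SPEAKER ep01 1 12.50 3.25 <NA> <NA> spk0 <NA> <NA>",
    "SPEAKER ep01 1 16.25 1.50 <NA> <NA> spk1 <NA> <NA>",
]
print(parse_rttm(rttm))  # [('ep01', 12.5, 15.75, 'spk0'), ('ep01', 16.25, 17.75, 'spk1')]
```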
|
| 104 |
|
| 105 |
+
- (Reference) Extract speech tokens based on the CosyVoice3 tokenizer for llm training.
|
| 106 |
+
```shell
|
| 107 |
+
python speech_tokenizer.py --root datasets/clean/zh
|
| 108 |
+
```

- [5] Multimodal CoT Correction. Built on general-purpose MLLMs, this step takes audio, ASR text, and RTTM files as input. It leverages Chain-of-Thought (CoT) reasoning to extract clues and correct the outputs of the specialized models, and it also annotates character age, gender, and vocal timbre. Experimental results show that this strategy reduces the CER from 4.53% to 0.94% and the speaker diarization error rate from 8.38% to 1.20%, achieving quality comparable to or even better than manual transcription. Adding `--resume` enables resumable (breakpoint) CoT inference, so interrupted runs do not repeat already-completed CoT calls. Both Chinese and English are now supported.
```shell
python cot.py --root_dir datasets/clean/zh --lang zh --provider google --model gemini-3-pro-preview --api_key xxx --resume
python cot.py --root_dir datasets/clean/en --lang en --provider google --model gemini-3-pro-preview --api_key xxx --resume
```
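
For reference, the CER figures above are character error rates: the character-level edit distance from hypothesis to reference, normalized by reference length. A minimal, self-contained illustration of the metric (not code from this repository):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (free on match)
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits needed to turn hyp into ref, per reference character."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("abcd", "abed"))  # one substitution over four characters -> 0.25
```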

- Dataset retrieval file construction: read all produced data and perform bidirectional verification between the script content and the speaker diarization results.
```shell
python build_datasets.py --root_zh datasets/clean/zh --root_en datasets/clean/en --out_dir datasets/clean --save
```
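
One way to picture bidirectional verification is as a time-overlap check in both directions: every scripted line should be covered by some diarized turn, and every diarized turn should correspond to a scripted line. The sketch below is a simplified illustration under assumed `(start, end)` segment tuples, not the toolkit's actual implementation:

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def overlaps(a: Segment, b: Segment, min_overlap: float = 0.1) -> bool:
    """True if the two segments share at least min_overlap seconds."""
    return min(a[1], b[1]) - max(a[0], b[0]) >= min_overlap

def bidirectional_check(script: List[Segment],
                        diar: List[Segment]) -> Tuple[List[Segment], List[Segment]]:
    """Return the segments left unmatched in each direction.

    A clean clip yields two empty lists; anything else flags a
    script/diarization mismatch to inspect or discard.
    """
    unmatched_script = [s for s in script if not any(overlaps(s, d) for d in diar)]
    unmatched_diar = [d for d in diar if not any(overlaps(d, s) for s in script)]
    return unmatched_script, unmatched_diar

# A diarized turn at 5.0-6.0 s with no scripted line is flagged:
script = [(0.0, 1.2), (1.5, 3.0)]
diar = [(0.1, 1.1), (1.4, 3.1), (5.0, 6.0)]
print(bidirectional_check(script, diar))  # ([], [(5.0, 6.0)])
```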

<a name="Dubbing-Model"></a>

<a name="Comminicate"></a>
## Communicate 🍟

The Fun-CineForge open-source project is developed and maintained by the Tongyi Lab Speech Team and a student from NERCSLIP, University of Science and Technology of China.
We welcome you to join the discussions in Fun-CineForge [GitHub Issues](https://github.com/FunAudioLLM/FunCineForge/issues) or to contact us for collaborative development.
For any questions, you can contact the [developer](mailto:jxliu@mail.ustc.edu.cn).

⭐ We hope you will support Fun-CineForge. Thank you!

### Disclaimer

This repository contains research artifacts: