---
license: apache-2.0
pipeline_tag: image-text-to-video
---
## Installation
### Requirements
- **Python** 3.11.2.
- **CUDA GPU**: a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3
can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
- **CUDA toolkit** 12.4 (matches the pinned `torch==2.5.1+cu124`; 12.3+ is the
minimum if you build FlashAttention-3).
- Pinned in `requirements.txt`: `torch==2.5.1+cu124`, `diffusers==0.35.2`,
`accelerate==0.34.2`, `transformers==4.57.3`.
Reference environment (Bernini-R is developed and tested on this setup):
| Component | Version |
|-----------|--------------|
| GPU | NVIDIA H100 |
| CUDA | 12.4 |
| Python | 3.11.2 |
| PyTorch | 2.5.1+cu124 |
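
Once the dependencies are installed, the pins listed above can be compared against the active environment. The following is a minimal sketch, not part of the repository: the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes the distribution names on your system match the package names used in `requirements.txt`.

```python
from importlib import metadata

# Pins copied from the Requirements list above.
PINNED = {
    "torch": "2.5.1+cu124",
    "diffusers": "0.35.2",
    "accelerate": "0.34.2",
    "transformers": "4.57.3",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned=PINNED, lookup=installed_version):
    """Map each package that differs from its pin to (expected, found)."""
    mismatches = {}
    for package, expected in pinned.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches().items():
        print(f"{package}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the pinned version; it says nothing about the CUDA toolkit or the GPU.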
### Install
```bash
git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt
```
Optional extras:
- **Multi-GPU sequence parallel** needs [Open-VeOmni](https://github.com/ByteDance-Seed/VeOmni)
(Apache-2.0, Python 3.11). Use `--no-deps` so VeOmni does not pull in a
different torch build and override the pinned `torch==2.5.1+cu124`:
`pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10`.
Single-GPU inference does not need it.
- **Faster attention** (auto-detected if installed; otherwise PyTorch SDPA is used):
  - FlashAttention-2, for general CUDA GPUs (incl. A100/A800): `pip install flash-attn==2.8.3`.
  - FlashAttention-3, Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4).
    `flash_attn_interface` is not on PyPI; build it from the
    [flash-attention](https://github.com/Dao-AILab/flash-attention) repo's
    `hopper/` directory at tag `v2.8.3`:
```bash
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && git checkout v2.8.3
cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user
```
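
To see which attention backend an environment is likely to end up with, the fallback order described above can be sketched as follows. This is an illustration only and not the detection code Bernini-R uses: `module_available` and `pick_attention_backend` are hypothetical names, and the sketch assumes that the presence of the `flash_attn_interface` module (FlashAttention-3) or the `flash_attn` module (FlashAttention-2) is what decides the choice.

```python
import importlib.util


def module_available(name):
    """True if the top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def pick_attention_backend(is_hopper, available=module_available):
    """Pick the fastest installed backend, in the fallback order described above."""
    # FlashAttention-3 is only usable on Hopper GPUs.
    if is_hopper and available("flash_attn_interface"):
        return "flash-attention-3"
    # FlashAttention-2 works on general CUDA GPUs.
    if available("flash_attn"):
        return "flash-attention-2"
    # Nothing extra installed: fall back to PyTorch SDPA.
    return "sdpa"


if __name__ == "__main__":
    # `is_hopper` is set by hand here; adjust it for your GPU.
    print(pick_attention_backend(is_hopper=True))
```

On a real machine, `is_hopper` could be derived from `torch.cuda.get_device_capability()`, which reports major version 9 on Hopper GPUs.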
### Weights
Bernini-R provides two ways to obtain the renderer weights. The **diffusers
format is recommended**: it is a self-contained directory whose
`transformer` / `transformer_2` components already hold the Bernini-R weights, so you point
`--config` at it and the weights load directly, with **no** `--high_noise_ckpt` /
`--low_noise_ckpt` needed.
#### Option A: diffusers format (recommended)
A single ready-to-use diffusers-format model from
[`ByteDance/Bernini-R-Diffusers`](https://huggingface.co/ByteDance/Bernini-R-Diffusers).
It bundles the Wan2.2 base components (VAE, UMT5 text encoder, tokenizer) together
with the Bernini-R transformer weights, so nothing else is downloaded at runtime.
```bash
pip install -U "huggingface_hub"
hf download ByteDance/Bernini-R-Diffusers --local-dir Bernini-R-Diffusers
```
Then pass it via `--config` and omit the checkpoint flags, e.g.:
```bash
python infer_single_gpu.py --config Bernini-R-Diffusers \
--case assets/testcases/t2i/t2i.json --num_frames 1
```
#### Option B: separate checkpoints
The original layout, where Bernini-R uses two sets of weights loaded separately:
1. **Wan2.2 base**: [`Wan-AI/Wan2.2-T2V-A14B-Diffusers`](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers) on Hugging Face. Supplies the
   VAE, UMT5 text encoder, tokenizer, and the transformer architecture/base weights.
   It is downloaded automatically on first run (configured by `wan22_base` in
   `configs/bernini_renderer_wan22/config.json`).
2. **Bernini-R checkpoint**: the trained high-noise / low-noise transformer weights
   (safetensors) from [ByteDance/Bernini-R](https://huggingface.co/ByteDance/Bernini-R), passed with
   `--high_noise_ckpt` / `--low_noise_ckpt`. Both a local directory and a Hugging
   Face repo id are accepted.
Download both with the `hf` CLI (installed with `huggingface_hub`):
```bash
pip install -U "huggingface_hub"
hf download Wan-AI/Wan2.2-T2V-A14B-Diffusers --local-dir Wan2.2-T2V-A14B-Diffusers
hf download ByteDance/Bernini-R --local-dir Bernini-R
```
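
The difference between the two options comes down to which flags reach `infer_single_gpu.py`. The sketch below assembles the argument list for either layout. `build_infer_command` is a hypothetical helper, not part of the repository, and the rule that the two checkpoint flags must be given together is an assumption rather than documented behaviour.

```python
import shlex


def build_infer_command(config, case, high_noise_ckpt=None,
                        low_noise_ckpt=None, num_frames=None):
    """Assemble an infer_single_gpu.py argument list for either weight layout."""
    # Assumption: the checkpoint flags travel together, both (Option B)
    # or neither (Option A).
    if (high_noise_ckpt is None) != (low_noise_ckpt is None):
        raise ValueError("pass both checkpoint paths or neither")
    cmd = ["python", "infer_single_gpu.py", "--config", config, "--case", case]
    if high_noise_ckpt is not None:
        cmd += ["--high_noise_ckpt", high_noise_ckpt,
                "--low_noise_ckpt", low_noise_ckpt]
    if num_frames is not None:
        cmd += ["--num_frames", str(num_frames)]
    return cmd


if __name__ == "__main__":
    # Option A: the diffusers-format directory carries the weights itself.
    option_a = build_infer_command(
        "Bernini-R-Diffusers", "assets/testcases/t2i/t2i.json", num_frames=1)
    print(shlex.join(option_a))
```

For Option B, pass the two checkpoint locations as `high_noise_ckpt` and `low_noise_ckpt`; the exact paths inside the downloaded `Bernini-R` directory, and the `--config` value to pair with them, are not spelled out here, so check the repository before relying on specific values.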
## Usage
A run is described by a **case file**: a small JSON file under
[`assets/testcases/`](assets/testcases/) that bundles one task's routing and
inputs (`task_type`, `guidance_mode`, `prompt`, source media, `output`). This
keeps long prompts out of the command line. Each task has its own directory under
`assets/testcases/`, holding one or more case files; see
[`assets/testcases/`](assets/testcases/) for the format and the bundled
`t2i` / `i2i` / `t2v` / `v2v` / `rv2v` / `r2v` examples.
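
Before launching a long run, it can help to confirm that a case file carries the fields named above. The following is a rough sketch under two assumptions: that a case file is a single JSON object, and that `task_type`, `guidance_mode`, `prompt` and `output` are top-level keys. The authoritative format is the one documented in `assets/testcases/`, and `missing_case_fields` is a hypothetical helper.

```python
import json
from pathlib import Path

# Fields named in the description above. The real schema may contain more,
# for example the source-media entries, whose key names are not listed here.
EXPECTED_FIELDS = ("task_type", "guidance_mode", "prompt", "output")


def missing_case_fields(path):
    """Return the expected fields that are absent from the case file at `path`."""
    case = json.loads(Path(path).read_text(encoding="utf-8"))
    return [field for field in EXPECTED_FIELDS if field not in case]


if __name__ == "__main__":
    # Path taken from the bundled t2i example used elsewhere in this README.
    print(missing_case_fields("assets/testcases/t2i/t2i.json"))
```

An empty list means all four fields are present; it does not validate their values.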
### Prompt enhancer (highly recommended)
`--use_pe` enhances the prompt through an OpenAI-compatible endpoint and is
recommended for best generation quality. The `openai` SDK is installed by
`requirements.txt`; configure the endpoint with environment variables:
```bash
export BERNINI_PE_API_KEY=... # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=... # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=... # vision-capable chat model
```
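
The comments in the block above imply a fallback from the `BERNINI_PE_*` names to the standard `OPENAI_*` names. A minimal sketch of that lookup is shown here. It assumes the `BERNINI_PE_*` variables take precedence when both are set, which is not stated explicitly, and `resolve_pe_settings` / `unset_pe_settings` are hypothetical helpers rather than functions from the repository.

```python
import os


def resolve_pe_settings(env=None):
    """Read prompt-enhancer settings, preferring BERNINI_PE_* over OPENAI_*."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("BERNINI_PE_API_KEY") or env.get("OPENAI_API_KEY"),
        "base_url": env.get("BERNINI_PE_BASE_URL") or env.get("OPENAI_BASE_URL"),
        "model": env.get("BERNINI_PE_MODEL"),
    }


def unset_pe_settings(settings):
    """Names of settings that are unset or empty."""
    return [name for name, value in settings.items() if not value]


if __name__ == "__main__":
    unset = unset_pe_settings(resolve_pe_settings())
    if unset:
        print("not set:", ", ".join(unset))
```

Which of these settings are mandatory for `--use_pe` is not specified here, so treat the output as a reminder rather than a hard check.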
### Examples by task type
Unless an example specifies otherwise, inference outputs **480p / 16fps** (the
defaults: `--max_image_size 848`, `--fps 16`).
Each example runs a bundled case in
[`assets/testcases/`](assets/testcases/); replace `