	---
	license: apache-2.0
	library_name: diffusers
	pipeline_tag: image-to-video
	tags:
	- wan
	- video-generation
	- image-to-video
	- diffusers
	base_model: alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP
	---

	# Wan2.1-Fun-V1.1-1.3B-InP (Diffusers)

	This is a Diffusers-format conversion of [alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP](https://huggingface.co/alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP) (Wan-Fun Inpaint V1.1, 1.3B), which was originally released in VideoX-Fun format.

	## Model Details

	- Architecture: WanTransformer3DModel with `in_channels=36` (16 noise + 4 mask + 16 image latent)
	- Parameters: 1.3B
	- Pipeline: `WanImageToVideoPipeline` (standard diffusers, no patching required)
	- Resolution: 480x832 (480p) recommended
	- Frames: 49 frames at 16fps (~3 seconds)

	This model has the same I2V architecture as the official [Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P-Diffusers) (`in_channels=36`), but at 1.3B scale.
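
The channel and frame figures above can be checked with a little arithmetic. The sketch below assumes the Wan 2.1 VAE uses 16 latent channels and compresses 8x spatially and 4x temporally; those factors are not stated in this card. `latent_shape` and `INPUT_CHANNELS` are illustrative names, not part of diffusers.

```python
# Assumed Wan 2.1 VAE factors (not stated in this model card).
VAE_SPATIAL = 8       # height and width are divided by 8
VAE_TEMPORAL = 4      # frames after the first are compressed 4x
LATENT_CHANNELS = 16

# Channel layout taken from the Model Details list.
INPUT_CHANNELS = {"noise": 16, "mask": 4, "image_latent": 16}


def latent_shape(num_frames: int, height: int, width: int) -> tuple:
    """Return (channels, latent_frames, latent_height, latent_width)."""
    if (num_frames - 1) % VAE_TEMPORAL != 0:
        raise ValueError("num_frames should have the form 4k + 1")
    if height % VAE_SPATIAL or width % VAE_SPATIAL:
        raise ValueError("height and width should be multiples of 8")
    latent_frames = (num_frames - 1) // VAE_TEMPORAL + 1
    return (LATENT_CHANNELS, latent_frames,
            height // VAE_SPATIAL, width // VAE_SPATIAL)


total_in_channels = sum(INPUT_CHANNELS.values())  # matches in_channels=36
shape = latent_shape(49, 480, 832)                # recommended settings
duration_s = 49 / 16                              # roughly 3 seconds
```

Under these assumptions, 49 frames at 480x832 map to 13 latent frames of 60x104, and the three channel groups add up to the transformer's `in_channels`.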

	## Usage

	```python
	import torch
	from diffusers import WanImageToVideoPipeline
	from diffusers.utils import export_to_video
	from PIL import Image

	# Load the converted checkpoint in bfloat16.
	pipe = WanImageToVideoPipeline.from_pretrained(
	    "engineerA314/Wan2.1-Fun-V1.1-1.3B-InP-Diffusers",
	    torch_dtype=torch.bfloat16,
	)
	# Offload submodules to the CPU between uses to lower peak GPU memory.
	pipe.enable_sequential_cpu_offload()

	# The first frame that conditions the generated video.
	image = Image.open("first_frame.png").convert("RGB")

	output = pipe(
	    image=image,
	    prompt="A person is talking naturally",
	    negative_prompt="static, blurred, low quality",
	    height=480,
	    width=832,
	    num_frames=49,
	    num_inference_steps=50,
	    guidance_scale=5.0,
	)

	# Save the frames of the first (and only) generated video at 16 fps.
	export_to_video(output.frames[0], "output.mp4", fps=16)
	```
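
If the input image is not already 832x480, one option is to pick a target size that keeps its aspect ratio while staying near the recommended 480x832 pixel area. The helper below is a sketch, not part of diffusers: `fit_to_area` is a hypothetical name, and `multiple=16` assumes 8x VAE downsampling combined with a 2x2 transformer patch size.

```python
import math


def fit_to_area(img_width: int, img_height: int,
                max_area: int = 480 * 832, multiple: int = 16) -> tuple:
    """Return (width, height) close to max_area with the same aspect ratio.

    Both sides are rounded down to a multiple of `multiple` (assumed to be 16).
    """
    aspect = img_height / img_width
    height = round(math.sqrt(max_area * aspect)) // multiple * multiple
    width = round(math.sqrt(max_area / aspect)) // multiple * multiple
    return width, height


# A square 1024x1024 input is mapped to a smaller square near the same area.
square_size = fit_to_area(1024, 1024)
```

With a PIL image this could be used as `width, height = fit_to_area(*image.size)`, followed by `image = image.resize((width, height))` and passing the same `height` and `width` to the pipeline call.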

	## Conversion Details

	The transformer was converted from the VideoX-Fun format with a 1:1 weight-key mapping (983 keys). No architectural modifications were needed: the standard `WanImageToVideoPipeline` handles `in_channels=36` natively.
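
As an illustration of what a 1:1 key mapping involves, the sketch below renames every key of a state dict and fails if a key has no mapping or if two source keys collide on one target. It is not the conversion script used for this repository; `remap_keys` and the example key names are hypothetical.

```python
def remap_keys(state_dict: dict, key_map: dict) -> dict:
    """Rename every key in `state_dict` via `key_map`, enforcing a 1:1 mapping."""
    missing = [k for k in state_dict if k not in key_map]
    if missing:
        raise KeyError(f"no mapping for {len(missing)} keys, e.g. {missing[0]}")
    targets = [key_map[k] for k in state_dict]
    if len(set(targets)) != len(targets):
        raise ValueError("mapping is not 1:1: two source keys share a target")
    # Values (tensors in practice) are passed through unchanged.
    return {key_map[k]: v for k, v in state_dict.items()}


# Hypothetical key names and placeholder values, for illustration only.
example_map = {
    "blocks.0.self_attn.q.weight": "blocks.0.attn1.to_q.weight",
    "blocks.0.self_attn.k.weight": "blocks.0.attn1.to_k.weight",
}
example_weights = {
    "blocks.0.self_attn.q.weight": [1.0],
    "blocks.0.self_attn.k.weight": [2.0],
}
converted = remap_keys(example_weights, example_map)
```

Because only names change, the number of keys before and after conversion should be identical, which is a cheap sanity check on a real checkpoint.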

	### Components

	| Component | Source |
	|-----------|--------|
	| Transformer | Converted from `alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP` |
	| VAE | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
	| Text Encoder | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` (UMT5-XXL) |
	| Image Encoder | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` (CLIP ViT-H-14) |
	| Scheduler | UniPCMultistepScheduler (`flow_shift=3.0`) |
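
For intuition about `flow_shift=3.0`: flow-matching schedulers commonly remap each noise level sigma in [0, 1] to `shift * sigma / (1 + (shift - 1) * sigma)`, which keeps more of the sampling steps at high noise. The sketch below assumes that this is the formula the scheduler applies; `shift_sigma` is an illustrative helper, not a diffusers function.

```python
def shift_sigma(sigma: float, shift: float = 3.0) -> float:
    """Apply the assumed flow-shift remapping to one noise level in [0, 1]."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Evenly spaced noise levels before and after shifting.
sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]
shifted = [shift_sigma(s) for s in sigmas]
```

The endpoints 0 and 1 are unchanged, while intermediate levels move toward 1; a shift of 1.0 would leave the schedule as it is.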

	### Comparison with TI2V variant

	| | This model (InP) | [TI2V](https://huggingface.co/engineerA314/Wan2.1-Fun-V1.1-1.3B-TI2V-Diffusers) |
	|---|---|---|
	| `in_channels` | 36 (noise + mask + image) | 32 (noise + image) |
	| Pipeline patches | None needed | `prepare_latents` override required |
	| Origin | Wan-Fun Inpaint | Wan-Fun Camera Control (adapter removed) |

	## Acknowledgements

	- [Alibaba PAI / VideoX-Fun](https://github.com/alibaba-pai/VideoX-Fun) for the original Wan-Fun models
	- [Wan-Video](https://github.com/Wan-Video/Wan2.1) for the Wan 2.1 architecture