	---
	license: other
	license_name: circlestone-labs-non-commercial-license
	base_model:
	- circlestone-labs/Anima
	pipeline_tag: text-to-image
	library_name: diffusers
	tags:
	- diffusers
	- safetensors
	- sdnq
	- anima
	- cosmos
	- text-to-image
	- uint4
	---

	# Anima Preview 3 SDNQ UINT4 Diffusers Checkpoint

	4-bit (`uint4`) static SDNQ quantization of the Anima Preview 3 diffusion transformer, packaged as a full Diffusers pipeline. Of the three checkpoints compared here, this one is the smallest on disk and has the lowest VRAM footprint; the companion checkpoints are listed in the benchmark table below.

	This repository is a separate full Diffusers checkpoint for `circlestone-labs/Anima` Preview 3. The pipeline code and non-transformer components are based on the public Diffusers conversion `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers`. The `transformer/` component is the WaveCut SDNQ-quantized diffusion transformer converted from `WaveCut/Anima-Preview-3-SDNQ-uint4`.

	## Components

	- `transformer/`: SDNQ `uint4` quantized `CosmosTransformer3DModel`.
	- `llm_adapter/`: Anima LLM adapter required by the native Anima architecture.
	- `text_encoder/`: Qwen3 0.6B text encoder from the Diffusers conversion.
	- `tokenizer/` and `t5_tokenizer/`: Qwen and T5 tokenizers used by the adapter pathway.
	- `vae/`: Qwen Image / Wan-style VAE used by Anima.
	- `scheduler/`: `FlowMatchEulerDiscreteScheduler` with shift 3.0.
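For orientation on the `shift 3.0` setting: with static shifting, a flow-matching scheduler remaps each noise level `sigma` in [0, 1] as `shift * sigma / (1 + (shift - 1) * sigma)`. The sketch below reproduces that remapping in plain Python. The name `shift_sigma` and the five sample points are illustrative only, and the sketch assumes the config in `scheduler/` uses static rather than dynamic shifting.

```python
def shift_sigma(sigma: float, shift: float = 3.0) -> float:
    """Remap a noise level in [0, 1] with a static flow-matching shift."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


# Evenly spaced noise levels before shifting (illustrative, not the real schedule).
raw_sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]
shifted_sigmas = [round(shift_sigma(s), 4) for s in raw_sigmas]

for raw, shifted in zip(raw_sigmas, shifted_sigmas):
    print(f"{raw:.2f} -> {shifted:.4f}")
```

With a shift above 1, intermediate noise levels move upward, so more of the step budget is spent at high noise.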

	## Usage

	Install current Diffusers/Transformers plus SDNQ support, then load the pipeline:

	```python
	import torch
	import sdnq  # imported for its side effect: registers SDNQ support with Diffusers/Transformers
	from diffusers import DiffusionPipeline

	# Load the full pipeline; the custom pipeline class lives in pipeline.py in this repo.
	pipe = DiffusionPipeline.from_pretrained(
	    "WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers",
	    custom_pipeline="pipeline",
	    torch_dtype=torch.bfloat16,
	    trust_remote_code=True,
	).to("cuda")

	prompt = "masterpiece, best quality, score_7, safe, 1girl, fern (sousou no frieren), purple hair, purple eyes, black robe, white dress, butterfly on hand, simple background, looking at viewer"
	negative_prompt = "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, artist name"

	# A fixed seed makes the result reproducible.
	image = pipe(
	    prompt=prompt,
	    negative_prompt=negative_prompt,
	    width=1024,
	    height=1024,
	    num_inference_steps=30,
	    guidance_scale=4.0,
	    generator=torch.Generator(device="cuda").manual_seed(424242),
	).images[0]
	```

	Because the Anima pipeline is custom code, pass `custom_pipeline="pipeline"`; `trust_remote_code=True` allows Diffusers to load `pipeline.py` from this repo.

	## Prompting

	Anima was trained on Danbooru-style tags, natural language captions, and mixtures of both. The upstream Anima Preview 3 card recommends about 1MP generation, for example `1024x1024`, `896x1152`, or `1152x896`, with roughly 30-50 steps and CFG 4-5.
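As a quick sanity check on the "about 1MP" guidance, the sketch below compares the three listed resolutions against a 1024x1024 pixel budget. The helper names and the 10% tolerance are assumptions made for illustration; the upstream card does not define a tolerance.

```python
RECOMMENDED_SIZES = [(1024, 1024), (896, 1152), (1152, 896)]
TARGET_PIXELS = 1024 * 1024  # the "about 1MP" reference point


def pixel_ratio(width: int, height: int) -> float:
    """Pixel count relative to 1024x1024."""
    return width * height / TARGET_PIXELS


def is_near_target(width: int, height: int, tolerance: float = 0.10) -> bool:
    """True if the pixel count is within `tolerance` of the target (assumed threshold)."""
    return abs(pixel_ratio(width, height) - 1.0) <= tolerance


for w, h in RECOMMENDED_SIZES:
    print(f"{w}x{h}: {w * h} px, ratio {pixel_ratio(w, h):.3f}")
```

The same check can be applied to any other width and height before passing them to the pipeline.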

	Recommended positive prefix:

	```text
	masterpiece, best quality, score_7, safe,
	```

	Recommended negative prompt:

	```text
	worst quality, low quality, score_1, score_2, score_3, artist name
	```

	Use lowercase tags with spaces instead of underscores, except score tags such as `score_7`. For artist tags, prefix the artist with `@`.
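The tag conventions above can be applied mechanically. The following is a minimal sketch, not part of the pipeline: `normalize_tag` and `build_prompt` are hypothetical helpers, the artist name in the example is made up, and placing artist tags at the end of the prompt is an assumption.

```python
QUALITY_PREFIX = "masterpiece, best quality, score_7, safe"


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and replace underscores with spaces, except for score tags."""
    tag = tag.strip().lower()
    if tag.startswith("score_"):
        return tag
    return tag.replace("_", " ")


def build_prompt(tags, artists=()):
    """Join the quality prefix, normalized tags, and '@'-prefixed artist tags."""
    parts = [QUALITY_PREFIX]
    parts += [normalize_tag(t) for t in tags]
    # Strip any existing '@' first so the prefix is never doubled.
    parts += ["@" + normalize_tag(a).lstrip("@") for a in artists]
    return ", ".join(parts)


print(build_prompt(["1girl", "Purple_Hair", "looking_at_viewer"], artists=["Example_Artist"]))
```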

	## 1024x1024 Comparison Grid

	Five prompt/seed pairs were generated with the original BF16 Diffusers checkpoint, this UINT4 checkpoint, and the companion INT8 checkpoint. The source JPEG is `3572x5576`; every generated cell is exactly `1024x1024` and pasted 1:1 with no resizing.

	![Anima Original BF16 vs SDNQ UINT4 and INT8 1024x1024 grid](images/anima_original_uint4_int8_grid_5x3_1024x1024_1to1.jpg)

	Prompt IDs and seeds are printed in the left column of the grid. Raw benchmark data is available in [`benchmarks/benchmark_results_1024.json`](benchmarks/benchmark_results_1024.json).

	## Benchmark

	Measured on an RTX 5090 32GB with `torch 2.8.0+cu128`, `diffusers 0.38.0`, `transformers 5.8.1`, `sdnq 0.1.8`, `torch.bfloat16`, 24 steps, CFG 4.0, and 1024x1024 output. Network download is excluded. Each model was loaded in a separate process; one 1024x1024 warm-up image was discarded, then five prompt/seed pairs were measured. VRAM was sampled with `nvidia-smi` every 50 ms.

	| Model | Repo | Size | Load time | Mean generation | Speed vs original | VRAM after load | Peak VRAM while generating |
	| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
	| Original BF16 | `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers` | 5.3 GiB | 10.04s | 6.37s/img | 1.00x | 6005 MiB | 10759 MiB |
	| SDNQ UINT4 | `WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers` | 2.7 GiB (-49.1%) | 11.96s | 6.13s/img | 1.04x (+3.9%) | 3285 MiB (-45.3%) | 8157 MiB (-24.2%) |
	| SDNQ INT8 | `WaveCut/Anima-Preview-3-SDNQ-int8-diffusers` | 3.5 GiB (-34.1%) | 22.41s | 4.60s/img | 1.38x (+38.4%) | 4111 MiB (-31.5%) | 8961 MiB (-16.7%) |
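The VRAM columns come from polling `nvidia-smi` at a 50 ms interval. The benchmark script itself is not included in this card, so the following is only a sketch of how such sampling might look. `PeakVramSampler` and `read_used_mib` are hypothetical names. The sketch reads whole-GPU usage for the first GPU rather than per-process usage, and each `nvidia-smi` call adds its own latency on top of the interval.

```python
import subprocess
import threading

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def read_used_mib() -> int:
    """Return used memory in MiB for the first GPU listed by nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return int(out.splitlines()[0].strip())


class PeakVramSampler:
    """Context manager that polls a reader in a background thread and keeps the maximum."""

    def __init__(self, read=read_used_mib, interval_s: float = 0.05):
        self._read = read
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.peak_mib = 0

    def _run(self):
        while not self._stop.is_set():
            self.peak_mib = max(self.peak_mib, self._read())
            self._stop.wait(self._interval_s)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        return False
```

Wrapping the generation call in `with PeakVramSampler() as sampler:` and reading `sampler.peak_mib` afterwards gives a peak figure comparable in spirit to the last column, assuming nothing else is using the GPU.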

	Quant-to-quant tradeoff in this run: UINT4 is 22.7% smaller than INT8 and uses 826 MiB less VRAM after load plus 804 MiB less peak generation VRAM. INT8 is 1.33x faster than UINT4 on this RTX 5090 setup.
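The VRAM percentages and speed ratios can be recomputed from the table values, as the sketch below does. The helper names are illustrative. The size percentages and the INT8 `+38.4%` figure are presumably derived from unrounded measurements, so recomputing them from the rounded table entries can differ by about 0.1 to 0.2 points; they are left out here.

```python
def pct_change(value: float, baseline: float) -> float:
    """Relative change versus baseline, in percent, rounded to one decimal."""
    return round((value / baseline - 1.0) * 100.0, 1)


def speedup(baseline_s: float, value_s: float) -> float:
    """How many times faster `value_s` is than `baseline_s`, rounded to two decimals."""
    return round(baseline_s / value_s, 2)


# Values copied from the benchmark table (MiB and seconds per image).
vram_after_load = {"bf16": 6005, "uint4": 3285, "int8": 4111}
peak_vram = {"bf16": 10759, "uint4": 8157, "int8": 8961}
sec_per_img = {"bf16": 6.37, "uint4": 6.13, "int8": 4.60}

for name in ("uint4", "int8"):
    print(
        name,
        pct_change(vram_after_load[name], vram_after_load["bf16"]),
        pct_change(peak_vram[name], peak_vram["bf16"]),
        speedup(sec_per_img["bf16"], sec_per_img[name]),
    )

# Quant-to-quant comparison.
print(vram_after_load["int8"] - vram_after_load["uint4"])  # MiB saved after load
print(peak_vram["int8"] - peak_vram["uint4"])              # MiB saved at peak
print(speedup(sec_per_img["uint4"], sec_per_img["int8"]))  # INT8 vs UINT4
```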

	## Notes

	The original Anima split checkpoint is a ComfyUI-native model with a Qwen3 text encoder and a learned LLM adapter. Earlier transformer-only exports that load the checkpoint directly as `CosmosTransformer3DModel` ignore the `llm_adapter.*` weights; this repo keeps the adapter and full pipeline structure so generation follows the Anima architecture.
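One way to see whether a checkpoint carries adapter weights that a transformer-only load would skip is to partition its parameter names by the `llm_adapter.` prefix. The sketch below works on a plain list of key names; `split_adapter_keys` is a hypothetical helper, and the example keys are invented for illustration rather than taken from the real checkpoint.

```python
def split_adapter_keys(keys, prefix: str = "llm_adapter."):
    """Partition parameter names into adapter keys and all remaining keys."""
    adapter = [k for k in keys if k.startswith(prefix)]
    other = [k for k in keys if not k.startswith(prefix)]
    return adapter, other


# Invented key names, for illustration only.
example_keys = [
    "llm_adapter.proj.weight",
    "llm_adapter.proj.bias",
    "blocks.0.attn.to_q.weight",
]

adapter_keys, other_keys = split_adapter_keys(example_keys)
print(f"{len(adapter_keys)} adapter keys, {len(other_keys)} other keys")
```

A non-empty adapter list means the checkpoint includes weights that need the full pipeline structure to be used.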

	The license follows the upstream Anima/CircleStone non-commercial license and the NVIDIA Cosmos derivative terms referenced by the upstream model card.