	# Quanto

	[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/en/index). It has been designed with versatility and simplicity in mind:

	- All features are available in eager mode (works with non-traceable models)
	- Supports quantization aware training
	- Quantized models are compatible with `torch.compile`
	- Quantized models are device agnostic (e.g. CUDA, XPU, MPS, CPU)

	To use the Quanto backend, first install `optimum-quanto>=0.2.6` and `accelerate`.

	```shell
	pip install "optimum-quanto>=0.2.6" accelerate
	```
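
To confirm that the installed release satisfies the `0.2.6` minimum, a small check along these lines can help. This is a minimal sketch: `version_tuple` and `meets_minimum` are hypothetical helper names, and the comparison only looks at the leading numeric components of the version string, so it is not a full version parser.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text: str) -> tuple:
    """Convert '0.2.6' to (0, 2, 6); stop at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.2.6") -> bool:
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    installed = version("optimum-quanto")
    print(f"optimum-quanto {installed}, meets minimum: {meets_minimum(installed)}")
except PackageNotFoundError:
    print("optimum-quanto is not installed")
```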

	Now you can quantize a model by passing the `QuantoConfig` object to the `from_pretrained()` method. Although the Quanto library does allow quantizing `nn.Conv2d` and `nn.LayerNorm` modules, currently, Diffusers only supports quantizing the weights in the `nn.Linear` layers of a model. The following snippet demonstrates how to apply `float8` quantization with Quanto.

	```python
	import torch
	from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig

	model_id = "black-forest-labs/FLUX.1-dev"
	torch_dtype = torch.bfloat16

	# Quantize the nn.Linear weights of the transformer to float8 while loading.
	quantization_config = QuantoConfig(weights_dtype="float8")
	transformer = FluxTransformer2DModel.from_pretrained(
	    model_id,
	    subfolder="transformer",
	    quantization_config=quantization_config,
	    torch_dtype=torch_dtype,
	)

	# Build the pipeline around the quantized transformer.
	pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype)
	pipe.to("cuda")

	prompt = "A cat holding a sign that says hello world"
	image = pipe(
	    prompt, num_inference_steps=50, guidance_scale=4.5, max_sequence_length=512
	).images[0]
	image.save("output.png")
	```

	## Skipping Quantization on specific modules

	To skip quantization on certain modules, pass their names to the `modules_to_not_convert` argument of `QuantoConfig`. Make sure the names passed to this argument match the keys of the modules in the `state_dict`.

	```python
	import torch
	from diffusers import FluxTransformer2DModel, QuantoConfig

	model_id = "black-forest-labs/FLUX.1-dev"
	quantization_config = QuantoConfig(weights_dtype="float8", modules_to_not_convert=["proj_out"])
	transformer = FluxTransformer2DModel.from_pretrained(
	    model_id,
	    subfolder="transformer",
	    quantization_config=quantization_config,
	    torch_dtype=torch.bfloat16,
	)
	```
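
One way to find candidate names for `modules_to_not_convert` is to list the `nn.Linear` submodules of a model and compare them with its `state_dict` keys. The sketch below uses a toy module as a stand-in for a real transformer; `linear_module_names` and `ToyBlock` are hypothetical names introduced only for illustration.

```python
import torch.nn as nn


def linear_module_names(model: nn.Module) -> list:
    """Return the dotted names of all nn.Linear submodules."""
    return [name for name, module in model.named_modules() if isinstance(module, nn.Linear)]


# Toy stand-in for a real transformer (hypothetical, for illustration only).
class ToyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.norm = nn.LayerNorm(8)
        self.attn = nn.Linear(8, 8)
        self.proj_out = nn.Linear(8, 4)


toy = ToyBlock()
names = linear_module_names(toy)
print(names)

# Each Linear name is the prefix of a "<name>.weight" key in the state_dict.
print([key for key in toy.state_dict() if key.startswith("proj_out.")])
```

The same helper can be called on a loaded model, such as the `transformer` above, to inspect its layer names before building the `QuantoConfig`.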

	## Using `from_single_file` with the Quanto Backend

	`QuantoConfig` is compatible with `~FromOriginalModelMixin.from_single_file`.

	```python
	import torch
	from diffusers import FluxTransformer2DModel, QuantoConfig

	ckpt_path = "https://huggingface.co/black-forest-labs/FLUX.1-dev/blob/main/flux1-dev.safetensors"
	quantization_config = QuantoConfig(weights_dtype="float8")
	transformer = FluxTransformer2DModel.from_single_file(
	    ckpt_path,
	    quantization_config=quantization_config,
	    torch_dtype=torch.bfloat16,
	)
	```

	## Saving Quantized models

	Diffusers supports serializing Quanto models using the `~ModelMixin.save_pretrained` method.

	The serialization and loading requirements differ between models quantized directly with the Quanto library and models quantized with Diffusers using Quanto as the backend. It is currently not possible to load models quantized directly with Quanto into Diffusers using `~ModelMixin.from_pretrained`.

	```python
	import torch
	from diffusers import FluxTransformer2DModel, QuantoConfig

	model_id = "black-forest-labs/FLUX.1-dev"
	quantization_config = QuantoConfig(weights_dtype="float8")
	transformer = FluxTransformer2DModel.from_pretrained(
	    model_id,
	    subfolder="transformer",
	    quantization_config=quantization_config,
	    torch_dtype=torch.bfloat16,
	)

	# Placeholder only: replace with your own save directory.
	save_path = "<your quantized model save path>"

	# Save the quantized model for reuse.
	transformer.save_pretrained(save_path)

	# Reload the quantized model from the same directory.
	model = FluxTransformer2DModel.from_pretrained(save_path)
	```

	## Using `torch.compile` with Quanto

	Currently the Quanto backend supports `torch.compile` for the following quantization types:

	- `int8` weights

	```python
	import torch
	from diffusers import FluxPipeline, FluxTransformer2DModel, QuantoConfig

	model_id = "black-forest-labs/FLUX.1-dev"
	torch_dtype = torch.bfloat16

	quantization_config = QuantoConfig(weights_dtype="int8")
	transformer = FluxTransformer2DModel.from_pretrained(
	    model_id,
	    subfolder="transformer",
	    quantization_config=quantization_config,
	    torch_dtype=torch_dtype,
	)
	# Compile the quantized transformer before handing it to the pipeline.
	transformer = torch.compile(transformer, mode="max-autotune", fullgraph=True)

	pipe = FluxPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch_dtype)
	pipe.to("cuda")
	image = pipe("A cat holding a sign that says hello").images[0]
	image.save("flux-quanto-compile.png")
	```

	## Supported Quantization Types

	### Weights

	- `float8`
	- `int8`
	- `int4`
	- `int2`
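
To get a rough sense of what each weight type saves, the storage for the quantized weights can be estimated from the bit width alone. This is a back-of-the-envelope sketch: the parameter count is a hypothetical figure, and the estimate ignores quantization scales, other metadata, and any layers that stay unquantized.

```python
# Bits used to store one weight for each supported weights_dtype.
BITS_PER_WEIGHT = {"float8": 8, "int8": 8, "int4": 4, "int2": 2}
BASELINE_BITS = 16  # bfloat16, the torch_dtype used in the examples


def weight_gib(num_weights: int, bits: int) -> float:
    """Storage in GiB for num_weights values at the given bit width."""
    return num_weights * bits / 8 / 1024**3


num_weights = 1_000_000_000  # hypothetical number of nn.Linear weights
baseline = weight_gib(num_weights, BASELINE_BITS)
for dtype, bits in BITS_PER_WEIGHT.items():
    print(f"{dtype}: {weight_gib(num_weights, bits):.2f} GiB (bfloat16: {baseline:.2f} GiB)")
```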
