# Caching methods

Cache methods speed up diffusion transformers by storing and reusing the intermediate outputs of specific layers, such as attention and feedforward layers, instead of recalculating them at each inference step.
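In practice, a caching technique is enabled by constructing a configuration object and attaching it to the pipeline's denoiser (a model that implements `CacheMixin`), as the per-method examples below show. A minimal sketch mirroring the `enable_cache` example further down; the CogVideoX checkpoint and parameter values are illustrative, and `disable_cache` is assumed to be the `CacheMixin` counterpart for turning caching off again:

```python
>>> import torch
>>> from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig

>>> pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # Build a caching config and attach it to the denoiser, which implements CacheMixin
>>> config = PyramidAttentionBroadcastConfig(
...     spatial_attention_block_skip_range=2,
...     spatial_attention_timestep_skip_range=(100, 800),
...     current_timestep_callback=lambda: pipe.current_timestep,
... )
>>> pipe.transformer.enable_cache(config)

>>> video = pipe("A cat playing with a ball of yarn").frames[0]

>>> # Assumed counterpart for removing the caching hooks again
>>> pipe.transformer.disable_cache()
```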
## CacheMixin[[diffusers.CacheMixin]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.CacheMixin</name><anchor>diffusers.CacheMixin</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/cache_utils.py#L23</source><parameters>[]</parameters></docstring>

A class for enabling/disabling caching techniques on diffusion models.

Supported caching techniques:

- [Pyramid Attention Broadcast](https://huggingface.co/papers/2408.12588)
- [FasterCache](https://huggingface.co/papers/2410.19355)
- [FirstBlockCache](https://github.com/chengzeyi/ParaAttention/blob/7a266123671b55e7e5a2fe9af3121f07a36afc78/README.md#first-block-cache-our-dynamic-caching)
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>cache_context</name><anchor>diffusers.CacheMixin.cache_context</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/cache_utils.py#L120</source><parameters>[{"name": "name", "val": ": str"}]</parameters></docstring>

Context manager that provides additional methods for cache management.
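A hedged sketch of how `cache_context` might be used when driving the denoiser manually: the variable names (`latents`, `prompt_embeds`, `negative_prompt_embeds`, `t`) are hypothetical stand-ins for tensors prepared by a custom sampling loop, and the context names are illustrative labels that keep the cached states of the conditional and unconditional passes separate:

```python
>>> # Hypothetical custom denoising step; a cache was enabled earlier via enable_cache()
>>> with pipe.transformer.cache_context("cond"):
...     noise_pred = pipe.transformer(
...         hidden_states=latents, encoder_hidden_states=prompt_embeds, timestep=t, return_dict=False
...     )[0]
>>> with pipe.transformer.cache_context("uncond"):
...     noise_pred_uncond = pipe.transformer(
...         hidden_states=latents, encoder_hidden_states=negative_prompt_embeds, timestep=t, return_dict=False
...     )[0]
```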
</div>

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>enable_cache</name><anchor>diffusers.CacheMixin.enable_cache</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/models/cache_utils.py#L39</source><parameters>[{"name": "config", "val": ""}]</parameters><paramsdesc>- **config** (`Union[PyramidAttentionBroadcastConfig]`) --
The configuration for applying the caching technique. Currently supported caching techniques are:
- [PyramidAttentionBroadcastConfig](/docs/diffusers/pr_12595/en/api/cache#diffusers.PyramidAttentionBroadcastConfig)</paramsdesc><paramgroups>0</paramgroups></docstring>

Enable caching techniques on the model.
<ExampleCodeBlock anchor="diffusers.CacheMixin.enable_cache.example">

Example:

```python
>>> import torch
>>> from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig
>>> pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> config = PyramidAttentionBroadcastConfig(
...     spatial_attention_block_skip_range=2,
...     spatial_attention_timestep_skip_range=(100, 800),
...     current_timestep_callback=lambda: pipe.current_timestep,
... )
>>> pipe.transformer.enable_cache(config)
```

</ExampleCodeBlock>
</div></div>
## PyramidAttentionBroadcastConfig[[diffusers.PyramidAttentionBroadcastConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.PyramidAttentionBroadcastConfig</name><anchor>diffusers.PyramidAttentionBroadcastConfig</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/pyramid_attention_broadcast.py#L40</source><parameters>[{"name": "spatial_attention_block_skip_range", "val": ": typing.Optional[int] = None"}, {"name": "temporal_attention_block_skip_range", "val": ": typing.Optional[int] = None"}, {"name": "cross_attention_block_skip_range", "val": ": typing.Optional[int] = None"}, {"name": "spatial_attention_timestep_skip_range", "val": ": typing.Tuple[int, int] = (100, 800)"}, {"name": "temporal_attention_timestep_skip_range", "val": ": typing.Tuple[int, int] = (100, 800)"}, {"name": "cross_attention_timestep_skip_range", "val": ": typing.Tuple[int, int] = (100, 800)"}, {"name": "spatial_attention_block_identifiers", "val": ": typing.Tuple[str, ...] = ('blocks', 'transformer_blocks', 'single_transformer_blocks', 'layers')"}, {"name": "temporal_attention_block_identifiers", "val": ": typing.Tuple[str, ...] = ('temporal_transformer_blocks',)"}, {"name": "cross_attention_block_identifiers", "val": ": typing.Tuple[str, ...] = ('blocks', 'transformer_blocks', 'layers')"}, {"name": "current_timestep_callback", "val": ": typing.Callable[[], int] = None"}]</parameters><paramsdesc>- **spatial_attention_block_skip_range** (`int`, *optional*, defaults to `None`) --
The number of times a specific spatial attention broadcast is skipped before computing the attention states
to re-use. If this is set to the value `N`, the attention computation will be skipped `N - 1` times (i.e.,
old attention states will be reused) before computing the new attention states again.
- **temporal_attention_block_skip_range** (`int`, *optional*, defaults to `None`) --
The number of times a specific temporal attention broadcast is skipped before computing the attention
states to re-use. If this is set to the value `N`, the attention computation will be skipped `N - 1` times
(i.e., old attention states will be reused) before computing the new attention states again.
- **cross_attention_block_skip_range** (`int`, *optional*, defaults to `None`) --
The number of times a specific cross-attention broadcast is skipped before computing the attention states
to re-use. If this is set to the value `N`, the attention computation will be skipped `N - 1` times (i.e.,
old attention states will be reused) before computing the new attention states again.
- **spatial_attention_timestep_skip_range** (`Tuple[int, int]`, defaults to `(100, 800)`) --
The range of timesteps to skip in the spatial attention layer. The attention computations will be
conditionally skipped if the current timestep is within the specified range.
- **temporal_attention_timestep_skip_range** (`Tuple[int, int]`, defaults to `(100, 800)`) --
The range of timesteps to skip in the temporal attention layer. The attention computations will be
conditionally skipped if the current timestep is within the specified range.
- **cross_attention_timestep_skip_range** (`Tuple[int, int]`, defaults to `(100, 800)`) --
The range of timesteps to skip in the cross-attention layer. The attention computations will be
conditionally skipped if the current timestep is within the specified range.
- **spatial_attention_block_identifiers** (`Tuple[str, ...]`) --
The identifiers to match against the layer names to determine if the layer is a spatial attention layer.
- **temporal_attention_block_identifiers** (`Tuple[str, ...]`) --
The identifiers to match against the layer names to determine if the layer is a temporal attention layer.
- **cross_attention_block_identifiers** (`Tuple[str, ...]`) --
The identifiers to match against the layer names to determine if the layer is a cross-attention layer.</paramsdesc><paramgroups>0</paramgroups></docstring>

Configuration for Pyramid Attention Broadcast.
</div>
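Skipping is not limited to spatial attention: the identifiers and skip ranges above also cover temporal and cross-attention layers. A hedged sketch combining spatial and cross-attention skipping; the skip values are illustrative rather than recommendations from the paper, and `pipe` is the CogVideoX pipeline from the surrounding examples:

```python
>>> from diffusers import PyramidAttentionBroadcastConfig

>>> # Illustrative values: cross-attention states tend to be the most similar across
>>> # timesteps, so they can be re-used for longer than spatial attention states.
>>> config = PyramidAttentionBroadcastConfig(
...     spatial_attention_block_skip_range=2,
...     cross_attention_block_skip_range=6,
...     spatial_attention_timestep_skip_range=(100, 800),
...     cross_attention_timestep_skip_range=(100, 800),
...     current_timestep_callback=lambda: pipe.current_timestep,
... )
>>> pipe.transformer.enable_cache(config)
```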
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>diffusers.apply_pyramid_attention_broadcast</name><anchor>diffusers.apply_pyramid_attention_broadcast</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/pyramid_attention_broadcast.py#L181</source><parameters>[{"name": "module", "val": ": Module"}, {"name": "config", "val": ": PyramidAttentionBroadcastConfig"}]</parameters><paramsdesc>- **module** (`torch.nn.Module`) --
The module to apply Pyramid Attention Broadcast to.
- **config** (`PyramidAttentionBroadcastConfig`) --
The configuration to use for Pyramid Attention Broadcast.</paramsdesc><paramgroups>0</paramgroups></docstring>

Apply [Pyramid Attention Broadcast](https://huggingface.co/papers/2408.12588) to a given module.

PAB is an attention approximation method that leverages the similarity in attention states between timesteps to
reduce the computational cost of attention computation. The key takeaway from the paper is that the attention
similarity in the cross-attention layers between timesteps is high, followed by less similarity in the temporal and
spatial layers. This allows for the skipping of attention computation in the cross-attention layers more frequently
than in the temporal and spatial layers. Applying PAB will, therefore, speed up the inference process.
<ExampleCodeBlock anchor="diffusers.apply_pyramid_attention_broadcast.example">

Example:

```python
>>> import torch
>>> from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig, apply_pyramid_attention_broadcast
>>> from diffusers.utils import export_to_video
>>> pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> config = PyramidAttentionBroadcastConfig(
...     spatial_attention_block_skip_range=2,
...     spatial_attention_timestep_skip_range=(100, 800),
...     current_timestep_callback=lambda: pipe.current_timestep,
... )
>>> apply_pyramid_attention_broadcast(pipe.transformer, config)
```

</ExampleCodeBlock>
</div>
## FasterCacheConfig[[diffusers.FasterCacheConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.FasterCacheConfig</name><anchor>diffusers.FasterCacheConfig</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/faster_cache.py#L50</source><parameters>[{"name": "spatial_attention_block_skip_range", "val": ": int = 2"}, {"name": "temporal_attention_block_skip_range", "val": ": typing.Optional[int] = None"}, {"name": "spatial_attention_timestep_skip_range", "val": ": typing.Tuple[int, int] = (-1, 681)"}, {"name": "temporal_attention_timestep_skip_range", "val": ": typing.Tuple[int, int] = (-1, 681)"}, {"name": "low_frequency_weight_update_timestep_range", "val": ": typing.Tuple[int, int] = (99, 901)"}, {"name": "high_frequency_weight_update_timestep_range", "val": ": typing.Tuple[int, int] = (-1, 301)"}, {"name": "alpha_low_frequency", "val": ": float = 1.1"}, {"name": "alpha_high_frequency", "val": ": float = 1.1"}, {"name": "unconditional_batch_skip_range", "val": ": int = 5"}, {"name": "unconditional_batch_timestep_skip_range", "val": ": typing.Tuple[int, int] = (-1, 641)"}, {"name": "spatial_attention_block_identifiers", "val": ": typing.Tuple[str, ...] = ('^blocks.*attn', '^transformer_blocks.*attn', '^single_transformer_blocks.*attn')"}, {"name": "temporal_attention_block_identifiers", "val": ": typing.Tuple[str, ...] = ('^temporal_transformer_blocks.*attn',)"}, {"name": "attention_weight_callback", "val": ": typing.Callable[[torch.nn.modules.module.Module], float] = None"}, {"name": "low_frequency_weight_callback", "val": ": typing.Callable[[torch.nn.modules.module.Module], float] = None"}, {"name": "high_frequency_weight_callback", "val": ": typing.Callable[[torch.nn.modules.module.Module], float] = None"}, {"name": "tensor_format", "val": ": str = 'BCFHW'"}, {"name": "is_guidance_distilled", "val": ": bool = False"}, {"name": "current_timestep_callback", "val": ": typing.Callable[[], int] = None"}, {"name": "_unconditional_conditional_input_kwargs_identifiers", "val": ": typing.List[str] = ('hidden_states', 'encoder_hidden_states', 'timestep', 'attention_mask', 'encoder_attention_mask')"}]</parameters><paramsdesc>- **spatial_attention_block_skip_range** (`int`, defaults to `2`) --
Calculate the attention states every `N` iterations. If this is set to `N`, the attention computation will
be skipped `N - 1` times (i.e., cached attention states will be reused) before computing the new attention
states again.
- **temporal_attention_block_skip_range** (`int`, *optional*, defaults to `None`) --
Calculate the attention states every `N` iterations. If this is set to `N`, the attention computation will
be skipped `N - 1` times (i.e., cached attention states will be reused) before computing the new attention
states again.
- **spatial_attention_timestep_skip_range** (`Tuple[int, int]`, defaults to `(-1, 681)`) --
The timestep range within which the spatial attention computation can be skipped without a significant loss
in quality. This is to be determined by the user based on the underlying model. The first value in the
tuple is the lower bound and the second value is the upper bound. Typically, diffusion timesteps for
denoising are in the reversed range of 0 to 1000 (i.e. denoising starts at timestep 1000 and ends at
timestep 0). For the default values, this would mean that the spatial attention computation skipping will
be applicable only after denoising timestep 681 is reached, and continue until the end of the denoising
process.
- **temporal_attention_timestep_skip_range** (`Tuple[int, int]`, defaults to `(-1, 681)`) --
The timestep range within which the temporal attention computation can be skipped without a significant
loss in quality. This is to be determined by the user based on the underlying model. The first value in the
tuple is the lower bound and the second value is the upper bound. Typically, diffusion timesteps for
denoising are in the reversed range of 0 to 1000 (i.e. denoising starts at timestep 1000 and ends at
timestep 0).
- **low_frequency_weight_update_timestep_range** (`Tuple[int, int]`, defaults to `(99, 901)`) --
The timestep range within which the low frequency weight scaling update is applied. The first value in the
tuple is the lower bound and the second value is the upper bound of the timestep range. The callback
function for the update is called only within this range.
- **high_frequency_weight_update_timestep_range** (`Tuple[int, int]`, defaults to `(-1, 301)`) --
The timestep range within which the high frequency weight scaling update is applied. The first value in the
tuple is the lower bound and the second value is the upper bound of the timestep range. The callback
function for the update is called only within this range.
- **alpha_low_frequency** (`float`, defaults to `1.1`) --
The weight to scale the low frequency updates by. This is used to approximate the unconditional branch from
the conditional branch outputs.
- **alpha_high_frequency** (`float`, defaults to `1.1`) --
The weight to scale the high frequency updates by. This is used to approximate the unconditional branch
from the conditional branch outputs.
- **unconditional_batch_skip_range** (`int`, defaults to `5`) --
Process the unconditional branch every `N` iterations. If this is set to `N`, the unconditional branch
computation will be skipped `N - 1` times (i.e., cached unconditional branch states will be reused) before
computing the new unconditional branch states again.
- **unconditional_batch_timestep_skip_range** (`Tuple[int, int]`, defaults to `(-1, 641)`) --
The timestep range within which the unconditional branch computation can be skipped without a significant
loss in quality. This is to be determined by the user based on the underlying model. The first value in the
tuple is the lower bound and the second value is the upper bound.
- **spatial_attention_block_identifiers** (`Tuple[str, ...]`, defaults to `("^blocks.*attn", "^transformer_blocks.*attn", "^single_transformer_blocks.*attn")`) --
The identifiers to match the spatial attention blocks in the model. If the name of the block contains any
of these identifiers, FasterCache will be applied to that block. This can either be the full layer names,
partial layer names, or regex patterns. Matching will always be done using a regex match.
- **temporal_attention_block_identifiers** (`Tuple[str, ...]`, defaults to `("^temporal_transformer_blocks.*attn",)`) --
The identifiers to match the temporal attention blocks in the model. If the name of the block contains any
of these identifiers, FasterCache will be applied to that block. This can either be the full layer names,
partial layer names, or regex patterns. Matching will always be done using a regex match.
- **attention_weight_callback** (`Callable[[torch.nn.Module], float]`, defaults to `None`) --
The callback function to determine the weight to scale the attention outputs by. This function should take
the attention module as input and return a float value. This is used to approximate the unconditional
branch from the conditional branch outputs. If not provided, the default weight is 0.5 for all timesteps.
Typically, as described in the paper, this weight should gradually increase from 0 to 1 as the inference
progresses. Users are encouraged to experiment and provide custom weight schedules that take into account
the number of inference steps and underlying model behaviour as denoising progresses.
- **low_frequency_weight_callback** (`Callable[[torch.nn.Module], float]`, defaults to `None`) --
The callback function to determine the weight to scale the low frequency updates by. If not provided, the
default weight is 1.1 for timesteps within the range specified (as described in the paper).
- **high_frequency_weight_callback** (`Callable[[torch.nn.Module], float]`, defaults to `None`) --
The callback function to determine the weight to scale the high frequency updates by. If not provided, the
default weight is 1.1 for timesteps within the range specified (as described in the paper).
- **tensor_format** (`str`, defaults to `"BCFHW"`) --
The format of the input tensors. This should be one of `"BCFHW"`, `"BFCHW"`, or `"BCHW"`. The format is
used to split individual latent frames in order for low and high frequency components to be computed.
- **is_guidance_distilled** (`bool`, defaults to `False`) --
Whether the model is guidance distilled or not. If the model is guidance distilled, FasterCache will not be
applied at the denoiser-level to skip the unconditional branch computation (as there is none).
- **_unconditional_conditional_input_kwargs_identifiers** (`List[str]`, defaults to `("hidden_states", "encoder_hidden_states", "timestep", "attention_mask", "encoder_attention_mask")`) --
The identifiers to match the input kwargs that contain the batchwise-concatenated unconditional and
conditional inputs. If the name of the input kwargs contains any of these identifiers, FasterCache will
split the inputs into unconditional and conditional branches. This must be a list of exact input kwargs
names that contain the batchwise-concatenated unconditional and conditional inputs.</paramsdesc><paramgroups>0</paramgroups></docstring>
Configuration for [FasterCache](https://huggingface.co/papers/2410.19355).
</div>
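The `attention_weight_callback` described above receives the attention module and returns a float; the paper suggests letting this weight grow from 0 to 1 as denoising progresses. A hedged sketch of one way to build such a schedule by closing over the pipeline's current timestep; the linear ramp and the 1000-timestep assumption are illustrative and should be adapted to the model's scheduler:

```python
>>> import torch
>>> from diffusers import CogVideoXPipeline, FasterCacheConfig, apply_faster_cache

>>> pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> def attention_weight(module):
...     # Illustrative ramp: timestep ~1000 at the start of denoising -> weight ~0,
...     # timestep ~0 at the end of denoising -> weight ~1.
...     return 1.0 - pipe.current_timestep / 1000.0

>>> config = FasterCacheConfig(
...     spatial_attention_block_skip_range=2,
...     spatial_attention_timestep_skip_range=(-1, 681),
...     attention_weight_callback=attention_weight,
...     current_timestep_callback=lambda: pipe.current_timestep,
...     tensor_format="BFCHW",
... )
>>> apply_faster_cache(pipe.transformer, config)
```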
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>diffusers.apply_faster_cache</name><anchor>diffusers.apply_faster_cache</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/faster_cache.py#L486</source><parameters>[{"name": "module", "val": ": Module"}, {"name": "config", "val": ": FasterCacheConfig"}]</parameters><paramsdesc>- **module** (`torch.nn.Module`) --
The PyTorch module to apply FasterCache to. Typically, this should be a transformer architecture supported
in Diffusers, such as `CogVideoXTransformer3DModel`, but external implementations may also work.
- **config** (`FasterCacheConfig`) --
The configuration to use for FasterCache.</paramsdesc><paramgroups>0</paramgroups></docstring>

Applies [FasterCache](https://huggingface.co/papers/2410.19355) to a given module.
<ExampleCodeBlock anchor="diffusers.apply_faster_cache.example">

Example:

```python
>>> import torch
>>> from diffusers import CogVideoXPipeline, FasterCacheConfig, apply_faster_cache
>>> pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> config = FasterCacheConfig(
...     spatial_attention_block_skip_range=2,
...     spatial_attention_timestep_skip_range=(-1, 681),
...     low_frequency_weight_update_timestep_range=(99, 641),
...     high_frequency_weight_update_timestep_range=(-1, 301),
...     spatial_attention_block_identifiers=["transformer_blocks"],
...     attention_weight_callback=lambda _: 0.3,
...     tensor_format="BFCHW",
... )
>>> apply_faster_cache(pipe.transformer, config)
```

</ExampleCodeBlock>
</div>
## FirstBlockCacheConfig[[diffusers.FirstBlockCacheConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>class diffusers.FirstBlockCacheConfig</name><anchor>diffusers.FirstBlockCacheConfig</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/first_block_cache.py#L34</source><parameters>[{"name": "threshold", "val": ": float = 0.05"}]</parameters><paramsdesc>- **threshold** (`float`, defaults to `0.05`) --
The threshold to determine whether or not a forward pass through all layers of the model is required. A
higher threshold usually results in a forward pass through a lower number of layers and faster inference,
but might lead to poorer generation quality. A lower threshold may not result in significant generation
speedup. The threshold is compared against the absmean difference of the residuals between the current and
cached outputs from the first transformer block. If the difference is below the threshold, the forward pass
is skipped.</paramsdesc><paramgroups>0</paramgroups></docstring>

Configuration for [First Block
Cache](https://github.com/chengzeyi/ParaAttention/blob/7a266123671b55e7e5a2fe9af3121f07a36afc78/README.md#first-block-cache-our-dynamic-caching).
</div>
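FirstBlockCache is also listed among the techniques supported by `CacheMixin`, so the configuration can presumably be passed to `enable_cache` instead of calling the helper function below directly. A hedged sketch under that assumption, reusing the CogView4 setup from the following example; a lower threshold trades some of the speedup for outputs closer to the uncached baseline:

```python
>>> import torch
>>> from diffusers import CogView4Pipeline
>>> from diffusers.hooks import FirstBlockCacheConfig

>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # Assumption: CacheMixin.enable_cache dispatches on the config type
>>> pipe.transformer.enable_cache(FirstBlockCacheConfig(threshold=0.05))
```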
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">
<docstring><name>diffusers.apply_first_block_cache</name><anchor>diffusers.apply_first_block_cache</anchor><source>https://github.com/huggingface/diffusers/blob/vr_12595/src/diffusers/hooks/first_block_cache.py#L194</source><parameters>[{"name": "module", "val": ": Module"}, {"name": "config", "val": ": FirstBlockCacheConfig"}]</parameters><paramsdesc>- **module** (`torch.nn.Module`) --
The PyTorch module to apply FBCache to. Typically, this should be a transformer architecture supported in
Diffusers, such as `CogVideoXTransformer3DModel`, but external implementations may also work.
- **config** (`FirstBlockCacheConfig`) --
The configuration to use for applying the FBCache method.</paramsdesc><paramgroups>0</paramgroups></docstring>

Applies [First Block
Cache](https://github.com/chengzeyi/ParaAttention/blob/4de137c5b96416489f06e43e19f2c14a772e28fd/README.md#first-block-cache-our-dynamic-caching)
to a given module.

First Block Cache builds on the ideas of [TeaCache](https://huggingface.co/papers/2411.19108). It is much simpler
to implement generically for a wide range of models and has been integrated first for experimental purposes.
<ExampleCodeBlock anchor="diffusers.apply_first_block_cache.example">

Example:

```python
>>> import torch
>>> from diffusers import CogView4Pipeline
>>> from diffusers.hooks import apply_first_block_cache, FirstBlockCacheConfig
>>> pipe = CogView4Pipeline.from_pretrained("THUDM/CogView4-6B", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")
>>> apply_first_block_cache(pipe.transformer, FirstBlockCacheConfig(threshold=0.2))
>>> prompt = "A photo of an astronaut riding a horse on mars"
>>> image = pipe(prompt, generator=torch.Generator().manual_seed(42)).images[0]
>>> image.save("output.png")
```

</ExampleCodeBlock>
</div>
<EditOnGithub source="https://github.com/huggingface/diffusers/blob/main/docs/source/en/api/cache.md" />