# Continuous batching

This page documents the classes behind continuous batching inference: submitting prompts, configuring scheduling and memory limits, and retrieving results.

For usage examples, see the [Continuous batching](../continuous_batching) guide and for how scheduling and memory interact, see the [Continuous batching architecture](../continuous_batching_architecture) doc.

## ContinuousMixin.generate_batch[[transformers.ContinuousMixin.generate_batch]]

#### transformers.ContinuousMixin.generate_batch[[transformers.ContinuousMixin.generate_batch]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L1264)

Generate sequences for a batch of prompts using continuous batching.

**Parameters:**

inputs : List of input token sequences (prompts)

generation_config : Optional generation configuration

continuous_batching_config : Optional continuous batching configuration

record_timestamps : If set to `True`, each request records a timestamp for every generated token

progress_bar : If set to `True`, a progress bar is displayed during generation

persistent_manager : Whether to persist the manager after generation is finished. Default is `False`.

warmup : Whether to pre-capture CUDA graphs before processing requests. Default is `True`.

**kwargs : Additional generation parameters. Only `max_new_tokens` is used; other deprecated arguments are extracted and passed to the `continuous_batching_config` object.

**Returns:**

``dict[str, GenerationOutput]``

A dictionary mapping request IDs to `GenerationOutput` objects
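
A minimal usage sketch, assuming a causal LM and its tokenizer are already loaded; the checkpoint name and the `generated_tokens` attribute on the returned objects are illustrative assumptions, not confirmed API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="cuda")

# Each prompt is passed as a plain list of token ids
prompts = ["What is continuous batching?", "Write a haiku about GPUs."]
inputs = [tokenizer(p).input_ids for p in prompts]

# Returns a dict mapping request ids to GenerationOutput objects
outputs = model.generate_batch(inputs=inputs, max_new_tokens=64)
for request_id, output in outputs.items():
    # `generated_tokens` is assumed to hold the new token ids for this request
    print(request_id, tokenizer.decode(output.generated_tokens))
```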

## ContinuousBatchingManager[[transformers.ContinuousBatchingManager]]

#### transformers.ContinuousBatchingManager[[transformers.ContinuousBatchingManager]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L729)

Manager for handling continuous batching of generation requests. It provides a user interface for submitting
generation requests, retrieving results, and managing the background generation thread. This class should not be
created directly, but through one of the following entry points (all methods of the `ContinuousMixin` mixin):
- `init_continuous_batching`
- `continuous_batching_context_manager`
- `generate_batch`
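
A rough lifecycle sketch using the `init_continuous_batching` entry point; it assumes a `model` and `tokenizer` loaded as in the `generate_batch` example above, and the `generated_tokens` attribute of the result is an assumption.

```python
manager = model.init_continuous_batching()  # do not construct ContinuousBatchingManager directly
manager.start()  # launch the background generation thread
try:
    request_id = manager.add_request(tokenizer("Hello there!").input_ids, max_new_tokens=32)
    result = manager.get_result(request_id=request_id, timeout=120)
    if result is not None:
        print(tokenizer.decode(result.generated_tokens))
finally:
    manager.stop(block=True)  # signal the thread to stop and wait for it to finish
```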

#### add_request[[transformers.ContinuousBatchingManager.add_request]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L876)

Add a new generation request to the queue.

**Parameters:**

input_ids : Input token IDs to use as prompt

request_id : Optional custom request ID (auto-generated if None)

max_new_tokens : Maximum number of new tokens to generate

streaming : Whether to stream tokens as they're generated

record_timestamps : Whether to record timestamps for each generated token

eos_token_id : End-of-sequence token ID(s)

logit_processor_kwargs : Keyword arguments for the logits processor.

**Returns:**

`str`

The request ID
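
A short sketch of submitting several requests with explicit IDs, assuming a running `manager` and a `tokenizer` as in the lifecycle example above.

```python
prompts = {
    "req-short": "Define a KV cache in one sentence.",
    "req-long": "Explain paged attention in detail.",
}
for request_id, prompt in prompts.items():
    returned_id = manager.add_request(
        tokenizer(prompt).input_ids,
        request_id=request_id,   # custom ID; auto-generated when None
        max_new_tokens=128,
        streaming=False,
    )
    assert returned_id == request_id  # the returned value is the request ID
```
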
#### cancel_request[[transformers.ContinuousBatchingManager.cancel_request]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L958)

Cancel a request by its ID.

**Parameters:**

request_id : The ID of the request to cancel
#### get_result[[transformers.ContinuousBatchingManager.get_result]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L968)

Retrieve one result from the output queue.

**Parameters:**

request_id : If set, only return results matching this ID (others are requeued).

timeout : Maximum time to wait for a result.

**Returns:**

`Optional[GenerationOutput]`

The result data, or `None` if the timeout expires.
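
A polling sketch that drains results for the requests submitted in the `add_request` example above; the `request_id` and `generated_tokens` attributes of the returned objects are assumptions.

```python
pending = set(prompts)
while pending:
    output = manager.get_result(timeout=1.0)  # any available result, or None on timeout
    if output is None:
        continue
    print(output.request_id, tokenizer.decode(output.generated_tokens))
    pending.discard(output.request_id)
```
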
#### is_running[[transformers.ContinuousBatchingManager.is_running]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L808)

Check if the background generation thread is running.
#### join[[transformers.ContinuousBatchingManager.join]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L861)

Wait for the background thread to finish.

**Parameters:**

timeout : Maximum time to wait for the thread to stop
#### register_result_handler[[transformers.ContinuousBatchingManager.register_result_handler]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L1009)

Register a callback for result delivery (streaming or non-streaming).

The callback is invoked on the event loop via `call_soon_threadsafe`
each time a result is produced for this request. For streaming requests,
this happens on every token; for non-streaming, only on completion.

The handler is automatically cleaned up when the request finishes.

**Parameters:**

request_id (*str*) : The request ID to receive outputs for.

callback (*callable*) : Called with a `GenerationOutput` for each result.
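
A sketch of streaming one request into an asyncio coroutine via a result handler, assuming a running `manager`; the `generated_tokens` attribute and the `is_finished()` completion check are assumptions about the output object, not confirmed API.

```python
import asyncio
import uuid

async def stream_request(manager, tokenizer, prompt: str) -> str:
    queue: asyncio.Queue = asyncio.Queue()
    request_id = f"req-{uuid.uuid4().hex[:8]}"

    # Register first so no early results are missed; the callback is invoked on
    # this event loop via call_soon_threadsafe for every produced result.
    manager.register_result_handler(request_id, queue.put_nowait)
    manager.add_request(
        tokenizer(prompt).input_ids,
        request_id=request_id,
        streaming=True,
        max_new_tokens=64,
    )

    tokens = []
    while True:
        output = await queue.get()
        tokens.extend(output.generated_tokens)  # assumed attribute holding the new token ids
        if output.is_finished():  # hypothetical completion check; adapt to the actual output API
            break
    return tokenizer.decode(tokens)
```
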
#### request_id_iter[[transformers.ContinuousBatchingManager.request_id_iter]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L996)

Iterate over results matching a specific request id (blocking).

Uses the shared output queue with requeue. For high-concurrency serving,
use `register_result_handler` instead.
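
A blocking iteration sketch for a single streamed request, assuming a running `manager`; whether each yielded output carries only the newly generated token ids is an assumption.

```python
request_id = manager.add_request(
    tokenizer("Tell me a short story.").input_ids, streaming=True, max_new_tokens=64
)
tokens = []
# Blocks on the shared output queue, requeueing results that belong to other request ids
for output in manager.request_id_iter(request_id):
    tokens.extend(output.generated_tokens)  # assumed to hold the newly generated token ids
print(tokenizer.decode(tokens))
```
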
#### start[[transformers.ContinuousBatchingManager.start]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L798)

Start the background generation thread.
#### stop[[transformers.ContinuousBatchingManager.stop]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L821)

Signal the background thread to stop.

**Parameters:**

block : Whether to wait for the thread to stop

timeout : Maximum time to wait for the thread to stop

keep_for_next_session : Whether to cache this on the model for future use
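
A small sketch of a graceful shutdown that keeps the manager cached for a later session; the parameter names follow the list above.

```python
# Stop the background thread, waiting up to 30 seconds, and cache the manager
# on the model so a later session can reuse it.
manager.stop(block=True, timeout=30, keep_for_next_session=True)
```
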
#### warmup[[transformers.ContinuousBatchingManager.warmup]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/continuous_batching/continuous_api.py#L812)

Pre-capture CUDA graphs for varlen and decode paths by running dummy batches. Initializes the batch
processor if not already done.

## Continuous batching config[[transformers.ContinuousBatchingConfig]]

#### transformers.ContinuousBatchingConfig[[transformers.ContinuousBatchingConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/generation/configuration_utils.py#L1607)

Class that holds arguments relative to continuous batching, when using continuous batching through the
`generate_batch` method or the `continuous_batching_context_manager` context manager.

**Parameters:**

block_size (`int`, *optional*, defaults to 256) : Size of each KV cache block in tokens.

num_blocks (`int`, *optional*) : Number of blocks in the KV cache. Auto-inferred from GPU memory when `None`.

max_batch_tokens (`int`, *optional*) : Maximum number of tokens in a batch. Auto-inferred from GPU memory when `None`.

max_memory_percent (`float`, *optional*) : Maximum percentage of free GPU memory (after the model is loaded) to use for the KV cache. When `None`, resolved at runtime to 0.9 if there is no logit processing and 0.8 if there is, to leave headroom for vocabulary-sized temporary tensors.

max_blocks_per_request (`int`, *optional*) : Maximum blocks per request, used in the `flash_attn_with_kvcache` fast decode path to dimension the block table. Setting this to 0 disables the fast decode path. Default is None (auto-inferred).

allow_block_sharing (`bool`, *optional*, defaults to `True`) : Whether to allow block sharing for prefix caching. Block sharing can only be allowed, never forced, as some models do not support it. Disable if you have few short prompts but long generation lengths.

use_async_batching (`bool`, *optional*) : Whether to enable async double-buffering, which removes CPU overhead from the continuous batching loop at the cost of doubled VRAM usage. Auto-detected when `None`.

use_cuda_graph (`bool` or `tuple[bool, bool]`, *optional*) : Whether to enable CUDA graphs. This can be a tuple of booleans (one for the varlen path and one for the decode fast path), a boolean which will apply to both paths, or None (automatically inferred). After calling `decide_use_cuda_graphs`, the attribute will be a tuple of booleans. Default is None (automatically inferred).

q_padding_interval_size (`int`, *optional*, defaults to 0) : Query padding granularity in tokens for CUDA graphs. Uses a preset from `continuous_api.py` when set to 0.

kv_padding_interval_size (`int`, *optional*, defaults to 0) : KV padding granularity in tokens for CUDA graphs. Uses a preset from `continuous_api.py` when set to 0.

max_cached_graphs (`int`, *optional*, defaults to 0) : Maximum number of cached CUDA graphs. Uses a preset from `continuous_api.py` when set to 0.

varlen_compile_config (`CompileConfig`, *optional*) : CompileConfig for the varlen (prefill) path. Default is `None` (uses the `generation_config` fallback). The varlen path handles batches with varying query and KV lengths, often benefiting from `dynamic=True`.

decode_compile_config (`CompileConfig`, *optional*) : CompileConfig for the decode (fast) path. Default is `None` (uses the `generation_config` fallback). The decode path handles batches with no dynamic KV length, so static shapes are a better fit.

use_default_compile_configs (`bool`, *optional*, defaults to `False`) : If True, a default compile config will be used for paths that are not explicitly set.

scheduler_type (`str`, *optional*, defaults to `"fifo"`) : Scheduler type to use.

return_logprobs (`bool`, *optional*, defaults to `False`) : Whether to return log probabilities along with the generated tokens.

cpu_offload_space (`float`, *optional*, defaults to 0.0) : CPU swap space in GiB for KV cache offloading. A pre-allocated pinned CPU buffer of this size is created at initialization. When the GPU cache is full, evicted requests' KV caches are copied here instead of being discarded. 0 disables offloading (default).

cpu_offload_space_safety_threshold (`float`, *optional*, defaults to 0.8) : If `cpu_offload_space` exceeds this fraction of total system RAM, it is clamped to avoid host OOM. Set to 1.0 to disable the safety cap. Ignored when psutil is not available.

max_queue_size (`int`, *optional*, defaults to 0) : Maximum request queue size for serving. 0 means unlimited.

per_request_processors (`bool`, *optional*, defaults to `False`) : Enable per-request logits processor parameters. Default is False.

drop_unsupported_processors (`bool`, *optional*, defaults to `True`) : Remove unsupported logits processors instead of erroring. Default is True.
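
A configuration sketch combining a few of the fields above with `generate_batch`; importing `ContinuousBatchingConfig` from the top-level `transformers` namespace is assumed, as are the `model` and `inputs` objects from the earlier examples.

```python
from transformers import ContinuousBatchingConfig

cb_config = ContinuousBatchingConfig(
    block_size=256,            # tokens per KV cache block
    max_memory_percent=0.8,    # cap the KV cache at 80% of free GPU memory
    allow_block_sharing=True,  # allow prefix caching where the model supports it
    scheduler_type="fifo",
    return_logprobs=False,
)

outputs = model.generate_batch(
    inputs=inputs,
    continuous_batching_config=cb_config,
    max_new_tokens=64,
)
```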

