Transformers documentation
Serve CLI
The transformers serve CLI is a lightweight option for local or self-hosted servers. It avoids the extra runtime and operational overhead of dedicated inference engines like vLLM. Use it for evaluation, experimentation, and moderate-load deployments. Features like continuous batching increase throughput and lower latency.
For large-scale production deployments, use vLLM, SGLang, or TGI with a Transformers model as the backend. Learn more in the Inference backends guide.
The transformers serve command spawns a local server compatible with the OpenAI SDK. The server works with many third-party applications and supports the REST APIs below.
- /v1/chat/completions for text and image requests
- /v1/responses supports the Responses API
- /v1/audio/transcriptions for audio transcriptions
- /v1/models lists available models for third-party integrations
- /load_model streams model loading progress via SSE
Install the serving dependencies.
pip install transformers[serving]
Run transformers serve to launch a server. The default server address is http://localhost:8000.
transformers serve
v1/chat/completions
The v1/chat/completions API is based on the Chat Completions API. It supports text and image-based requests for LLMs and VLMs. Use it with curl, the AsyncInferenceClient, or the OpenAI client.
Text-based completions
curl -X POST http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 0.9, "max_tokens": 1000, "stream": true, "model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The command returns the following response.
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
data: {"object": "chat.completion.chunk", "id": "req_0", "created": 1751377863, "model": "Qwen/Qwen2.5-0.5B-Instruct", "system_fingerprint": "", "choices": [{"delta": {"role": "assistant", "content": "", "tool_call_id": null, "tool_calls": null}, "index": 0, "finish_reason": null, "logprobs": null}]}
(...)

Text and image-based completions
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-VL-7B-Instruct",
"stream": true,
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
}
}
]
}
],
"max_tokens": 300
}'
The command returns the following response.
data: {"id":"req_0","choices":[{"delta":{"role":"assistant"},"index":0}],"created":1753366665,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}
data: {"id":"req_0","choices":[{"delta":{"content":"The "},"index":0}],"created":1753366701,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}
data: {"id":"req_0","choices":[{"delta":{"content":"image "},"index":0}],"created":1753366701,"model":"Qwen/Qwen2.5-VL-7B-Instruct@main","object":"chat.completion.chunk","system_fingerprint":""}

v1/responses
The v1/responses API is still experimental and there may be bugs. Please open an issue if you encounter any errors.
The Responses API is OpenAI’s latest API endpoint for generation. It supports stateful interactions and integrates built-in tools to extend a model’s capabilities. OpenAI recommends using the Responses API over the Chat Completions API for new projects.
The v1/responses API supports text-based requests for LLMs through the curl command and OpenAI client.
curl http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-0.5B-Instruct",
"stream": true,
"input": "Tell me a three sentence bedtime story about a unicorn."
}'

The command returns the following response.
data: {"response":{"id":"resp_req_0","created_at":1754059817.783648,"model":"Qwen/Qwen2.5-0.5B-Instruct@main","object":"response","output":[],"parallel_tool_calls":false,"tool_choice":"auto","tools":[],"status":"queued","text":{"format":{"type":"text"}}},"sequence_number":0,"type":"response.created"}
data: {"response":{"id":"resp_req_0","created_at":1754059817.783648,"model":"Qwen/Qwen2.5-0.5B-Instruct@main","object":"response","output":[],"parallel_tool_calls":false,"tool_choice":"auto","tools":[],"status":"in_progress","text":{"format":{"type":"text"}}},"sequence_number":1,"type":"response.in_progress"}
data: {"item":{"id":"msg_req_0","content":[],"role":"assistant","status":"in_progress","type":"message"},"output_index":0,"sequence_number":2,"type":"response.output_item.added"}
data: {"content_index":0,"item_id":"msg_req_0","output_index":0,"part":{"annotations":[],"text":"","type":"output_text"},"sequence_number":3,"type":"response.content_part.added"}
data: {"content_index":0,"delta":"","item_id":"msg_req_0","output_index":0,"sequence_number":4,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"Once ","item_id":"msg_req_0","output_index":0,"sequence_number":5,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"upon ","item_id":"msg_req_0","output_index":0,"sequence_number":6,"type":"response.output_text.delta"}
data: {"content_index":0,"delta":"a ","item_id":"msg_req_0","output_index":0,"sequence_number":7,"type":"response.output_text.delta"}

v1/audio/transcriptions
The v1/audio/transcriptions endpoint transcribes audio using speech-to-text models. It follows the Audio transcription API format.
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/audio.wav" \
  -F "model=openai/whisper-large-v3"
The command returns the following response.
{
  "text": "Transcribed text from the audio file"
}

v1/models
The v1/models endpoint scans your local Hugging Face cache and returns a list of downloaded models in the OpenAI-compatible format. Third-party tools use this endpoint to discover available models.
Use the command below to download a model before running transformers serve.
transformers download Qwen/Qwen2.5-0.5B-Instruct
The model is now discoverable by the /v1/models endpoint.
curl http://localhost:8000/v1/models
This command returns a JSON object containing the list of models.
Loading models
The /load_model endpoint pre-loads a model and streams progress via Server-Sent Events (SSE). The transformers chat CLI uses it automatically so users see download and loading progress instead of a hanging prompt. It’s also useful for warming up a model before sending inference requests.
Request
curl -N -X POST http://localhost:8000/load_model \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-0.5B-Instruct"}'

The model field is a Hugging Face model identifier, optionally with an @revision suffix (e.g., meta-llama/Llama-3.2-1B-Instruct@main). If no revision is specified, main is assumed.
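A client can mirror the server's defaulting rule with a small helper like this sketch (the split_model_field name is illustrative):

```python
def split_model_field(model: str) -> tuple[str, str]:
    """Split a model field into (repo_id, revision), defaulting the
    revision to "main" when no @revision suffix is present."""
    repo_id, _, revision = model.partition("@")
    return repo_id, revision or "main"

print(split_model_field("meta-llama/Llama-3.2-1B-Instruct@main"))
# ('meta-llama/Llama-3.2-1B-Instruct', 'main')
print(split_model_field("Qwen/Qwen2.5-0.5B-Instruct"))
# ('Qwen/Qwen2.5-0.5B-Instruct', 'main')
```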
Response
The response is an SSE stream (Content-Type: text/event-stream). Each frame is a JSON object on a data: line:
data: {"status": "loading", "model": "Qwen/Qwen2.5-0.5B-Instruct@main", "stage": "processor"}

Every event contains at minimum a status and model field. Additional fields depend on the status:
| Field | Present when | Description |
|---|---|---|
| status | Always | loading, ready, or error |
| model | Always | Canonical model_id@revision |
| stage | status == "loading" | One of processor, config, download, weights (see stages below) |
| progress | download and weights stages | Object with current and total (integer or null) |
| cached | status == "ready" | true if the model was already in memory |
| message | status == "error" | Error description |
Stages
Loading progresses through these stages in order. Some may be skipped (e.g., download is skipped when files are already cached locally).
| Stage | Has progress? | Description |
|---|---|---|
| processor | No | Loading the tokenizer/processor |
| config | No | Loading model configuration |
| download | Yes (bytes) | Downloading model files |
| weights | Yes (items) | Loading weight tensors into memory |
The stream ends with exactly one terminal event: ready (success) or error (failure).
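A small client sketch that consumes this stream with only the standard library (the parse_sse_event and load_model helper names are illustrative, and the server must be running for load_model to work):

```python
import json
import urllib.request

def parse_sse_event(line: str):
    """Return the JSON payload of an SSE `data:` line, or None for
    any other line (comments, blank keep-alive lines)."""
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):].strip())

def load_model(model: str, base_url: str = "http://localhost:8000"):
    """POST to /load_model and yield progress events until the single
    terminal event (status "ready" or "error")."""
    request = urllib.request.Request(
        f"{base_url}/load_model",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            event = parse_sse_event(raw_line.decode("utf-8"))
            if event is None:
                continue
            yield event
            if event["status"] in ("ready", "error"):
                break

# With the server running:
# for event in load_model("Qwen/Qwen2.5-0.5B-Instruct"):
#     print(event)
```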
Examples
Fresh load with download:
data: {"status": "loading", "model": "org/model@main", "stage": "processor"}
data: {"status": "loading", "model": "org/model@main", "stage": "config"}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 0, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 134600000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "download", "progress": {"current": 269100000, "total": 269100000}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 1, "total": 272}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "org/model@main", "cached": false}

Files already cached locally (no download stage):
data: {"status": "loading", "model": "org/model@main", "stage": "processor"}
data: {"status": "loading", "model": "org/model@main", "stage": "config"}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 1, "total": 272}}
data: {"status": "loading", "model": "org/model@main", "stage": "weights", "progress": {"current": 272, "total": 272}}
data: {"status": "ready", "model": "org/model@main", "cached": false}

Model already in memory:
data: {"status": "ready", "model": "org/model@main", "cached": true}

Tool calling
The transformers serve server supports OpenAI-style function calling. Models trained for tool-use generate structured function calls that your application executes.
Tool calling is currently limited to the Qwen model family.
Define tools as a list of function specifications following the OpenAI format.
import json
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="<KEY>")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather in a location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city name, e.g. San Francisco"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "temperature unit"
}
},
"required": ["location"]
}
}
}
]

Pass a dictionary of GenerationConfig parameters to the extra_body argument of create to customize model generation.
generation_config = {
"max_new_tokens": 512,
"temperature": 0.7,
"top_p": 0.9,
"top_k": 50,
"do_sample": True,
"repetition_penalty": 1.1,
"no_repeat_ngram_size": 3,
}
response = client.responses.create(
model="Qwen/Qwen2.5-7B-Instruct",
instructions="You are a helpful weather assistant. Use the get_weather tool to answer questions.",
input="What's the weather like in San Francisco?",
tools=tools,
stream=True,
extra_body={"generation_config": json.dumps(generation_config)}
)
for event in response:
print(event)

Port forwarding
The transformers serve server supports port forwarding, which lets you serve models from a remote machine. Make sure you have SSH access from your device to the server, then run the following command on your device to set up port forwarding.
ssh -N -f -L 8000:localhost:8000 your_server_account@your_server_IP -p port_to_ssh_into_your_server
Reproducibility
Add the --force-model <repo_id> argument to pin the server to a single model and ignore per-request model hints. This produces stable, repeatable runs.
transformers serve \
--force-model Qwen/Qwen2.5-0.5B-Instruct \
--continuous-batching \
--dtype "bfloat16"