
TensorRT-LLM

TensorRT-LLM optimizes LLM inference on NVIDIA GPUs. It compiles models into a TensorRT engine with in-flight batching, paged KV caching, and tensor parallelism. AutoDeploy accepts Transformers models without requiring any changes and automatically converts them to an optimized runtime.

Pass a model id from the Hub to build_and_run_ad.py to run a Transformers model.

cd examples/auto_deploy
python build_and_run_ad.py --model meta-llama/Llama-3.2-1B

Under the hood, AutoDeploy creates an LLM instance. It loads the model configuration with AutoConfig.from_pretrained() and extracts any parallelism metadata stored in tp_plan. AutoModelForCausalLM.from_pretrained() then loads the model with this config and enables Transformers' built-in tensor parallelism.

from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B")
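The snippet below is a minimal sketch of that Transformers-side loading, shown outside of AutoDeploy. The attribute used to read the parallelism metadata and the single-device loading path are assumptions for illustration, not TensorRT-LLM's actual internal code.

from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"

# Load the configuration and look for tensor-parallel metadata.
config = AutoConfig.from_pretrained(model_id)
# Attribute name is an assumption; it may differ across Transformers versions.
tp_plan = getattr(config, "base_model_tp_plan", None)

# Load the model with that config. Passing tp_plan="auto" instead enables
# Transformers' built-in tensor parallelism when launched with torchrun on
# multiple GPUs; the plain call below loads the model on a single device.
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
print(type(model).__name__, "tensor parallel plan:", tp_plan)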

TensorRT-LLM extracts the model graph with torch.export and applies optimizations. It replaces Transformers attention with TensorRT-LLM attention kernels and compiles the model into an optimized execution backend.
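As a rough illustration of what graph extraction looks like, the sketch below runs torch.export on a small stand-in module. It is not TensorRT-LLM's export pipeline, but the resulting graph is the kind of representation its optimization passes rewrite.

import torch
from torch import nn


class ToyBlock(nn.Module):
    """Stand-in module used only to show the export workflow."""

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.gelu(self.proj(x))


example_input = torch.randn(1, 8, 64)
exported = torch.export.export(ToyBlock(), (example_input,))

# The exported graph lists every operation, so a tool can pattern-match nodes
# here, for example to swap an attention op for an optimized kernel.
print(exported.graph_module.graph)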

Resources
