# TensorRT-LLM

[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) optimizes LLM inference on NVIDIA GPUs. It compiles models into a TensorRT engine with in-flight batching, paged KV caching, and tensor parallelism. [AutoDeploy](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) accepts Transformers models without requiring any changes and automatically converts them to an optimized runtime.

Pass a model id from the Hub to [build_and_run_ad.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/auto_deploy/build_and_run_ad.py) to run a Transformers model.

```bash
cd examples/auto_deploy
python build_and_run_ad.py --model meta-llama/Llama-3.2-1B
```

Under the hood, AutoDeploy creates an [LLM](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.LLM) instance. It loads the model configuration with [AutoConfig.from_pretrained()](/docs/transformers/v5.8.0/en/model_doc/auto#transformers.AutoConfig.from_pretrained) and extracts any parallelism metadata stored in `tp_plan`. [AutoModelForCausalLM.from_pretrained()](/docs/transformers/v5.8.0/en/model_doc/auto#transformers.AutoModel.from_pretrained) then loads the model with that config and enables Transformers' built-in tensor parallelism.

```py
from tensorrt_llm._torch.auto_deploy import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B")
```
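The sketch below mirrors those loading steps directly in Transformers. It is illustrative only, not TensorRT-LLM's actual code: the `tp_plan` config attribute lookup and the `tp_plan="auto"` argument are assumptions based on the description above, and tensor parallelism requires launching the script across multiple GPUs (for example with `torchrun`).

```py
# Illustrative sketch of the loading steps described above (not TensorRT-LLM's
# actual code): read the config, check for parallelism metadata, and load the
# model with Transformers' built-in tensor parallelism.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"

# Load the model configuration; `tp_plan` is the parallelism metadata referenced
# above (assumed attribute name, absent for many models).
config = AutoConfig.from_pretrained(model_id)
tp_plan = getattr(config, "tp_plan", None)

# Load the weights with the config and enable built-in tensor parallelism.
# `tp_plan="auto"` shards the model across the GPUs visible to the process,
# so this is typically launched with torchrun on a multi-GPU node.
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, tp_plan="auto")
```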

TensorRT-LLM extracts the model graph with `torch.export` and applies optimizations. It replaces Transformers attention with TensorRT-LLM [attention kernels](https://github.com/NVIDIA/TensorRT-LLM/tree/main/tensorrt_llm/_torch/attention_backend) and compiles the model into an optimized execution backend.
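As a rough illustration of that first step, the snippet below uses `torch.export` to capture a Transformers model's forward pass as a graph. This is a generic sketch of graph extraction, not AutoDeploy's export pipeline, and depending on the model the export may need extra handling (for example `strict=False` or explicit dynamic shapes).

```py
# Minimal sketch of graph extraction with torch.export; AutoDeploy performs a
# similar capture before replacing attention ops and compiling the graph.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tokenizer("Hello", return_tensors="pt")

# Capture the forward pass as an ExportedProgram whose graph can then be
# rewritten (e.g. swapping attention for optimized kernels) and compiled.
exported = torch.export.export(
    model,
    args=(),
    kwargs={"input_ids": inputs["input_ids"]},
    strict=False,  # relaxed tracing; some models need this to export cleanly
)
print(exported.graph_module.graph)
```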

## Resources

- [TensorRT-LLM docs](https://nvidia.github.io/TensorRT-LLM/) for more detailed usage guides.
- [AutoDeploy guide](https://nvidia.github.io/TensorRT-LLM/torch/auto_deploy/auto-deploy.html) explains how it works with advanced examples.

