---
license: mit
tags:
- security
- vulnerability-report
---

# Vulnerability Report: Unauthenticated RCE in TensorRT-LLM (MGMN Leader Node)

## Summary
I have identified a critical Remote Code Execution (RCE) vulnerability in the TensorRT-LLM Multi-GPU Multi-Node (MGMN) launcher. The `mgmn_leader_node.py` script initializes an IPC server without enforcing HMAC authentication. Combined with insecure environment variable handling, this allows a local or network attacker to force the server to bind to an external interface and then execute arbitrary code through it.

## Vulnerability Details
**Component:** `tensorrt_llm/llmapi/mgmn_leader_node.py` and `tensorrt_llm/llmapi/mpi_session.py`

**Root Cause:**
1. **Insecure Default Initialization:**
   In `tensorrt_llm/llmapi/mgmn_leader_node.py`, the `RemoteMpiCommSessionServer` is initialized without passing an `hmac_key`.

   ```python
   # mgmn_leader_node.py
   server = RemoteMpiCommSessionServer(
       comm=sub_comm,
       n_workers=num_ranks,
       addr=get_spawn_proxy_process_ipc_addr_env(),
       is_comm=True)  # MISSING hmac_key
   ```

2. **Security Fallback Failure:**
   In `tensorrt_llm/llmapi/mpi_session.py`, the `__init__` method sets `use_hmac_encryption` to `False` if no key is provided.

   ```python
   # mpi_session.py
   self.queue = ZeroMqQueue(..., use_hmac_encryption=bool(hmac_key))
   ```

   This disables the signature check on the IPC socket, allowing unauthenticated `pickle.loads` deserialization.

3. **Insecure Environment Variable Handling:**
   The bind address is derived from `TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR` (in `utils.py`), which can be set by anyone able to influence the environment the service is launched in before it starts.

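To make the role of the missing signature check concrete, here is a minimal sketch of verify-before-unpickle framing using only the Python standard library. It is not TensorRT-LLM's implementation: the function names `sign` and `verify_and_load`, the choice of HMAC-SHA256, and the tag-then-body frame layout are all assumptions made for illustration. What it shows is that a receiver holding a shared key can reject a frame before `pickle.loads` is ever reached.

```python
import hashlib
import hmac
import pickle
import secrets

DIGEST_SIZE = hashlib.sha256().digest_size  # length of an HMAC-SHA256 tag in bytes


def sign(key: bytes, obj: object) -> bytes:
    """Serialize obj and prepend an HMAC-SHA256 tag computed over the bytes."""
    body = pickle.dumps(obj)
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return tag + body


def verify_and_load(key: bytes, frame: bytes) -> object:
    """Check the tag in constant time and unpickle only if it matches."""
    if len(frame) < DIGEST_SIZE:
        raise ValueError("frame too short to contain a signature")
    tag, body = frame[:DIGEST_SIZE], frame[DIGEST_SIZE:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed; payload rejected")
    # Reached only for frames produced by a holder of the shared key.
    return pickle.loads(body)


key = secrets.token_bytes(32)  # hypothetical shared secret
result = verify_and_load(key, sign(key, {"task": "ping"}))

# A frame signed with a different key is rejected before deserialization.
forged = sign(secrets.token_bytes(32), {"task": "ping"})
try:
    verify_and_load(key, forged)
    forged_rejected = False
except ValueError:
    forged_rejected = True
```

As described in the root cause, when no key is configured the check corresponding to the `compare_digest` branch does not run, so every received frame reaches the deserializer.
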
## Attack Scenario
1. An attacker sets the environment variable: `export TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR="tcp://0.0.0.0:4444"`
2. The victim (or an automated orchestration system) executes `mgmn_leader_node.py`.
3. The server binds to port 4444 on all interfaces with HMAC verification disabled.
4. The attacker connects to port 4444 and sends a malicious pickle payload containing shell commands (e.g., a reverse shell).
5. The `ZeroMqQueue` class deserializes the payload without verification, executing the attacker's code with the privileges of the TensorRT-LLM process.

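The last two steps of the scenario rely on a documented property of `pickle`: loading data can invoke a callable that is named inside the data itself. The following self-contained sketch shows that mechanism with a harmless callable (`os.getpid`) standing in for an attacker-chosen one. The `Payload` class is hypothetical, and nothing here touches TensorRT-LLM or a network socket.

```python
import os
import pickle


class Payload:
    """Any object can tell pickle how to 'rebuild' it via __reduce__."""

    def __reduce__(self):
        # pickle records this (callable, args) pair and calls it on load.
        # os.getpid is a harmless stand-in for an attacker-chosen callable.
        return (os.getpid, ())


blob = pickle.dumps(Payload())

# The receiver does not need the Payload class at all: the bytes only name
# the callable, and deserializing them invokes it in the receiving process.
loaded = pickle.loads(blob)

print(type(loaded).__name__)    # the call's return value, not a Payload
print(loaded == os.getpid())    # the callable ran inside this process
```

Because the call happens during deserialization, any check has to run on the raw bytes before `pickle.loads`; inspecting the resulting object afterwards is too late.
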
## Impact
This vulnerability allows Arbitrary Code Execution (ACE). In shared cluster environments (e.g., Slurm or Kubernetes), a low-privileged user can use it to escalate privileges or move laterally to other nodes running TensorRT-LLM.
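
Whether other nodes can reach the server depends on the address it binds to. As a rough illustration, the sketch below classifies a ZeroMQ-style address string, such as the value of `TLLM_SPAWN_PROXY_PROCESS_IPC_ADDR`, as local-only or externally reachable. The helper `is_externally_reachable` is hypothetical and not part of TensorRT-LLM. It assumes only the `tcp`, `ipc`, and `inproc` transports and conservatively treats any hostname other than `localhost` as external.

```python
import ipaddress
from urllib.parse import urlsplit


def is_externally_reachable(addr: str) -> bool:
    """Return True if binding to addr would listen beyond the local host."""
    parts = urlsplit(addr)
    if parts.scheme in ("ipc", "inproc"):
        return False  # local transports, no network listener
    if parts.scheme != "tcp":
        raise ValueError(f"unsupported transport: {parts.scheme!r}")
    host = parts.hostname or ""
    if host in ("*", ""):
        return True  # "*" is the ZeroMQ wildcard for all interfaces
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: only "localhost" is assumed to be local.
        return host != "localhost"
    return not ip.is_loopback


for candidate in ("tcp://0.0.0.0:4444", "tcp://127.0.0.1:4444", "ipc:///tmp/example.sock"):
    print(candidate, is_externally_reachable(candidate))
```

The address from the attack scenario, `tcp://0.0.0.0:4444`, is classified as externally reachable, while a loopback or `ipc://` address is not.
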