ExecuTorch
ExecuTorch is a lightweight runtime for model inference on edge devices. It exports a PyTorch model into a portable, ahead-of-time format. A small C++ runtime plans memory and dispatches operations to hardware-specific backends. Execution and memory behavior are known before the model runs on device, so inference overhead is low.
Export a Transformers model with the optimum-executorch library.
CLI
optimum-cli export executorch \
    --model "HuggingFaceTB/SmolLM2-135M-Instruct" \
    --task "text-generation" \
    --recipe "xnnpack" \
    --output_dir="./smollm2_exported"
Python
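The same export can also be driven from Python. A minimal sketch, assuming the optimum-executorch package exposes ExecuTorchModelForCausalLM and a recipe argument that mirrors the CLI flag:
from optimum.executorch import ExecuTorchModelForCausalLM

# Assumed API: loading a Transformers checkpoint through optimum-executorch
# exports it to the ExecuTorch format; "xnnpack" selects the CPU backend recipe.
model = ExecuTorchModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct",
    recipe="xnnpack",
)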
Transformers integration
The export process uses several Transformers components.
- from_pretrained() loads the model weights in safetensors format.
- Optimum applies graph optimizations and runs torch.export to create a model.pte file targeting your hardware backend (see the sketch after this list).
- AutoTokenizer or AutoProcessor loads the tokenizer or processor files, which run during inference.
- At runtime, a C++ runner class executes the .pte file on the ExecuTorch runtime, as in the C++ example below.
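For intuition, the export step above roughly follows the standard torch.export and ExecuTorch lowering flow. A minimal sketch on a toy module, assuming the public executorch.exir API; the module, shapes, and output file name are illustrative and omit the partitioner and quantization choices a real recipe makes:
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):  # stand-in for a real Transformers model
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 2)

    def forward(self, x):
        return self.linear(x)

# Ahead-of-time: torch.export captures the graph with example inputs
exported = torch.export.export(TinyModel().eval(), (torch.randn(1, 8),))

# Lower to the ExecuTorch edge dialect and serialize the program to a .pte file
edge = to_edge(exported)
with open("tiny_model.pte", "wb") as f:
    f.write(edge.to_executorch().buffer)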
#include <iostream>
#include <executorch/extension/llm/runner/text_llm_runner.h>

using namespace executorch::extension::llm;

int main() {
  // Load tokenizer and create runner
  auto tokenizer = load_tokenizer("path/to/tokenizer.json", nullptr, std::nullopt, 0, 0);
  auto runner = create_text_llm_runner("path/to/model.pte", std::move(tokenizer));

  // Load the model
  runner->load();

  // Configure generation
  GenerationConfig config;
  config.max_new_tokens = 100;
  config.temperature = 0.8f;

  // Generate text with streaming output
  runner->generate("The capital of France is", config,
      [](const std::string& token) { std::cout << token << std::flush; },
      nullptr);

  return 0;
}
Resources
- ExecuTorch docs
- torch.export docs
- Exporting to production guide