docs: add Windows ONNX Runtime usage (CPU / NPU / GPU) with WinML CLI

This PR adds a short "Run as ONNX (CPU / NPU / GPU)" subsection to the
Usage section. It points users to three tools for converting and
running the model outside the original PyTorch path:

- Microsoft's WinML CLI for conversion + quantization
- Windows ML for on-device inference (NPU / GPU / CPU)
- ONNX Runtime for cross-platform inference
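For the ONNX Runtime route, the main decision is the execution-provider list passed to the session. The sketch below uses plain ONNX Runtime, not the Windows ML API. It prefers NPU and GPU providers when the installed onnxruntime build reports them, and it always keeps `CPUExecutionProvider` as the fallback. The helper names `pick_providers` and `create_session`, the preference order, and the model path are hypothetical. The OpenVINO EP chooses its target device through the `device_type` provider option, so the sketch requests `"NPU"` explicitly. Other providers may need their own options, so check each provider's documentation.

```python
from typing import Dict, Iterable, List, Tuple, Union

# Hypothetical preference order: NPU-capable EPs first, then GPU, then CPU.
PREFERRED_PROVIDERS = [
    "OpenVINOExecutionProvider",  # Intel CPU / GPU / NPU (device chosen via options)
    "QNNExecutionProvider",       # Qualcomm NPU
    "DmlExecutionProvider",       # DirectML GPU on Windows
    "CUDAExecutionProvider",      # NVIDIA GPU
    "CPUExecutionProvider",       # default fallback
]

# ONNX Runtime accepts either a provider name or a (name, options) tuple.
Provider = Union[str, Tuple[str, Dict[str, str]]]


def pick_providers(available: Iterable[str], openvino_device: str = "NPU") -> List[Provider]:
    """Return the available providers in preference order, with CPU always last."""
    available_set = set(available)
    chosen: List[Provider] = []
    for name in PREFERRED_PROVIDERS:
        if name not in available_set:
            continue
        if name == "OpenVINOExecutionProvider":
            # The OpenVINO EP reads its target device from the "device_type" option.
            chosen.append((name, {"device_type": openvino_device}))
        else:
            chosen.append(name)
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def create_session(model_path: str):
    """Open an ONNX Runtime session on the best available provider."""
    import onnxruntime as ort  # imported here so pick_providers has no dependency

    return ort.InferenceSession(
        model_path, providers=pick_providers(ort.get_available_providers())
    )
```

For example, `create_session("model.onnx").get_providers()` reports which providers the session actually uses. Here `"model.onnx"` is a placeholder for whatever file the WinML CLI export produces.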

Why merge:

- Reaches a much wider audience. Windows is the largest desktop
install base, and Windows ML now ships built-in NPU support on
Copilot+ PCs. Surfacing an NPU path on the model card makes this
model directly discoverable to Windows app developers who would
otherwise skip it because the existing card only documents the
PyTorch path.

- Removes the biggest deployment blocker. On an Intel Core Ultra 7
258V, fp32 PyTorch inference on the CPU takes ~621 ms per image
(mean). The w8a16-quantized ONNX build on the NPU takes ~44 ms
(~14x faster) at roughly half the file size, with mAP within 1% of
the baseline (0.9822 vs. 0.9887). That moves this model from
"batch / offline" into "interactive UX" territory for document /
PDF tools.
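The headline ratios follow directly from the benchmark table in the README diff. Below is a quick check in Python. The values are copied from that table and the variable names are illustrative. The comparison is fp32 PyTorch on the CPU against the w8a16 ONNX build on the NPU, so the speedup reflects the runtime, the device, and quantization together rather than any one of them.

```python
# Figures copied from the README benchmark table
# (Intel Core Ultra 7 258V, PubTables-1M validation, 1000 samples).
pytorch_cpu = {"map": 0.9887, "mean_ms": 620.9, "p50_ms": 600.3, "size_mb": 115}
onnx_npu = {"map": 0.9822, "mean_ms": 44.1, "p50_ms": 41.6, "size_mb": 58}

speedup = pytorch_cpu["mean_ms"] / onnx_npu["mean_ms"]
p50_speedup = pytorch_cpu["p50_ms"] / onnx_npu["p50_ms"]
size_ratio = onnx_npu["size_mb"] / pytorch_cpu["size_mb"]
map_drop_abs = pytorch_cpu["map"] - onnx_npu["map"]
map_drop_rel = map_drop_abs / pytorch_cpu["map"]

print(f"mean speedup: {speedup:.1f}x")      # mean speedup: 14.1x
print(f"p50 speedup:  {p50_speedup:.1f}x")  # p50 speedup:  14.4x
print(f"size ratio:   {size_ratio:.2f}")    # size ratio:   0.50
print(f"mAP drop:     {map_drop_abs:.4f} abs, {map_drop_rel:.2%} rel")  # 0.0065 abs, 0.66% rel
```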

Benchmark numbers and a full reproduction walkthrough live at:
https://github.com/microsoft/winml-cli/blob/main/examples/microsoft-table-transformer-detection/README.md
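The README table lists the NPU build benchmarked in that walkthrough as w8a16 (QDQ). That means weights are stored as 8-bit integers and activations as 16-bit integers, and each quantized tensor is represented in the ONNX graph by a QuantizeLinear / DequantizeLinear (QDQ) pair. The NumPy sketch below roughly illustrates what such a pair does numerically, using standard affine quantization. It is not the WinML CLI's calibration procedure. The symmetric int8 weight scale, the min/max uint16 activation range, and the random test tensors are all assumptions.

```python
import numpy as np


def quantize_linear(x, scale, zero_point, dtype):
    """QuantizeLinear: round(x / scale) + zero_point, saturated to dtype's range."""
    info = np.iinfo(dtype)
    q = np.round(x / scale) + zero_point  # np.round rounds half to even, like ONNX
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize_linear(q, scale, zero_point):
    """DequantizeLinear: (q - zero_point) * scale."""
    return (q.astype(np.float32) - zero_point) * scale


rng = np.random.default_rng(0)

# "w8": symmetric int8 weights (zero point 0), scale taken from max |w|.
weights = rng.normal(0.0, 0.05, size=(64, 64)).astype(np.float32)
w_scale = float(np.abs(weights).max()) / 127.0
w_q = quantize_linear(weights, w_scale, 0, np.int8)
w_err = float(np.abs(dequantize_linear(w_q, w_scale, 0) - weights).max())

# "a16": asymmetric uint16 activations, scale and zero point from the observed range.
acts = rng.uniform(-1.0, 3.0, size=1000).astype(np.float32)
a_min, a_max = float(acts.min()), float(acts.max())
a_scale = (a_max - a_min) / 65535.0
a_zp = int(round(-a_min / a_scale))
a_q = quantize_linear(acts, a_scale, a_zp, np.uint16)
a_err = float(np.abs(dequantize_linear(a_q, a_scale, a_zp) - acts).max())

print(f"int8 weights:       max error {w_err:.2e} (step {w_scale:.2e})")
print(f"uint16 activations: max error {a_err:.2e} (step {a_scale:.2e})")
```

Both round-trip errors stay within about one quantization step. Over the same range, a 16-bit step is roughly 257x finer than an 8-bit one. This may be one reason a w8a16 build can stay close to fp32 accuracy, but the measured mAP in the table is the actual evidence.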

Files changed (1)

README.md +16 -1

README.md CHANGED

@@ -17,4 +17,19 @@ The Table Transformer is equivalent to [DETR](https://huggingface.co/docs/transf
 ## Usage
-You can use the raw model for detecting tables in documents. See the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/table-transformer) for more info.
+You can use the raw model for detecting tables in documents. See the [documentation](https://huggingface.co/docs/transformers/main/en/model_doc/table-transformer) for more info.
+### Run as ONNX (CPU / NPU / GPU)
+A w8a16-quantized ONNX export of this model detects tables ~14× faster on a Windows NPU than the fp32 PyTorch checkpoint on CPU, at half the model size and with mAP within 1% of the original. Unquantized fp32 ONNX exports can also run on CPU or GPU.
+Benchmarked on an Intel Core Ultra 7 258V (PubTables-1M validation, 1000 samples):
+| Model   | Device       | Precision   | mAP    | Mean latency (ms) | p50 latency (ms) | Size (MB) |
+|---------|--------------|-------------|--------|-------------------|------------------|-----------|
+| PyTorch | CPU          | fp32        | 0.9887 | 620.9             | 600.3            | 115       |
+| ONNX    | OpenVINO NPU | w8a16 (QDQ) | 0.9822 | 44.1              | 41.6             | 58        |
+- **How to convert**: Export and quantize with [Microsoft's WinML CLI](https://github.com/microsoft/winml-cli). The NPU build is QDQ-quantized to w8a16 (8-bit weights, 16-bit activations); fp32 builds for CPU and GPU are also supported. For the end-to-end build, evaluation, and a Python inference example, see [examples/microsoft-table-transformer-detection](https://github.com/microsoft/winml-cli/blob/main/examples/microsoft-table-transformer-detection/README.md).
+- **How to run on Windows**: Use [Windows ML](https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview), which manages execution providers for NPU / GPU / CPU and automatically routes ONNX inference to the appropriate backend.
+- **How to run on other platforms**: Use [ONNX Runtime](https://onnxruntime.ai/docs/) with the execution provider of your choice (OpenVINO, QNN, DirectML, CUDA, CPU, etc.).
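The Table Transformer is equivalent to DETR, so whichever runtime executes the graph, its raw outputs still need DETR-style post-processing. The NumPy sketch below handles one image. It assumes the export keeps the Transformers output layout: `logits` of shape `(num_queries, num_labels + 1)` whose last column is the "no object" class, and `pred_boxes` of shape `(num_queries, 4)` holding normalized `(cx, cy, w, h)`. The function name, the 0.9 threshold, and the default label map (`table`, `table rotated`) are assumptions. Check the example README and the checkpoint's `config.json` for the actual output names and labels.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def postprocess_detections(logits, pred_boxes, image_size, threshold=0.9, id2label=None):
    """Convert one image's DETR-style outputs into (label, score, [x0, y0, x1, y1]).

    logits:     (num_queries, num_labels + 1); the last column is "no object".
    pred_boxes: (num_queries, 4), normalized (cx, cy, w, h).
    image_size: (width, height) of the original image in pixels.
    """
    if id2label is None:
        id2label = {0: "table", 1: "table rotated"}  # assumed label map

    probs = softmax(np.asarray(logits, dtype=np.float32))[:, :-1]  # drop "no object"
    scores = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    keep = scores > threshold

    # Convert kept boxes from normalized center format to pixel corners.
    cx, cy, w, h = np.asarray(pred_boxes, dtype=np.float32)[keep].T
    corners = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)
    width, height = image_size
    corners = corners * np.array([width, height, width, height], dtype=np.float32)

    return [
        (id2label[int(label)], float(score), box.tolist())
        for label, score, box in zip(labels[keep], scores[keep], corners)
    ]
```

With batched ONNX outputs, pass `logits[0]` and `pred_boxes[0]` for the first image. The steps follow the usual DETR recipe: apply a softmax, drop the no-object column, threshold the scores, and convert the boxes to corner format in pixels.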