PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Paper: arXiv:2510.14528
This model is the ONNX-converted version of PaddlePaddle/PP-DocLayoutV2, converted with Paddle2ONNX v2.1.0 at ONNX Opset 17. No PaddlePaddle installation is required; inference runs directly through ONNX Runtime.
Inputs:

| Name | Shape | Type | Description |
|---|---|---|---|
| `im_shape` | [batch, 2] | float32 | Model input size, fixed at [800, 800] (not the original image size) |
| `image` | [batch, 3, 800, 800] | float32 | Preprocessed image tensor (CHW, RGB, values in [0, 1]) |
| `scale_factor` | [batch, 2] | float32 | Scale ratios [scale_h, scale_w], where scale_h = 800 / original_height and scale_w = 800 / original_width |
Note: `im_shape` takes the model's input resolution (800, 800), not the original image size. `scale_factor` is "target size / original size"; do not reverse the direction.
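As a quick sanity check on the direction, a minimal sketch using a hypothetical 600x1000 (H x W) page:

```python
import numpy as np

original_height, original_width = 600, 1000  # hypothetical page size (H, W)

# Correct direction: target size / original size, ordered [scale_h, scale_w]
scale_factor = np.array([[800 / original_height, 800 / original_width]], dtype=np.float32)
# scale_h is ~1.333 (upscaling the height), scale_w is 0.8 (downscaling the width)
```

If the two factors were swapped, restored box coordinates would be stretched along the wrong axis.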
Outputs:

| Name | Shape | Type | Description |
|---|---|---|---|
| `fetch_name_0` | [N, 8] | float32 | Detection results: [label_id, score, xmin, ymin, xmax, ymax, -, -] |
| `fetch_name_1` | [batch] | int32 | Number of valid detections per image |
Coordinates are already restored to the original image's pixel space.
Per the original model's inference.yml (norm_type: none, so no ImageNet mean/std normalization is applied):
```python
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Load the image
image = cv2.imread("document.jpg")  # BGR
original_height, original_width = image.shape[:2]

# Preprocess: resize to the fixed 800x800 input, convert BGR -> RGB, scale to [0, 1]
resized = cv2.resize(image, (800, 800))
resized_rgb = cv2.cvtColor(resized, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
image_tensor = np.transpose(resized_rgb, (2, 0, 1))[np.newaxis, :]  # [1, 3, 800, 800]

# Prepare inputs
# im_shape is the model's input size (fixed 800x800), not the original image size
im_shape = np.array([[800, 800]], dtype=np.float32)
# scale_factor = target size / original size
scale_factor = np.array([[800 / original_height, 800 / original_width]], dtype=np.float32)

# Run inference
outputs = session.run(
    ["fetch_name_0", "fetch_name_1"],
    {
        "im_shape": im_shape,
        "image": image_tensor,
        "scale_factor": scale_factor,
    },
)

detections = outputs[0]  # shape: (N, 8)
valid_count = int(outputs[1][0])  # number of valid detections
results = detections[:valid_count]
# results[:, 0] = label_id, results[:, 1] = score
# results[:, 2:6] = [xmin, ymin, xmax, ymax] (already in original image space)
```
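The raw detections can then be filtered and clipped before use. A minimal sketch, using a dummy `results` array in place of real model output and a hypothetical confidence threshold:

```python
import numpy as np

# Dummy stand-in for model output, in the [label_id, score, xmin, ymin, xmax, ymax, -, -] layout
results = np.array([
    [0, 0.95, 10.0, 20.0, 300.0, 400.0, 0, 0],
    [2, 0.30, 50.0, 60.0, 200.0, 250.0, 0, 0],    # below the threshold, dropped
    [1, 0.80, -5.0, 700.0, 900.0, 1300.0, 0, 0],  # exceeds the image bounds, clipped
], dtype=np.float32)
original_height, original_width = 1200, 850  # hypothetical page size

score_threshold = 0.5  # hypothetical; tune per use case
kept = results[results[:, 1] >= score_threshold]

# Clip x coordinates to [0, width] and y coordinates to [0, height]
kept[:, [2, 4]] = np.clip(kept[:, [2, 4]], 0, original_width)
kept[:, [3, 5]] = np.clip(kept[:, [3, 5]], 0, original_height)

for label_id, score, xmin, ymin, xmax, ymax in kept[:, :6]:
    print(f"class {int(label_id)}: score={score:.2f}, "
          f"box=({xmin:.0f}, {ymin:.0f}, {xmax:.0f}, {ymax:.0f})")
```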
```bibtex
@misc{cui2025paddleocrvlboostingmultilingualdocument,
  title={PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model},
  author={Cheng Cui and Ting Sun and Suyin Liang and Tingquan Gao and Zelun Zhang and Jiaxuan Liu and Xueqing Wang and Changda Zhou and Hongen Liu and Manhui Lin and Yue Zhang and Yubo Zhang and Handong Zheng and Jing Zhang and Jun Zhang and Yi Liu and Dianhai Yu and Yanjun Ma},
  year={2025},
  eprint={2510.14528},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.14528},
}
```
Base model: PaddlePaddle/PP-DocLayoutV2