---
license: apache-2.0
pipeline_tag: image-classification
tags:
- image-classification
- multi-label-classification
- onnx
- openvino
- pdf
- document-understanding
- rag
datasets:
- Wikit/PdfVisClassif
---

# PDF Page Classifier

Multi-label classifier for PDF page images. It determines whether a PDF page
requires image embedding (vs. text-only) in RAG pipelines.

Backbone: EfficientNet-Lite0. Exported to ONNX (FP32) and to OpenVINO INT8, the
latter obtained via Quantization-Aware Training (QAT). **No PyTorch required at inference time.**

## Classes

- `Complex Table`
- `Simple Table`
- `Visual - Essential`
- `Visual - Supportive`

Pages matching any of the following classes should trigger image embedding:

- `Complex Table`
- `Visual - Essential`

Default threshold: `0.5`

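
The decision rule described above can be sketched in a few lines. This is only an illustration: the class names and the `0.5` default come from this card, while the function name, the input shape (a dict of class name to probability), and the use of `>=` for the comparison are assumptions, not the shipped implementation.

```python
# Illustrative sketch only; not the code shipped in classifiers/ or chunknorris.
# Classes that should trigger image embedding, as listed in this card.
TRIGGER_CLASSES = {"Complex Table", "Visual - Essential"}
DEFAULT_THRESHOLD = 0.5  # default threshold stated in this card


def needs_image_embedding(probabilities, threshold=DEFAULT_THRESHOLD):
    """Return (flag, predicted_classes) for one page.

    `probabilities` is assumed to map class name -> probability in [0, 1].
    A class counts as predicted when its probability reaches the threshold
    (whether the real comparison is >= or > is an assumption here).
    """
    predicted = [name for name, p in probabilities.items() if p >= threshold]
    flag = any(name in TRIGGER_CLASSES for name in predicted)
    return flag, predicted
```

Because the task is multi-label, several classes can pass the threshold on the same page; the page is flagged as soon as one of them is a trigger class.
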
## Usage

### With [chunknorris](https://github.com/wikit-ai/chunknorris) (recommended)

```bash
# Install one of the two backends:
pip install "chunknorris[ml-onnx]"       # ONNX backend
pip install "chunknorris[ml-openvino]"   # OpenVINO INT8, fastest on CPU
```

```python
from chunknorris.ml import load_classifier

clf = load_classifier("Wikit/pdf-pages-classifier")  # auto-selects best available backend
result = clf.predict("page.png")
# {"needs_image_embedding": True, "predicted_classes": [...], "probabilities": {...}}
```

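
In a RAG pipeline the classifier is typically called once per page. The sketch below splits a list of page images into those that need image embedding and those that can stay text-only. `route_pages` is a hypothetical helper, and it assumes `predict` returns a dict with a `needs_image_embedding` key, as suggested by the example output above.

```python
from typing import Any, Iterable, List, Tuple


def route_pages(clf: Any, page_paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split page images into (image_embedding_pages, text_only_pages).

    `clf` is any object with a `predict(path)` method whose result exposes
    a "needs_image_embedding" key (an assumption based on the example output).
    """
    image_pages: List[str] = []
    text_pages: List[str] = []
    for path in page_paths:
        result = clf.predict(path)
        if result["needs_image_embedding"]:
            image_pages.append(path)
        else:
            text_pages.append(path)
    return image_pages, text_pages
```

The helper only relies on the `predict` method, so it should work with a classifier loaded through chunknorris or through the standalone scripts.
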
### Standalone (no chunknorris)

```bash
git clone https://huggingface.co/Wikit/pdf-pages-classifier
cd pdf-pages-classifier
pip install onnxruntime Pillow numpy   # or: openvino Pillow numpy
```

```python
from classifiers import load_classifier

clf = load_classifier(".")  # auto-selects available backend
result = clf.predict("page.png")
```

## Files

| File | Format | Notes |
|------|--------|-------|
| `model.onnx` | ONNX FP32 | Cross-platform CPU/GPU inference |
| `openvino_model.xml/.bin` | OpenVINO INT8 | Fastest CPU inference (QAT) |
| `pytorch_model.bin` | PyTorch | Raw checkpoint; requires `torch` + `timm` |
| `config.json` | JSON | Preprocessing config and class names |
| `classifiers/` | Python | Standalone inference scripts (no chunknorris needed) |

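
To make the relation between installed packages and model files concrete, here is a rough sketch of how a loader could choose a backend. It is not the actual logic of `load_classifier`: `pick_backend` and the preference order (OpenVINO first, since this card describes it as the fastest on CPU) are assumptions; only the file names come from the table above.

```python
import importlib.util

# Hypothetical preference order: (importable package name, model file from the table).
BACKENDS = [
    ("openvino", "openvino_model.xml"),
    ("onnxruntime", "model.onnx"),
]


def pick_backend(is_installed=None):
    """Return (package, model_file) for the first backend that is installed.

    `is_installed` is an optional callable used to check a package name;
    by default it asks importlib whether the package can be found.
    """
    if is_installed is None:
        def is_installed(name):
            return importlib.util.find_spec(name) is not None
    for package, model_file in BACKENDS:
        if is_installed(package):
            return package, model_file
    raise RuntimeError("Install onnxruntime or openvino to run inference.")
```

Calling `pick_backend()` with no arguments checks the current environment; passing a custom callable makes the choice easy to test without installing either package.
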
## Dataset

Trained on [Wikit/PdfVisClassif](https://huggingface.co/datasets/Wikit/PdfVisClassif).