Perception Encoder (PE-Core-B16-224) — LiteRT (TFLite) GPU
On-device LiteRT (.tflite) conversion of
Perception Encoder Core (PE-Core, Meta 2025), the SOTA CLIP-style image tower,
converted from timm/vit_pe_core_base_patch16_224.fb
(ViT-B/16, 94M params; original facebook/PE-Core-B16-224).
A single forward pass turns one RGB image into a 1024-d L2-normalized image
embedding for zero-shot classification, retrieval, and similarity. The model runs
fully on the LiteRT CompiledModel GPU accelerator (ML Drift): all 1028
ops are GPU-native (the log reports `Replacing 1028 out of 1028 node(s) ... LITERT_CL`), with no CPU
fallback and no Flex ops.
Files
| File | Size | Description |
|---|---|---|
| `pe_core_base_224_fp16.tflite` | 187 MB | FP16 single-graph model, GPU full-residency |
| `convert_pecore.py` | - | Reproducible conversion script (timm → tflite) |
I/O
- Input: `[1, 3, 224, 224]` float32, NCHW, RGB normalized to `[-1, 1]`, i.e. `(pixel/255 - 0.5) / 0.5` (timm mean/std = `(0.5, 0.5, 0.5)`). Normalization is applied by the caller (not baked into the graph).
- Output: `[1, 1024]` float32, L2-normalized image embedding.
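The input contract above can be sketched on the host side with NumPy. This is a minimal illustration, not code from this repository: `preprocess` is a hypothetical helper name, and it assumes the image has already been decoded and resized to 224×224 RGB in HWC `uint8` layout.

```python
import numpy as np

def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """HWC uint8 RGB (224, 224, 3) -> [1, 3, 224, 224] float32 in [-1, 1].

    Hypothetical helper; resizing/cropping is assumed to happen beforehand.
    """
    if image_hwc_uint8.shape != (224, 224, 3):
        raise ValueError(f"expected (224, 224, 3), got {image_hwc_uint8.shape}")
    x = image_hwc_uint8.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - 0.5) / 0.5                              # mean/std = 0.5 -> [-1, 1]
    x = np.transpose(x, (2, 0, 1))                   # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...])  # add batch dim -> NCHW
```

On Android the same arithmetic would be applied while filling the NCHW float array that is written to the input buffer.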
Usage (Android, LiteRT CompiledModel)
val model = CompiledModel.create(
context.assets, "pe_core_base_224_fp16.tflite",
CompiledModel.Options(Accelerator.GPU), null
)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
inputs[0].writeFloat(nchwFloatArray) // [1,3,224,224], RGB scaled to [-1,1]
model.run(inputs, outputs)
val embedding = outputs[0].readFloat() // [1024], already L2-normalized
For zero-shot classification, precompute text-label embeddings with the PE-Core text tower offline and take the dot product on device.
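The dot-product step can be sketched as follows. This is a host-side NumPy illustration under stated assumptions, not code from this repository: `rank_labels` is a hypothetical name, and the text embeddings are assumed to be L2-normalized rows of shape `[num_labels, 1024]` produced offline, so each dot product is a cosine similarity.

```python
import numpy as np

def rank_labels(image_embedding, text_embeddings, labels):
    """Rank labels by cosine similarity to one image embedding.

    Assumes both the image embedding and every text embedding row are
    already L2-normalized, so a plain dot product is the cosine similarity.
    """
    image_embedding = np.asarray(image_embedding, dtype=np.float32).reshape(-1)
    text_embeddings = np.asarray(text_embeddings, dtype=np.float32)
    if text_embeddings.shape != (len(labels), image_embedding.shape[0]):
        raise ValueError("text_embeddings must be [num_labels, embedding_dim]")
    scores = text_embeddings @ image_embedding   # one score per label
    order = np.argsort(-scores)                  # highest similarity first
    return [(labels[i], float(scores[i])) for i in order]
```

The first entry of the returned list is the predicted label; the remaining entries give a ranking that can also serve retrieval.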
Performance
- ~66 ms / image steady-state on a Pixel 8a (Mali-G615) GPU (best 12.5 ms), full GPU residency, FP16.
Conversion notes
Converted with litert-torch / ai-edge-torch. Making a RoPE ViT image tower fully GPU-resident and numerically correct on the ML Drift GPU delegate required four verbatim (weights-exact, output corr ≈ 1.0) model-side rewrites — the first three for residency, the last for on-device numerical correctness:
- Fused-qkv → 4D manual attention: the fused `qkv` reshape emits a 5D head-split the GPU delegate rejects; decompose into separate q/k/v projections. Self-attention uses `scaled_dot_product_attention`, whose lowering keeps the batch-matmul 3D with a materialized transpose (both required for residency).
- Interleaved 2D-RoPE → rotate-half: PE-Core's interleaved rotary uses a strided `x[..., ::2]` that lowers to `GATHER_ND` (GPU-banned). Bake an even→odd channel permutation into the q/k weights (preserves q·k exactly) and apply the rotate-half form with constant cos/sin, giving clean `MUL`/`ADD`/`SLICE`/`CONCAT`.
- Attention-pool single-query attention → broadcast-multiply + reduce-sum: the pooling query is a constant latent, so a batch-matmul there is `const @ non-const` (rejected at compile, and the reordered `const-RHS` form is mis-computed on device); expressing it as `(q·k).sum` + softmax + `(attn·v).sum` is exact and GPU-correct.
- Overflow-safe LayerNorm: the delegate computes the LayerNorm variance reduction in fp16 even for an fp32 graph; deep-ViT "massive activations" (|x|~50+) make `sum((x-mean)²)` exceed fp16 max (65504), so the normalization is wrong and the error compounds with depth (output correlation collapses to ~0.28 over 12 blocks while still reporting full GPU residency). Scaling by 1/32 before squaring (undone after) keeps the running sum in range and is mathematically identical to `nn.LayerNorm`.
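The overflow-safe LayerNorm idea can be illustrated numerically. The sketch below is not taken from `convert_pecore.py`; it only emulates an fp16 variance reduction in NumPy as a stand-in for the delegate's behavior. The function names, the synthetic 768-channel input, and the omission of the affine weight/bias are all assumptions made for illustration.

```python
import numpy as np

def layer_norm_reference(x, eps=1e-5):
    """Plain LayerNorm (no affine), computed in float64 as ground truth."""
    x = np.asarray(x, dtype=np.float64)
    d = x - x.mean()
    return d / np.sqrt((d * d).mean() + eps)

def layer_norm_fp16_reduction(x, pre_scale=1.0, eps=1e-5):
    """LayerNorm whose sum of squares is returned in fp16 (emulated).

    pre_scale multiplies (x - mean) before squaring. The variance is then
    scaled by pre_scale**2, so dividing the scaled deviation by the scaled
    standard deviation undoes the scaling exactly.
    """
    x = np.asarray(x, dtype=np.float32)
    d = (x - x.mean()) * np.float32(pre_scale)
    with np.errstate(over="ignore"):
        # Result is fp16: becomes inf if the sum exceeds 65504.
        sq_sum = np.sum((d * d).astype(np.float16), dtype=np.float16)
    var_scaled = np.float32(sq_sum) / np.float32(x.size)
    return d / np.sqrt(var_scaled + np.float32(eps * pre_scale ** 2))

# Synthetic activations: moderate values plus one large outlier channel.
x = np.linspace(-20.0, 20.0, 768)
x[5] = 60.0

ref = layer_norm_reference(x)
naive = layer_norm_fp16_reduction(x, pre_scale=1.0)       # sum overflows
safe = layer_norm_fp16_reduction(x, pre_scale=1.0 / 32)   # sum stays small
```

In this emulation the unscaled sum of squares is above the fp16 maximum, so the variance becomes infinite and the naive output degenerates, while the pre-scaled variant stays close to the float64 reference.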
Verified on a Pixel 8a GPU: zero banned ops, zero >4D tensors, full residency, and TFLite(GPU)-vs-PyTorch output correlation = 1.0 (the on-device GPU result — not just the host CPU result — matches the reference).
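A check of this kind could look like the sketch below. The card does not state how the correlation was computed, so this assumes a Pearson correlation between the flattened reference and on-device embeddings, with cosine similarity as a complementary measure; both function names are hypothetical.

```python
import numpy as np

def output_correlation(reference, candidate):
    """Pearson correlation between two flattened embedding vectors."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError("embeddings must have the same number of elements")
    return float(np.corrcoef(ref, cand)[0, 1])

def cosine_similarity(reference, candidate):
    """Cosine similarity; for L2-normalized embeddings this is the dot product."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    return float(ref @ cand / (np.linalg.norm(ref) * np.linalg.norm(cand)))
```

Comparing the embedding read back from the device (not only a host CPU run of the same `.tflite`) against the PyTorch output is what catches failures that full GPU residency alone does not reveal.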
Training data & PII
PE-Core was pretrained by Meta on a large-scale web-crawled image–text dataset
(billions of image–caption pairs, CLIP-style contrastive objective). No new
training was performed for this conversion — it is a weights-exact format change
of the public timm/facebook checkpoint. Because the source data is
web-scraped, it may incidentally contain people, faces, text, and other PII;
no PII was deliberately collected, and this conversion adds none. Users deploying
the encoder should apply their own content/PII filtering as appropriate. See the
original PE model card and
paper for full dataset details.
License & attribution
- Apache-2.0 (original PE-Core / timm checkpoint).
- This is a format conversion; all credit to the original authors (Meta / FAIR).