Perception Encoder (PE-Core-B16-224) — LiteRT (TFLite) GPU

On-device LiteRT (.tflite) conversion of the image tower of Perception Encoder Core (PE-Core, Meta, 2025), a state-of-the-art CLIP-style encoder. Converted from timm/vit_pe_core_base_patch16_224.fb (ViT-B/16, 94M params; original checkpoint facebook/PE-Core-B16-224).

A single forward pass turns one RGB image into a 1024-d L2-normalized image embedding for zero-shot classification, retrieval, and similarity. The model runs fully on the LiteRT CompiledModel GPU accelerator (ML Drift): all 1028 ops are GPU-native (the delegate log reports "Replacing 1028 out of 1028 node(s) ... LITERT_CL"), with no CPU fallback and no Flex ops.

Files

| File | Size | Description |
| --- | --- | --- |
| `pe_core_base_224_fp16.tflite` | 187 MB | FP16 single-graph model, GPU full-residency |
| `convert_pecore.py` | n/a | Reproducible conversion script (timm → tflite) |

I/O

Input: [1, 3, 224, 224] float32, NCHW, RGB normalized to [-1, 1], i.e. (pixel/255 - 0.5) / 0.5 (timm mean/std = (0.5, 0.5, 0.5)). The caller applies the normalization; it is not baked into the graph.
Output: [1, 1024] float32, L2-normalized image embedding.
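
The input layout and normalization described above can be sketched as follows. This is a minimal illustration in plain Python, assuming the image has already been resized to the model resolution and is given as rows of (R, G, B) byte triples; the function name `to_nchw_normalized` is hypothetical and is not taken from the conversion script.

```python
def to_nchw_normalized(pixels_hwc):
    """Flatten an H x W x 3 image of 0..255 RGB values into NCHW float order.

    Each value is mapped to [-1, 1] with (pixel/255 - 0.5) / 0.5, which
    corresponds to a mean/std of (0.5, 0.5, 0.5).
    """
    height = len(pixels_hwc)
    width = len(pixels_hwc[0])
    out = []
    for c in range(3):                  # channel-major: all R, then all G, then all B
        for y in range(height):
            for x in range(width):
                out.append((pixels_hwc[y][x][c] / 255.0 - 0.5) / 0.5)
    return out                          # length 3 * H * W; the batch dimension of 1 is implied


# Tiny 2x2 example for illustration (a real input would be 224x224).
image = [[(0, 128, 255), (255, 255, 255)],
         [(0, 0, 0), (255, 0, 0)]]
flat = to_nchw_normalized(image)
```

On Android the same arithmetic would be applied to the decoded bitmap before the resulting float array is written into the input buffer.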

Usage (Android, LiteRT CompiledModel)

val model = CompiledModel.create(
    context.assets, "pe_core_base_224_fp16.tflite",
    CompiledModel.Options(Accelerator.GPU), null
)
val inputs = model.createInputBuffers()
val outputs = model.createOutputBuffers()
inputs[0].writeFloat(nchwFloatArray)        // [1,3,224,224], RGB scaled to [-1,1]
model.run(inputs, outputs)
val embedding = outputs[0].readFloat()      // [1024], already L2-normalized

For zero-shot classification, precompute text-label embeddings with the PE-Core text tower offline and take the dot product on device.
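
The dot-product step can be sketched as below. This is only an illustration: the labels and the 3-d vectors are made-up stand-ins for real 1024-d embeddings, the helper names `l2_normalize` and `zero_shot_scores` are hypothetical, and the sketch assumes the precomputed text embeddings are L2-normalized so that each dot product is a cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def zero_shot_scores(image_embedding, label_embeddings):
    """Return {label: dot product} between one image embedding and each label embedding."""
    return {
        label: sum(a * b for a, b in zip(image_embedding, text_embedding))
        for label, text_embedding in label_embeddings.items()
    }


# Toy 3-d stand-ins for the real 1024-d embeddings (values are made up).
label_embeddings = {
    "cat": l2_normalize([1.0, 0.0, 0.0]),
    "dog": l2_normalize([0.0, 1.0, 0.0]),
}
image_embedding = l2_normalize([0.9, 0.1, 0.0])

scores = zero_shot_scores(image_embedding, label_embeddings)
best_label = max(scores, key=scores.get)   # label with the highest similarity
```

Because the label embeddings are fixed, only the image tower has to run on device; the comparison itself is one dot product per label.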

Performance

About 66 ms per image at steady state on a Pixel 8a GPU (Mali-G715), best 12.5 ms, with full GPU residency in FP16.

Conversion notes

Converted with litert-torch / ai-edge-torch. Making a RoPE ViT image tower fully GPU-resident and numerically correct on the ML Drift GPU delegate required four model-side rewrites that preserve the model's function (weights-exact, output correlation ≈ 1.0). The first three are needed for residency, the last for on-device numerical correctness:

1. Fused-qkv → 4D manual attention: the fused qkv reshape emits a 5D head-split that the GPU delegate rejects, so it is decomposed into separate q/k/v projections. Self-attention uses scaled_dot_product_attention, whose lowering keeps the batch-matmul 3D with a materialized transpose (both are required for residency).
2. Interleaved 2D-RoPE → rotate-half: PE-Core's interleaved rotary uses a strided x[..., ::2] that lowers to GATHER_ND, which is banned on the GPU. The rewrite bakes an even→odd channel permutation into the q/k weights (this preserves q·k exactly) and applies the rotate-half form with constant cos/sin, which lowers to plain MUL/ADD/SLICE/CONCAT.
3. Attention-pool single-query attention → broadcast-multiply + reduce-sum: the pooling query is a constant latent, so a batch-matmul there is const @ non-const, which is rejected at compile time, and the reordered const-RHS form is mis-computed on device. Expressing it as (q·k).sum + softmax + (attn·v).sum is exact and GPU-correct.
4. Overflow-safe LayerNorm: the delegate computes the LayerNorm variance reduction in fp16 even for an fp32 graph. Deep-ViT "massive activations" (|x|~50+) make sum((x-mean)²) exceed the fp16 maximum (65504), so the normalization is wrong and the error compounds with depth: output correlation collapses to ~0.28 over 12 blocks while the delegate still reports full GPU residency. Scaling by 1/32 before squaring (undone afterwards) keeps the running sum in range and is mathematically identical to nn.LayerNorm.
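
The scale-before-squaring idea behind the overflow-safe LayerNorm can be illustrated with a short sketch. It is not the code from the conversion script: the function names, the eps value, the omission of the affine weight and bias, and the toy activation vector are all assumptions made for illustration. The arithmetic runs in ordinary Python floats, so it shows the magnitudes involved rather than reproducing fp16 behaviour.

```python
import math

FP16_MAX = 65504.0  # largest finite fp16 value


def layer_norm_reference(x, eps=1e-5):
    """Textbook LayerNorm (no affine weight/bias); also returns the raw sum of squares."""
    mean = sum(x) / len(x)
    sum_sq = sum((v - mean) ** 2 for v in x)
    var = sum_sq / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x], sum_sq


def layer_norm_scaled(x, eps=1e-5, scale=1.0 / 32.0):
    """Same result, but the centered values are scaled down before squaring."""
    mean = sum(x) / len(x)
    centered_scaled = [(v - mean) * scale for v in x]
    # The accumulated sum is scale**2 (1/1024 here) times the unscaled one.
    sum_sq_scaled = sum(c * c for c in centered_scaled)
    # Undo the scaling after the reduction to recover the true variance.
    var = sum_sq_scaled / len(x) / (scale * scale)
    return [(v - mean) / math.sqrt(var + eps) for v in x], sum_sq_scaled


# Made-up 768-channel token with a few large activations.
x = [60.0 if i % 24 == 0 else 0.25 for i in range(768)]
ref_out, ref_sum_sq = layer_norm_reference(x)
safe_out, safe_sum_sq = layer_norm_scaled(x)
max_abs_diff = max(abs(a - b) for a, b in zip(ref_out, safe_out))
```

For this toy input the unscaled sum of squares is above 65504 while the scaled sum stays far below it, and the two normalized outputs agree. A power-of-two scale such as 1/32 only shifts the floating-point exponent, which is one plausible reason for choosing it.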

Verified on a Pixel 8a GPU: zero banned ops, zero >4D tensors, full residency, and a TFLite (GPU) vs. PyTorch output correlation of 1.0. The on-device GPU result, not just the host CPU result, matches the reference.
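
A check of this kind can be sketched as follows, assuming the reported figure is a Pearson correlation between the PyTorch embedding and the embedding read back from the device (the card does not state which correlation measure was used). The function name `pearson_correlation` and all vectors are made up for illustration; real embeddings would have 1024 entries.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up 5-d stand-ins for the reference and on-device embeddings.
reference = [0.12, -0.40, 0.33, 0.05, -0.27]
device_ok = [0.1201, -0.3999, 0.3302, 0.0499, -0.2701]   # small rounding-level differences
device_bad = [0.30, 0.10, -0.35, 0.20, -0.05]            # unrelated values, as after a numerical failure

corr_ok = pearson_correlation(reference, device_ok)
corr_bad = pearson_correlation(reference, device_bad)
```

Comparing against the output read back from the device matters because, as described above, a graph can report full GPU residency while still producing wrong values.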

Training data & PII

PE-Core was pretrained by Meta on a large-scale web-crawled image–text dataset (billions of image–caption pairs, CLIP-style contrastive objective). No new training was performed for this conversion; it is a weights-exact format change of the public timm/facebook checkpoint. Because the source data is web-scraped, it may incidentally contain people, faces, text, and other PII. No PII was deliberately collected, and this conversion adds none. Users deploying the encoder should apply their own content/PII filtering as appropriate. See the original PE model card and paper for full dataset details.

License & attribution

Apache-2.0 (original PE-Core / timm checkpoint).
This is a format conversion; all credit to the original authors (Meta / FAIR).

Paper

Perception Encoder: The best visual embeddings are not at the output of the network (arXiv 2504.13181, published Apr 17, 2025)