---
library_name: tensorflow
tags:
- eye-gaze-estimation
- tflite
- mobile
- gated-inception
- coordinate-attention
- on-device
- accessibility
license: mit
pipeline_tag: image-classification
---

# GazeInception-Lite: Mobile Eye Gaze Estimation

**A lightweight TFLite model that estimates where the user is looking on a mobile phone screen.**

Built on a novel **Gated Inception** architecture that learns to skip unnecessary computation branches, making it extremely fast for on-device inference.

## Key Features

| Feature | Details |
|---------|---------|
| **Works in the Dark** | Trained with illumination perturbation + low-light augmentation (down to 15% brightness) |
| **Glasses Support** | Trained with synthetic glasses overlays (10 frame styles, lens reflections) |
| **Lazy Eye / Strabismus** | Dual-eye architecture processes each eye independently with shared weights |
| **Gated Inception** | Learned sigmoid gates skip inactive branches, reducing wasted compute |
| **Mobile-First** | 89,754 params (single) / 136,922 params (dual) |
| **Coordinate Attention** | Encodes spatial position for precise iris localization |

## Performance

### Accuracy

| Model | Screen Error | Inference (CPU) | FPS |
|-------|--------------|-----------------|-----|
| Single Eye (F16) | 4.2 mm | 0.59 ms | 1684 |
| Single Eye (INT8) | 4.3 mm | 0.62 ms | 1619 |
| Dual Eye (F16) | 14.2 mm | 1.50 ms | 666 |
| Dual Eye (INT8) | 14.3 mm | 0.93 ms | 1070 |

### Robustness (Dual Eye Model)

| Condition | Screen Error |
|-----------|--------------|
| Dark / Low-light | 13.8 mm |
| With Glasses | 13.9 mm |
| Lazy Eye / Strabismus | 13.5 mm |

## Available Models

| Model | File | Size | Best For |
|-------|------|------|----------|
| Single Eye F16 | `gaze_inception_lite_single_f16.tflite` | 161 KB | Ultra-low latency, simple apps |
| Single Eye INT8 | `gaze_inception_lite_single_int8.tflite` | 164 KB | Fastest on mobile NPU/DSP |
| Dual Eye F16 | `gaze_inception_lite_dual_f16.tflite` | 242 KB | Robustness (dark, glasses, strabismus) |
| Dual Eye INT8 | `gaze_inception_lite_dual_int8.tflite` | 267 KB | Robustness + speed combo |

## Architecture

### Gated Inception Block
```
Input ─┬─ Branch 1: 1×1 Conv (point features) ── × gate[0] ─┐
       ├─ Branch 2: 1×1 → 3×3 DWConv (local) ─── × gate[1] ─┤
       ├─ Branch 3: 1×1 → 5×5 DWConv (wide) ──── × gate[2] ─┼─ Output: Concat(gated branches)
       ├─ Branch 4: MaxPool → 1×1 Conv (pool) ── × gate[3] ─┘
       └─ Gate Network: GAP → Dense → Sigmoid ── gate[0..3]
```

The **gate values** (sigmoid outputs in [0, 1]) are learned per sample. For "easy" inputs (centered gaze, good lighting), the network learns to rely on fewer branches; for complex inputs (extreme gaze, darkness, glasses), all branches activate. This provides **adaptive computation**: fast when possible, thorough when needed.

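To make the mechanism concrete, here is a minimal NumPy sketch of a gated block. The shapes, the stand-in branches, and the skip threshold `tau` are illustrative assumptions, not the exact trained layers (the real model implements the branches as TF convolutions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_inception(x, branches, W_gate, b_gate, tau=0.05):
    """Gated inception over a feature map x of shape (H, W, C).

    branches : callables, each mapping (H, W, C) -> (H, W, C) in this sketch
    W_gate   : (C, num_branches) dense weights of the gate network
    tau      : gates below this threshold skip the branch entirely
    """
    # Gate network: global average pool -> Dense -> sigmoid (one gate per branch)
    pooled = x.mean(axis=(0, 1))               # (C,)
    gates = sigmoid(pooled @ W_gate + b_gate)  # (num_branches,)

    outputs = []
    for gate, branch in zip(gates, branches):
        if gate < tau:
            # Near-zero gate: the contribution is negligible, so emit zeros
            # without paying for the branch computation at all.
            outputs.append(np.zeros_like(x))
        else:
            outputs.append(gate * branch(x))
    return np.concatenate(outputs, axis=-1)

# Toy usage: two stand-in branches on an 8x8x4 feature map
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))
branches = [lambda t: t, lambda t: 0.5 * t]
out = gated_inception(x, branches, rng.standard_normal((4, 2)), np.zeros(2))
print(out.shape)  # (8, 8, 8): two gated branches concatenated
```

The explicit `tau` skip is what turns the learned gates into actual saved compute at inference time.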

### Full Pipeline (Dual Eye Model)
```
Left Eye (64×64) ──┬─ Shared Eye Backbone ────────┐
Right Eye (64×64) ─┘  (Gated Inception ×3         ├─ Concat → Dense → (x, y)
                       + CoordAttention)          │
Face (64×64) ────── Lightweight CNN ──────────────┘
```
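Coordinate attention, used in the eye backbone above, pools along each spatial axis separately so the attention weights retain row/column position information, which helps pin down the iris. A simplified NumPy sketch of the idea; it omits the shared bottleneck transform of the original paper, and the `W_h`/`W_w` weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coordinate_attention(x, W_h, W_w):
    """Simplified coordinate attention for a feature map x of shape (H, W, C).

    W_h, W_w: (C, C) weights for the per-direction transforms. The original
    paper inserts a shared bottleneck conv first; this sketch skips it.
    """
    # Unlike global average pooling, pool each axis separately so the
    # resulting weights still know *where* along rows/columns they apply.
    pool_h = x.mean(axis=1)        # (H, C): one descriptor per row
    pool_w = x.mean(axis=0)        # (W, C): one descriptor per column
    a_h = sigmoid(pool_h @ W_h)    # (H, C) row attention
    a_w = sigmoid(pool_w @ W_w)    # (W, C) column attention
    # Position (i, j) is rescaled by row weight i and column weight j
    return x * a_h[:, None, :] * a_w[None, :, :]
```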

## Quick Start (Python)

```python
import tensorflow as tf
import numpy as np

# Load model
interpreter = tf.lite.Interpreter(model_path="gaze_inception_lite_single_f16.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare eye crop (64x64 RGB, normalized to [0, 1])
eye_crop = preprocess_eye(frame)  # your eye detection + crop function
eye_input = np.expand_dims(eye_crop, axis=0).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], eye_input)
interpreter.invoke()

# Output is a normalized (x, y) screen position in [0, 1]
gaze_xy = interpreter.get_tensor(output_details[0]['index'])[0]
screen_x = gaze_xy[0] * screen_width   # pixels
screen_y = gaze_xy[1] * screen_height  # pixels
print(f"Looking at: ({screen_x:.0f}, {screen_y:.0f})")
```
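The snippet above assumes the float16 model, which takes float32 input. The INT8 exports may instead declare an integer input tensor; TFLite reports each input's `(scale, zero_point)` in its details dict, so a small helper can adapt to either. A hedged sketch — the actual quantization parameters depend on how the model was converted, so always read them from `get_input_details()`:

```python
import numpy as np

def prepare_input(eye_crop, input_detail):
    """Batch a [0, 1] float eye crop and cast it to what the model expects.

    input_detail is one entry from interpreter.get_input_details().
    """
    x = np.expand_dims(eye_crop, axis=0).astype(np.float32)
    if np.issubdtype(input_detail["dtype"], np.integer):
        # TFLite's affine scheme: real_value = scale * (quantized - zero_point)
        scale, zero_point = input_detail["quantization"]
        x = np.round(x / scale + zero_point).astype(input_detail["dtype"])
    return x
```

With this helper, replacing the `eye_input = ...` line above with `eye_input = prepare_input(eye_crop, input_details[0])` lets the same script drive both the F16 and INT8 files.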

### Android (Java/Kotlin)
```kotlin
val interpreter = Interpreter(loadModelFile("gaze_inception_lite_single_int8.tflite"))
val input = Array(1) { Array(64) { Array(64) { FloatArray(3) } } }
val output = Array(1) { FloatArray(2) }

// Fill input with a preprocessed eye crop (64×64 RGB, values in [0, 1])
interpreter.run(input, output)

val gazeX = output[0][0] * screenWidth
val gazeY = output[0][1] * screenHeight
```

## Training Details

- **Data**: 50,000 synthetic samples with comprehensive augmentations
- **Augmentations**: dark conditions (30%), glasses (25%), lazy eye (15%), sensor noise (50%), illumination perturbation, diverse skin tones (12), eye colors (7)
- **Optimizer**: Adam with cosine-decay LR (1e-3 → 1e-5)
- **Loss**: MSE on normalized (x, y) coordinates
- **Architecture Inspiration**:
  - [AGE Framework](https://arxiv.org/abs/2603.26945) - augmentation pipeline
  - [Gated Compression Layers](https://arxiv.org/abs/2303.08970) - gating mechanism
  - [iTracker/GazeCapture](https://arxiv.org/abs/1606.05814) - dual-eye + face architecture
  - [Coordinate Attention](https://arxiv.org/abs/2103.02907) - spatial attention

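As an illustration of the augmentation style, here is a hedged sketch of a low-light transform in the spirit of the settings above. The 15% floor matches the brightness figure quoted earlier; the Gaussian noise level is an invented stand-in, not the actual training recipe:

```python
import numpy as np

def low_light_augment(image, rng, min_brightness=0.15):
    """Darken an image with values in [0, 1] to simulate low-light capture.

    min_brightness mirrors the "down to 15% brightness" setting above;
    the Gaussian term is an illustrative stand-in for sensor noise.
    """
    factor = rng.uniform(min_brightness, 1.0)   # random global dimming
    noisy = image * factor + rng.normal(0.0, 0.02, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)
```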
## Limitations

- Trained on **synthetic data**; fine-tuning on real gaze data (GazeCapture, ETH-XGaze) will significantly improve accuracy
- Screen-coordinate output assumes a front-facing phone camera centered above the screen
- Requires separate face/eye detection (use MediaPipe Face Mesh in production)
- Lazy-eye support is based on simulated strabismus; clinical validation is needed

## License

MIT License: free for commercial and non-commercial use.