---
library_name: tensorflow
tags:
- eye-gaze-estimation
- tflite
- mobile
- gated-inception
- coordinate-attention
- on-device
- accessibility
license: mit
pipeline_tag: image-classification
---

# 👁️ GazeInception-Lite: Mobile Eye Gaze Estimation

**Lightweight TFLite model that estimates where you're looking on a mobile phone screen.**

Built with a novel **Gated Inception** architecture that learns to skip unnecessary computation branches, making it fast enough for real-time on-device inference.

## ✨ Key Features

| Feature | Details |
|---------|---------|
| 🔦 **Works in Dark** | Trained with illumination perturbation + low-light augmentation (down to 15% brightness) |
| 👓 **Glasses Support** | Trained with synthetic glasses overlays (10 frame styles, lens reflections) |
| 👁️ **Lazy Eye / Strabismus** | Dual-eye architecture processes each eye independently with shared weights |
| ⚡ **Gated Inception** | Learned sigmoid gates suppress unhelpful branches → less wasted compute |
| 📱 **Mobile-First** | 89,754 params (single) / 136,922 params (dual) |
| 🎯 **Coordinate Attention** | Encodes spatial position for precise iris localization |

## 📊 Performance

### Accuracy

| Model | Screen Error | Inference (CPU) | FPS |
|-------|--------------|-----------------|-----|
| Single Eye (F16) | 4.2 mm | 0.59 ms | 1684 |
| Single Eye (INT8) | 4.3 mm | 0.62 ms | 1619 |
| Dual Eye (F16) | 14.2 mm | 1.50 ms | 666 |
| Dual Eye (INT8) | 14.3 mm | 0.93 ms | 1070 |

### Robustness (Dual Eye Model)

| Condition | Screen Error |
|-----------|--------------|
| Dark / Low-light | 13.8 mm |
| With Glasses | 13.9 mm |
| Lazy Eye / Strabismus | 13.5 mm |

## 📦 Available Models

| Model | File | Size | Best For |
|-------|------|------|----------|
| Single Eye F16 | `gaze_inception_lite_single_f16.tflite` | 161 KB | Ultra-low latency, simple apps |
| Single Eye INT8 | `gaze_inception_lite_single_int8.tflite` | 164 KB | Fastest on mobile NPU/DSP |
| Dual Eye F16 | `gaze_inception_lite_dual_f16.tflite` | 242 KB | Best accuracy, lazy-eye support |
| Dual Eye INT8 | `gaze_inception_lite_dual_int8.tflite` | 267 KB | Best accuracy + speed combo |

## 🏗️ Architecture

### Gated Inception Block

```
Input
  ├── Branch 1: 1×1 Conv (point features) ──── × gate[0]
  ├── Branch 2: 1×1 → 3×3 DWConv (local) ───── × gate[1]
  ├── Branch 3: 1×1 → 5×5 DWConv (wide) ────── × gate[2]
  └── Branch 4: MaxPool → 1×1 Conv (pool) ──── × gate[3]
        │                                          │
  Gate Network: GAP → Dense → Sigmoid ─────────────┘
        │
  Output: Concat(gated branches) ◄─────────────────┘
```

The **gate values** (0–1 sigmoid outputs) are learned per sample. For "easy" inputs (centered gaze, good lighting), the network learns to rely on fewer branches; for complex inputs (extreme gaze angles, low light, glasses), all branches activate. The result is **adaptive computation**: fast when possible, thorough when needed.
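For prototyping outside TFLite, here is a minimal tf.keras sketch of the block above. The branch layout and the GAP → Dense → Sigmoid gate follow the diagram; the branch widths, activations, and pooling stride are illustrative assumptions, not the released model's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_inception_block(x, filters):
    """Gated Inception block sketch: four branches scaled by learned,
    per-sample sigmoid gates, then concatenated."""
    f = filters // 4  # assumption: even filter split across branches

    # Branch 1: 1x1 conv (pointwise features)
    b1 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduce -> 3x3 depthwise conv (local context)
    b2 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    b2 = layers.DepthwiseConv2D(3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduce -> 5x5 depthwise conv (wide context)
    b3 = layers.Conv2D(f, 1, padding="same", activation="relu")(x)
    b3 = layers.DepthwiseConv2D(5, padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pool -> 1x1 conv
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(f, 1, padding="same", activation="relu")(b4)

    # Gate network: GAP -> Dense -> Sigmoid, one gate per branch per sample
    gates = layers.GlobalAveragePooling2D()(x)
    gates = layers.Dense(4, activation="sigmoid")(gates)  # shape (batch, 4)

    # Scale each branch by its gate (broadcast over H, W, C), then concat
    gated = []
    for i, branch in enumerate([b1, b2, b3, b4]):
        g_i = layers.Reshape((1, 1, 1))(gates[:, i:i + 1])
        gated.append(branch * g_i)
    return layers.Concatenate()(gated)

# Toy usage: one block on a 64x64 eye crop
inputs = tf.keras.Input(shape=(64, 64, 3))
outputs = gated_inception_block(inputs, 32)
toy = tf.keras.Model(inputs, outputs)
```

Because the gates are plain multipliers, training can drive an unhelpful branch's gate toward zero; that is the sense in which inactive branches are "skipped".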
### Full Pipeline (Dual Eye Model)

```
Left Eye (64×64) ───┐
                    ├── Shared Eye Backbone ─────┐
Right Eye (64×64) ──┘   (Gated Inception ×3      │
                         + CoordAttention)       ├── Concat → Dense → (x,y)
                                                 │
Face (64×64) ────────── Lightweight CNN ─────────┘
```

## 🚀 Quick Start (Python)

```python
import tensorflow as tf
import numpy as np

# Load model
interpreter = tf.lite.Interpreter(model_path="gaze_inception_lite_single_f16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare eye crop (64x64 RGB, normalized to [0, 1])
eye_crop = preprocess_eye(frame)  # Your eye detection + crop function (see sketch below)
eye_input = np.expand_dims(eye_crop, axis=0).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], eye_input)
interpreter.invoke()

# Get screen coordinates
gaze_xy = interpreter.get_tensor(output_details[0]['index'])[0]
screen_x = gaze_xy[0] * screen_width   # pixels
screen_y = gaze_xy[1] * screen_height  # pixels
print(f"Looking at: ({screen_x:.0f}, {screen_y:.0f})")
```

### Android (Java/Kotlin)

```kotlin
val interpreter = Interpreter(loadModelFile("gaze_inception_lite_single_int8.tflite"))
val input = Array(1) { Array(64) { Array(64) { FloatArray(3) } } }
val output = Array(1) { FloatArray(2) }

// Fill input with a preprocessed eye crop, then run inference
interpreter.run(input, output)

val gazeX = output[0][0] * screenWidth
val gazeY = output[0][1] * screenHeight
```

## 🔧 Training Details

- **Data**: 50,000 synthetic samples with comprehensive augmentations
- **Augmentations**: dark conditions (30%), glasses (25%), lazy eye (15%), sensor noise (50%), illumination perturbation, diverse skin tones (12), eye colors (7)
- **Optimizer**: Adam with cosine-decay learning rate (1e-3 → 1e-5); see the sketch at the end of this card
- **Loss**: MSE on normalized (x, y) coordinates
- **Architecture inspiration**:
  - [AGE Framework](https://arxiv.org/abs/2603.26945) - augmentation pipeline
  - [Gated Compression Layers](https://arxiv.org/abs/2303.08970) - gating mechanism
  - [iTracker/GazeCapture](https://arxiv.org/abs/1606.05814) - dual-eye + face architecture
  - [Coordinate Attention](https://arxiv.org/abs/2103.02907) - spatial attention

## ⚠️ Limitations

- Trained on **synthetic data**; fine-tuning on real gaze data (GazeCapture, ETH-XGaze) will significantly improve accuracy
- Screen-coordinate output assumes a front-facing phone camera centered above the screen
- Requires separate face/eye detection (use MediaPipe Face Mesh for production; a minimal cropping sketch follows below)
- Lazy-eye support is based on simulated strabismus; clinical validation is needed

## 📝 License

MIT License. Free for commercial and non-commercial use.
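## 🧩 Example: Eye-Crop Preprocessing (Sketch)

The quick-start snippet calls a `preprocess_eye` helper it does not define. Below is one hedged way to build 64×64 eye crops with MediaPipe Face Mesh. The helper name, the 1.8× padding factor, and the fixed landmark indices (33/133 for the left-eye corners, 362/263 for the right) are illustrative choices, not part of this model's release.

```python
import cv2
import mediapipe as mp
import numpy as np

def crop_eye(frame_rgb, landmarks, outer_idx, inner_idx, scale=1.8, size=64):
    """Crop a square patch around one eye and resize it to size x size."""
    h, w, _ = frame_rgb.shape
    outer = np.array([landmarks[outer_idx].x * w, landmarks[outer_idx].y * h])
    inner = np.array([landmarks[inner_idx].x * w, landmarks[inner_idx].y * h])
    center = (outer + inner) / 2
    half = max(int(np.linalg.norm(outer - inner) * scale / 2), 8)
    x0, y0 = int(center[0]) - half, int(center[1]) - half
    patch = frame_rgb[max(y0, 0):y0 + 2 * half, max(x0, 0):x0 + 2 * half]
    patch = cv2.resize(patch, (size, size))
    return patch.astype(np.float32) / 255.0  # [0, 1], as the model expects

with mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True) as mesh:
    frame_bgr = cv2.imread("frame.jpg")  # or a camera frame
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = mesh.process(frame_rgb)
    if results.multi_face_landmarks:
        lm = results.multi_face_landmarks[0].landmark
        left_eye = crop_eye(frame_rgb, lm, 33, 133)    # single-eye model input
        right_eye = crop_eye(frame_rgb, lm, 362, 263)  # second input of dual model
```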
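## 🧪 Example: Fine-Tuning Recipe (Sketch)

The limitations above recommend fine-tuning on real gaze data. The released files are TFLite, so this sketch only reproduces the recipe from the training details (Adam, cosine-decayed LR from 1e-3 to 1e-5, MSE on normalized coordinates) on a hypothetical stand-in Keras model; `TOTAL_STEPS` and `train_ds` are placeholders you would supply.

```python
import tensorflow as tf

TOTAL_STEPS = 10_000  # placeholder: dataset size x epochs / batch size

# Cosine decay from 1e-3 down to 1e-3 * alpha = 1e-5
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=TOTAL_STEPS,
    alpha=1e-2,
)

# Stand-in model; substitute your own gaze network here
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="sigmoid"),  # normalized (x, y)
])
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")

# model.fit(train_ds, ...)  # train_ds yields (eye_crop, xy_normalized) pairs
```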