---
library_name: tensorflow
tags:
- eye-gaze-estimation
- tflite
- mobile
- gated-inception
- coordinate-attention
- on-device
- accessibility
license: mit
pipeline_tag: image-classification
---
# 👁️ GazeInception-Lite: Mobile Eye Gaze Estimation
Lightweight TFLite model that estimates where you're looking on a mobile phone screen.
Built with a novel Gated Inception architecture that learns to skip unnecessary computation branches, making it extremely fast for on-device inference.
## ✨ Key Features
| Feature | Details |
|---|---|
| 🔦 Works in the Dark | Trained with illumination perturbation + low-light augmentation (down to 15% brightness) |
| 👓 Glasses Support | Trained with synthetic glasses overlay (10 frame styles, lens reflections) |
| 👁️ Lazy Eye / Strabismus | Dual-eye architecture processes each eye independently with shared weights |
| ⚡ Gated Inception | Learned sigmoid gates skip inactive branches, cutting wasted compute |
| 📱 Mobile-First | 89,754 params (single) / 136,922 params (dual) |
| 🎯 Coordinate Attention | Encodes spatial position for precise iris localization |
## 📊 Performance
### Accuracy
| Model | Screen Error | Inference (CPU) | FPS |
|---|---|---|---|
| Single Eye (F16) | 4.2 mm | 0.59 ms | 1684 |
| Single Eye (INT8) | 4.3 mm | 0.62 ms | 1619 |
| Dual Eye (F16) | 14.2 mm | 1.50 ms | 666 |
| Dual Eye (INT8) | 14.3 mm | 0.93 ms | 1070 |
### Robustness (Dual Eye Model)
| Condition | Screen Error |
|---|---|
| Dark / Low-light | 13.8 mm |
| With Glasses | 13.9 mm |
| Lazy Eye / Strabismus | 13.5 mm |
## 📦 Available Models
| Model | File | Size | Best For |
|---|---|---|---|
| Single Eye F16 | `gaze_inception_lite_single_f16.tflite` | 161 KB | Ultra-low latency, simple apps |
| Single Eye INT8 | `gaze_inception_lite_single_int8.tflite` | 164 KB | Fastest on mobile NPU/DSP |
| Dual Eye F16 | `gaze_inception_lite_dual_f16.tflite` | 242 KB | Best accuracy, lazy eye support |
| Dual Eye INT8 | `gaze_inception_lite_dual_int8.tflite` | 267 KB | Best accuracy + speed combo |
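
To fetch a variant programmatically, one option is `huggingface_hub`; the `repo_id` below is a placeholder for wherever this card is hosted:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo_id -- substitute the actual repository for this model card.
model_path = hf_hub_download(
    repo_id="your-namespace/gaze-inception-lite",
    filename="gaze_inception_lite_single_f16.tflite",
)
print(model_path)  # local cache path, ready for tf.lite.Interpreter
```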
## 🏗️ Architecture
### Gated Inception Block
```
Input
 ├── Branch 1: 1×1 Conv (point features) ──── × gate[0]
 ├── Branch 2: 1×1 → 3×3 DWConv (local)  ──── × gate[1]
 ├── Branch 3: 1×1 → 5×5 DWConv (wide)   ──── × gate[2]
 ├── Branch 4: MaxPool → 1×1 Conv (pool) ──── × gate[3]
 │
 └── Gate Network: GAP → Dense → Sigmoid ──── gate[0..3]

Output: Concat(gated branches)
```
The gate values (0-1 sigmoid) are learned per-sample. For "easy" inputs (centered gaze, good lighting), the network learns to rely on fewer branches. For complex inputs (extreme gaze, dark, glasses), all branches activate. This provides adaptive computation: fast when possible, thorough when needed.
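
A minimal Keras sketch of the block as diagrammed; the branch widths, activations, and the `gated_inception_block` name are assumptions rather than the shipped configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_inception_block(x, branch_filters=16):
    # Branch 1: 1x1 pointwise features
    b1 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 bottleneck -> 3x3 depthwise (local context)
    b2 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    b2 = layers.DepthwiseConv2D(3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 bottleneck -> 5x5 depthwise (wide context)
    b3 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    b3 = layers.DepthwiseConv2D(5, padding="same", activation="relu")(b3)
    # Branch 4: max-pool -> 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(b4)

    # Gate network: GAP -> Dense -> sigmoid, one [0, 1] gate per branch
    g = layers.GlobalAveragePooling2D()(x)
    g = layers.Dense(4, activation="sigmoid")(g)
    g = layers.Reshape((1, 1, 4))(g)  # broadcast each gate over H x W

    branches = [b1, b2, b3, b4]
    gated = [branches[i] * g[..., i:i + 1] for i in range(4)]
    return layers.Concatenate()(gated)
```

As written, the gates only scale the branch outputs; actually skipping work at inference time takes conditional execution, e.g. thresholding a gate and bypassing its branch when it is near zero.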
### Full Pipeline (Dual Eye Model)
```
Left Eye (64×64) ──┐
                   ├── Shared Eye Backbone ────┐
Right Eye (64×64) ─┘   (Gated Inception ×3     ├── Concat → Dense → (x, y)
                        + CoordAttention)      │
Face (64×64) ────── Lightweight CNN ───────────┘
```
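
The `CoordAttention` block in the backbone refers to Coordinate Attention (Hou et al., 2021), which factorizes spatial attention into per-row and per-column weights so positional information survives pooling. A simplified sketch; the reduction ratio is an assumption, and ReLU stands in for the paper's h-swish:

```python
import tensorflow as tf
from tensorflow.keras import layers

def coordinate_attention(x, reduction=8):
    _, h, w, c = x.shape
    pool_h = tf.reduce_mean(x, axis=2, keepdims=True)     # (B, H, 1, C): per-row stats
    pool_w = tf.reduce_mean(x, axis=1, keepdims=True)     # (B, 1, W, C): per-column stats
    pool_w = tf.transpose(pool_w, [0, 2, 1, 3])           # (B, W, 1, C)

    y = tf.concat([pool_h, pool_w], axis=1)               # (B, H+W, 1, C)
    y = layers.Conv2D(max(c // reduction, 8), 1, activation="relu")(y)

    y_h, y_w = tf.split(y, [h, w], axis=1)
    y_w = tf.transpose(y_w, [0, 2, 1, 3])                 # back to (B, 1, W, C')
    a_h = layers.Conv2D(c, 1, activation="sigmoid")(y_h)  # row attention weights
    a_w = layers.Conv2D(c, 1, activation="sigmoid")(y_w)  # column attention weights
    return x * a_h * a_w                                  # reweight features by position
```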
## 🚀 Quick Start (Python)
```python
import tensorflow as tf
import numpy as np

# Load the model
interpreter = tf.lite.Interpreter(model_path="gaze_inception_lite_single_f16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare eye crop (64x64 RGB, normalized to [0, 1]); `frame` is a camera image
eye_crop = preprocess_eye(frame)  # Your eye detection + crop function
eye_input = np.expand_dims(eye_crop, axis=0).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], eye_input)
interpreter.invoke()

# Get screen coordinates: the model outputs normalized (x, y) in [0, 1]
gaze_xy = interpreter.get_tensor(output_details[0]['index'])[0]
screen_width, screen_height = 1080, 2340  # your display resolution in pixels
screen_x = gaze_xy[0] * screen_width   # pixels
screen_y = gaze_xy[1] * screen_height  # pixels
print(f"Looking at: ({screen_x:.0f}, {screen_y:.0f})")
```
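
`preprocess_eye` above is left to the integrator. A hedged sketch using MediaPipe Face Mesh (landmarks 33/133 are the corners of one eye in the canonical Face Mesh topology; the crop margin is a guess):

```python
import cv2
import mediapipe as mp
import numpy as np

# Hypothetical helper, not shipped with the model.
_face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True)

def preprocess_eye(frame_bgr):
    """Return a 64x64 float32 RGB eye crop in [0, 1], or None if no face found."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    res = _face_mesh.process(rgb)
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark
    h, w = rgb.shape[:2]
    # Landmarks 33 / 133 are the outer / inner corners of one eye
    x1, x2 = lm[33].x * w, lm[133].x * w
    cx = int((x1 + x2) / 2)
    cy = int((lm[33].y + lm[133].y) / 2 * h)
    half = int(abs(x2 - x1))  # square crop, roughly 2x the eye width
    crop = rgb[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    crop = cv2.resize(crop, (64, 64)).astype(np.float32) / 255.0
    return crop
```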
### Android (Java/Kotlin)
```kotlin
// loadModelFile() is your helper that memory-maps the bundled .tflite asset;
// screenWidth / screenHeight are the display resolution in pixels.
val interpreter = Interpreter(loadModelFile("gaze_inception_lite_single_int8.tflite"))

val input = Array(1) { Array(64) { Array(64) { FloatArray(3) } } }  // 1 x 64 x 64 x 3
val output = Array(1) { FloatArray(2) }                             // normalized (x, y)

// Fill input with the preprocessed eye crop, then run
interpreter.run(input, output)

val gazeX = output[0][0] * screenWidth
val gazeY = output[0][1] * screenHeight
```
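
The snippet uses the TensorFlow Lite Java API (`org.tensorflow.lite.Interpreter`, shipped in the `org.tensorflow:tensorflow-lite` Maven artifact); `loadModelFile`, `screenWidth`, and `screenHeight` are app-side code you supply.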
## 🔧 Training Details
- Data: 50,000 synthetic samples with comprehensive augmentations
- Augmentations: dark conditions (30%), glasses (25%), lazy eye (15%), sensor noise (50%), illumination perturbation, diverse skin tones (12) and eye colors (7)
- Optimizer: Adam with cosine-decay LR schedule (1e-3 → 1e-5); a minimal sketch follows this list
- Loss: MSE on normalized (x, y) coordinates
- Architecture inspiration:
  - AGE Framework - augmentation pipeline
  - Gated Compression Layers - gating mechanism
  - iTracker/GazeCapture - dual-eye + face architecture
  - Coordinate Attention - spatial attention
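
A sketch of that recipe; `DECAY_STEPS` and the stand-in model are assumptions (the real network is the gated-inception backbone above):

```python
import tensorflow as tf

DECAY_STEPS = 50_000  # hypothetical: total optimizer steps for the run

# Cosine decay from 1e-3 down to 1e-5 (alpha sets the floor as a fraction of the initial LR)
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=DECAY_STEPS,
    alpha=1e-2,  # final LR = 1e-3 * 1e-2 = 1e-5
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Flatten(),                       # stand-in for the real backbone
    tf.keras.layers.Dense(2, activation="sigmoid"),  # normalized (x, y) output
])
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")
```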
## ⚠️ Limitations
- Trained on synthetic data; fine-tuning on real gaze data (GazeCapture, ETH-XGaze) will significantly improve accuracy
- Screen coordinate output assumes front-facing phone camera centered above screen
- Requires separate face/eye detection (use MediaPipe Face Mesh for production)
- Lazy eye support is based on simulated strabismus β clinical validation needed
## 📄 License
MIT License: free for commercial and non-commercial use.