---
library_name: tensorflow
tags:
- eye-gaze-estimation
- tflite
- mobile
- gated-inception
- coordinate-attention
- on-device
- accessibility
license: mit
pipeline_tag: image-classification
---
# πŸ‘οΈ GazeInception-Lite: Mobile Eye Gaze Estimation
**Lightweight TFLite model that estimates where you're looking on a mobile phone screen.**
Built with a novel **Gated Inception** architecture that learns to skip unnecessary computation branches, making it extremely fast for on-device inference.
## ✨ Key Features
| Feature | Details |
|---------|---------|
| πŸ”¦ **Works in Dark** | Trained with illumination perturbation + low-light augmentation (down to 15% brightness) |
| πŸ‘“ **Glasses Support** | Trained with synthetic glasses overlay (10 frame styles, lens reflections) |
| πŸ‘οΈ **Lazy Eye / Strabismus** | Dual-eye architecture processes each eye independently with shared weights |
| ⚑ **Gated Inception** | Learned sigmoid gates skip inactive branches β†’ reduces useless compute |
| πŸ“± **Mobile-First** | 89,754 params (single) / 136,922 params (dual) |
| 🎯 **Coordinate Attention** | Encodes spatial position for precise iris localization |
## πŸ“Š Performance
### Accuracy
| Model | Screen Error | Inference (CPU) | FPS |
|-------|-------------|-----------------|-----|
| Single Eye (F16) | 4.2 mm | 0.59 ms | 1684 |
| Single Eye (INT8) | 4.3 mm | 0.62 ms | 1619 |
| Dual Eye (F16) | 14.2 mm | 1.50 ms | 666 |
| Dual Eye (INT8) | 14.3 mm | 0.93 ms | 1070 |
### Robustness (Dual Eye Model)
| Condition | Screen Error |
|-----------|-------------|
| Dark / Low-light | 13.8 mm |
| With Glasses | 13.9 mm |
| Lazy Eye / Strabismus | 13.5 mm |
## πŸ“¦ Available Models
| Model | File | Size | Best For |
|-------|------|------|----------|
| Single Eye F16 | `gaze_inception_lite_single_f16.tflite` | 161 KB | Ultra-low latency, simple apps |
| Single Eye INT8 | `gaze_inception_lite_single_int8.tflite` | 164 KB | Fastest on mobile NPU/DSP |
| Dual Eye F16 | `gaze_inception_lite_dual_f16.tflite` | 242 KB | Best accuracy, lazy eye support |
| Dual Eye INT8 | `gaze_inception_lite_dual_int8.tflite` | 267 KB | Best accuracy + speed combo |
## πŸ—οΈ Architecture
### Gated Inception Block
```
Input
 β”œβ”€β”€ Branch 1: 1Γ—1 Conv (point features)  ─── Γ— gate[0] ──┐
 β”œβ”€β”€ Branch 2: 1Γ—1 β†’ 3Γ—3 DWConv (local)   ─── Γ— gate[1] ───
 β”œβ”€β”€ Branch 3: 1Γ—1 β†’ 5Γ—5 DWConv (wide)    ─── Γ— gate[2] ───
 β”œβ”€β”€ Branch 4: MaxPool β†’ 1Γ—1 Conv (pool)  ─── Γ— gate[3] ───
 └── Gate Network: GAP β†’ Dense β†’ Sigmoid ──► gate[0..3]   β”‚
                                                          β–Ό
                             Output: Concat(gated branches)
```
The **gate values** (sigmoid outputs in [0, 1]) are learned per sample. For "easy" inputs (centered gaze, good lighting), the network learns to rely on fewer branches; for complex inputs (extreme gaze angles, darkness, glasses), all branches activate. This provides **adaptive computation**: fast when possible, thorough when needed.
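For reference, here is a minimal Keras sketch of one such block. The filter width, activations, and the single-Dense gate head are illustrative assumptions, not the exact training code:
```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_inception_block(x, filters=16):
    """Gated Inception block: four branches, each scaled by a learned gate.
    (Sketch only; `filters` and activation choices are assumptions.)"""
    # Branch 1: 1x1 conv (point features)
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 conv -> 3x3 depthwise conv (local context)
    b2 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b2 = layers.DepthwiseConv2D(3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 conv -> 5x5 depthwise conv (wide context)
    b3 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.DepthwiseConv2D(5, padding="same", activation="relu")(b3)
    # Branch 4: max-pool -> 1x1 conv (pooled features)
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(filters, 1, padding="same", activation="relu")(b4)

    # Gate network: GAP -> Dense -> sigmoid, one gate per branch
    gates = layers.GlobalAveragePooling2D()(x)
    gates = layers.Dense(4, activation="sigmoid")(gates)  # shape (batch, 4)
    gates = layers.Reshape((1, 1, 4))(gates)

    # Scale each branch by its gate, then concatenate along channels
    branches = [b1, b2, b3, b4]
    gated = [branches[i] * gates[..., i:i + 1] for i in range(4)]
    return layers.Concatenate()(gated)
```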
### Full Pipeline (Dual Eye Model)
```
Left Eye (64Γ—64) ──┐
                   β”œβ”€β”€ Shared Eye Backbone ───┐
Right Eye (64Γ—64) β”€β”˜   (Gated Inception Γ—3    β”œβ”€β”€ Concat β†’ Dense β†’ (x,y)
                        + CoordAttention)     β”‚
Face (64Γ—64) ────── Lightweight CNN ──────────┘
```
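A functional-API sketch of this topology, showing how a single backbone instance gives both eyes shared weights. The layer sizes and backbone internals are placeholders, not the shipped model:
```python
import tensorflow as tf
from tensorflow.keras import Model, layers

def build_dual_eye_model():
    left = layers.Input((64, 64, 3), name="left_eye")
    right = layers.Input((64, 64, 3), name="right_eye")
    face = layers.Input((64, 64, 3), name="face")

    # One backbone instance, applied to both eyes -> shared weights
    eye_backbone = tf.keras.Sequential([
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
        # ... 3x Gated Inception blocks + Coordinate Attention here ...
        layers.GlobalAveragePooling2D(),
    ], name="shared_eye_backbone")

    face_cnn = tf.keras.Sequential([
        layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
    ], name="lightweight_face_cnn")

    feats = layers.Concatenate()(
        [eye_backbone(left), eye_backbone(right), face_cnn(face)]
    )
    xy = layers.Dense(2, name="gaze_xy")(feats)  # normalized (x, y)
    return Model([left, right, face], xy)
```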
## πŸš€ Quick Start (Python)
```python
import numpy as np
import tensorflow as tf

# Load the model and allocate tensors
interpreter = tf.lite.Interpreter(model_path="gaze_inception_lite_single_f16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare the eye crop: 64x64 RGB, normalized to [0, 1].
# preprocess_eye is your own eye detection + crop function.
eye_crop = preprocess_eye(frame)
eye_input = np.expand_dims(eye_crop, axis=0).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], eye_input)
interpreter.invoke()

# Output is a normalized (x, y) gaze point; scale it to screen pixels
gaze_xy = interpreter.get_tensor(output_details[0]['index'])[0]
screen_x = gaze_xy[0] * screen_width   # screen size in pixels
screen_y = gaze_xy[1] * screen_height
print(f"Looking at: ({screen_x:.0f}, {screen_y:.0f})")
```
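Depending on how the INT8 variants were converted, their input/output tensors may be quantized rather than float. This helper (a sketch, not part of the model's API) handles both cases:
```python
import numpy as np

def run_gaze(interpreter, eye_crop):
    """Run one inference; `eye_crop` is a (64, 64, 3) float array in [0, 1]."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    x = np.expand_dims(eye_crop, axis=0).astype(np.float32)
    if inp["dtype"] != np.float32:          # quantized input tensor
        scale, zero_point = inp["quantization"]
        x = (x / scale + zero_point).astype(inp["dtype"])

    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()

    y = interpreter.get_tensor(out["index"])[0]
    if out["dtype"] != np.float32:          # de-quantize the output
        scale, zero_point = out["quantization"]
        y = (y.astype(np.float32) - zero_point) * scale
    return y  # normalized (x, y) in [0, 1]
```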
### Android (Java/Kotlin)
```kotlin
// loadModelFile() is your own helper that memory-maps the .tflite asset
val interpreter = Interpreter(loadModelFile("gaze_inception_lite_single_int8.tflite"))

// Input: 1x64x64x3 eye crop normalized to [0, 1]; output: normalized (x, y)
val input = Array(1) { Array(64) { Array(64) { FloatArray(3) } } }
val output = Array(1) { FloatArray(2) }

// Fill `input` with the preprocessed eye crop, then run inference
interpreter.run(input, output)

val gazeX = output[0][0] * screenWidth
val gazeY = output[0][1] * screenHeight
```
## πŸ”§ Training Details
- **Data**: 50,000 synthetic samples with comprehensive augmentations
- **Augmentations**: Dark conditions (30%), glasses (25%), lazy eye (15%), sensor noise (50%), illumination perturbation, diverse skin tones (12), eye colors (7)
- **Optimizer**: Adam with a cosine-decay learning-rate schedule (1e-3 β†’ 1e-5); a configuration sketch follows the references below
- **Loss**: MSE on normalized (x,y) coordinates
- **Architecture Inspiration**:
- [AGE Framework](https://arxiv.org/abs/2603.26945) - augmentation pipeline
- [Gated Compression Layers](https://arxiv.org/abs/2303.08970) - gating mechanism
- [iTracker/GazeCapture](https://arxiv.org/abs/1606.05814) - dual-eye + face architecture
- [Coordinate Attention](https://arxiv.org/abs/2103.02907) - spatial attention
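For reference, the optimizer/loss setup above translates to something like this in Keras. The batch size, epoch count, and resulting step count are hypothetical; only the schedule endpoints and the MSE loss come from this card:
```python
import tensorflow as tf

model = build_dual_eye_model()  # from the architecture sketch above

# Cosine decay from 1e-3 down to 1e-5 (alpha sets the floor as a fraction
# of the initial LR); step count assumes batch size 256 over 100 epochs
steps = 50_000 // 256 * 100
lr = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=steps,
    alpha=1e-5 / 1e-3,
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
    loss="mse",  # MSE on normalized (x, y) coordinates
)
```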
## ⚠️ Limitations
- Trained on **synthetic data** β€” fine-tuning on real gaze data (GazeCapture, ETH-XGaze) will significantly improve accuracy
- Screen coordinate output assumes front-facing phone camera centered above screen
- Requires separate face/eye detection (use MediaPipe Face Mesh for production; a cropping sketch follows this list)
- Lazy eye support is based on simulated strabismus β€” clinical validation needed
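For the face/eye detection step, here is a MediaPipe Face Mesh cropping sketch. The eye-corner landmark indices and the 1.8Γ— crop margin are assumptions to verify against the canonical mesh map:
```python
import cv2
import mediapipe as mp
import numpy as np

# Approximate eye-corner landmark indices on the Face Mesh topology
LEFT_EYE = (33, 133)    # image-left eye corners
RIGHT_EYE = (362, 263)  # image-right eye corners

face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True)

def eye_crops(frame_rgb):
    """Return (left, right) 64x64 RGB eye crops in [0, 1], or None if no face."""
    h, w = frame_rgb.shape[:2]
    result = face_mesh.process(frame_rgb)
    if not result.multi_face_landmarks:
        return None
    lm = result.multi_face_landmarks[0].landmark

    def crop(corners, scale=1.8):
        (x1, y1), (x2, y2) = [(lm[i].x * w, lm[i].y * h) for i in corners]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        half = abs(x2 - x1) * scale / 2  # square box around the eye
        x0, y0 = max(int(cx - half), 0), max(int(cy - half), 0)
        x3, y3 = min(int(cx + half), w), min(int(cy + half), h)
        patch = frame_rgb[y0:y3, x0:x3]
        return cv2.resize(patch, (64, 64)).astype(np.float32) / 255.0

    return crop(LEFT_EYE), crop(RIGHT_EYE)
```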
## πŸ“ License
MIT License β€” free for commercial and non-commercial use.