---
library_name: tensorflow
tags:
- eye-gaze-estimation
- tflite
- mobile
- gated-inception
- coordinate-attention
- on-device
- accessibility
license: mit
pipeline_tag: image-classification
---
# 👁️ GazeInception-Lite: Mobile Eye Gaze Estimation
Lightweight TFLite model that estimates where you're looking on a mobile phone screen.
Built with a novel Gated Inception architecture that learns to skip unnecessary computation branches, making it extremely fast for on-device inference.
## ✨ Key Features
| Feature | Details |
|---|---|
| 🔦 Works in the Dark | Trained with illumination perturbation + low-light augmentation (down to 15% brightness) |
| 👓 Glasses Support | Trained with synthetic glasses overlay (10 frame styles, lens reflections) |
| 👁️ Lazy Eye / Strabismus | Dual-eye architecture processes each eye independently with shared weights |
| ⚡ Gated Inception | Learned sigmoid gates skip inactive branches, cutting wasted compute |
| 📱 Mobile-First | 89,754 params (single) / 136,922 params (dual) |
| 🎯 Coordinate Attention | Encodes spatial position for precise iris localization |
## 📊 Performance
### Accuracy
| Model | Screen Error | Inference (CPU) | FPS |
|---|---|---|---|
| Single Eye (F16) | 4.2 mm | 0.59 ms | 1684 |
| Single Eye (INT8) | 4.3 mm | 0.62 ms | 1619 |
| Dual Eye (F16) | 14.2 mm | 1.50 ms | 666 |
| Dual Eye (INT8) | 14.3 mm | 0.93 ms | 1070 |
### Robustness (Dual Eye Model)
| Condition | Screen Error |
|---|---|
| Dark / Low-light | 13.8 mm |
| With Glasses | 13.9 mm |
| Lazy Eye / Strabismus | 13.5 mm |
## 📦 Available Models
| Model | File | Size | Best For |
|---|---|---|---|
| Single Eye F16 | `gaze_inception_lite_single_f16.tflite` | 161 KB | Ultra-low latency, simple apps |
| Single Eye INT8 | `gaze_inception_lite_single_int8.tflite` | 164 KB | Fastest on mobile NPU/DSP |
| Dual Eye F16 | `gaze_inception_lite_dual_f16.tflite` | 242 KB | Best accuracy, lazy eye support |
| Dual Eye INT8 | `gaze_inception_lite_dual_int8.tflite` | 267 KB | Best accuracy + speed combo |
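
To fetch a variant programmatically, one option is `huggingface_hub`; the `repo_id` below is a placeholder for wherever this card is hosted:

```python
from huggingface_hub import hf_hub_download

# Hypothetical repo_id -- substitute the actual repository for this model card.
model_path = hf_hub_download(
    repo_id="your-namespace/gaze-inception-lite",
    filename="gaze_inception_lite_single_f16.tflite",
)
print(model_path)  # local cache path, ready for tf.lite.Interpreter
```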
## 🏗️ Architecture
### Gated Inception Block
```
Input
 ├── Branch 1: 1×1 Conv (point features) ──── × gate[0]
 ├── Branch 2: 1×1 → 3×3 DWConv (local)  ──── × gate[1]
 ├── Branch 3: 1×1 → 5×5 DWConv (wide)   ──── × gate[2]
 ├── Branch 4: MaxPool → 1×1 Conv (pool) ──── × gate[3]
 │
 └── Gate Network: GAP → Dense → Sigmoid ──── gate[0..3]

Output: Concat(gated branches)
```
The gate values (0-1 sigmoid) are learned per-sample. For "easy" inputs (centered gaze, good lighting), the network learns to rely on fewer branches. For complex inputs (extreme gaze, dark, glasses), all branches activate. This provides adaptive computation: fast when possible, thorough when needed.
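
A minimal Keras sketch of the block as diagrammed; the branch widths, activations, and the `gated_inception_block` name are assumptions rather than the shipped configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def gated_inception_block(x, branch_filters=16):
    # Branch 1: 1x1 pointwise features
    b1 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    # Branch 2: 1x1 bottleneck -> 3x3 depthwise (local context)
    b2 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    b2 = layers.DepthwiseConv2D(3, padding="same", activation="relu")(b2)
    # Branch 3: 1x1 bottleneck -> 5x5 depthwise (wide context)
    b3 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(x)
    b3 = layers.DepthwiseConv2D(5, padding="same", activation="relu")(b3)
    # Branch 4: max-pool -> 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(branch_filters, 1, padding="same", activation="relu")(b4)

    # Gate network: GAP -> Dense -> sigmoid, one [0, 1] gate per branch
    g = layers.GlobalAveragePooling2D()(x)
    g = layers.Dense(4, activation="sigmoid")(g)
    g = layers.Reshape((1, 1, 4))(g)  # broadcast each gate over H x W

    branches = [b1, b2, b3, b4]
    gated = [branches[i] * g[..., i:i + 1] for i in range(4)]
    return layers.Concatenate()(gated)
```

As written, the gates only scale the branch outputs; actually skipping work at inference time takes conditional execution, e.g. thresholding a gate and bypassing its branch when it is near zero.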
### Full Pipeline (Dual Eye Model)
```
Left Eye (64×64) ──┐
                   ├── Shared Eye Backbone ────┐
Right Eye (64×64) ─┘   (Gated Inception ×3     ├── Concat → Dense → (x, y)
                        + CoordAttention)      │
Face (64×64) ────── Lightweight CNN ───────────┘
```
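
The `CoordAttention` block in the backbone refers to Coordinate Attention (Hou et al., 2021), which factorizes spatial attention into per-row and per-column weights so positional information survives pooling. A simplified sketch; the reduction ratio is an assumption, and ReLU stands in for the paper's h-swish:

```python
import tensorflow as tf
from tensorflow.keras import layers

def coordinate_attention(x, reduction=8):
    _, h, w, c = x.shape
    pool_h = tf.reduce_mean(x, axis=2, keepdims=True)     # (B, H, 1, C): per-row stats
    pool_w = tf.reduce_mean(x, axis=1, keepdims=True)     # (B, 1, W, C): per-column stats
    pool_w = tf.transpose(pool_w, [0, 2, 1, 3])           # (B, W, 1, C)

    y = tf.concat([pool_h, pool_w], axis=1)               # (B, H+W, 1, C)
    y = layers.Conv2D(max(c // reduction, 8), 1, activation="relu")(y)

    y_h, y_w = tf.split(y, [h, w], axis=1)
    y_w = tf.transpose(y_w, [0, 2, 1, 3])                 # back to (B, 1, W, C')
    a_h = layers.Conv2D(c, 1, activation="sigmoid")(y_h)  # row attention weights
    a_w = layers.Conv2D(c, 1, activation="sigmoid")(y_w)  # column attention weights
    return x * a_h * a_w                                  # reweight features by position
```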
## 🚀 Quick Start (Python)
```python
import tensorflow as tf
import numpy as np

# Load the model
interpreter = tf.lite.Interpreter(model_path="gaze_inception_lite_single_f16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare eye crop (64x64 RGB, normalized to [0, 1]); `frame` is a camera image
eye_crop = preprocess_eye(frame)  # Your eye detection + crop function
eye_input = np.expand_dims(eye_crop, axis=0).astype(np.float32)

# Run inference
interpreter.set_tensor(input_details[0]['index'], eye_input)
interpreter.invoke()

# Get screen coordinates: the model outputs normalized (x, y) in [0, 1]
gaze_xy = interpreter.get_tensor(output_details[0]['index'])[0]
screen_width, screen_height = 1080, 2340  # your display resolution in pixels
screen_x = gaze_xy[0] * screen_width   # pixels
screen_y = gaze_xy[1] * screen_height  # pixels
print(f"Looking at: ({screen_x:.0f}, {screen_y:.0f})")
```
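
`preprocess_eye` above is left to the integrator. A hedged sketch using MediaPipe Face Mesh (landmarks 33/133 are the corners of one eye in the canonical Face Mesh topology; the crop margin is a guess):

```python
import cv2
import mediapipe as mp
import numpy as np

# Hypothetical helper, not shipped with the model.
_face_mesh = mp.solutions.face_mesh.FaceMesh(max_num_faces=1, refine_landmarks=True)

def preprocess_eye(frame_bgr):
    """Return a 64x64 float32 RGB eye crop in [0, 1], or None if no face found."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    res = _face_mesh.process(rgb)
    if not res.multi_face_landmarks:
        return None
    lm = res.multi_face_landmarks[0].landmark
    h, w = rgb.shape[:2]
    # Landmarks 33 / 133 are the outer / inner corners of one eye
    x1, x2 = lm[33].x * w, lm[133].x * w
    cx = int((x1 + x2) / 2)
    cy = int((lm[33].y + lm[133].y) / 2 * h)
    half = int(abs(x2 - x1))  # square crop, roughly 2x the eye width
    crop = rgb[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    crop = cv2.resize(crop, (64, 64)).astype(np.float32) / 255.0
    return crop
```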
### Android (Java/Kotlin)
```kotlin
// loadModelFile() is your helper that memory-maps the bundled .tflite asset;
// screenWidth / screenHeight are the display resolution in pixels.
val interpreter = Interpreter(loadModelFile("gaze_inception_lite_single_int8.tflite"))

val input = Array(1) { Array(64) { Array(64) { FloatArray(3) } } }  // 1 x 64 x 64 x 3
val output = Array(1) { FloatArray(2) }                             // normalized (x, y)

// Fill input with the preprocessed eye crop, then run
interpreter.run(input, output)

val gazeX = output[0][0] * screenWidth
val gazeY = output[0][1] * screenHeight
```
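
The snippet uses the TensorFlow Lite Java API (`org.tensorflow.lite.Interpreter`, shipped in the `org.tensorflow:tensorflow-lite` Maven artifact); `loadModelFile`, `screenWidth`, and `screenHeight` are app-side code you supply.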
## 🔧 Training Details
- Data: 50,000 synthetic samples with comprehensive augmentations
- Augmentations: dark conditions (30%), glasses (25%), lazy eye (15%), sensor noise (50%), illumination perturbation, diverse skin tones (12) and eye colors (7)
- Optimizer: Adam with cosine-decay LR schedule (1e-3 → 1e-5); a minimal sketch follows this list
- Loss: MSE on normalized (x, y) coordinates
- Architecture inspiration:
  - AGE Framework - augmentation pipeline
  - Gated Compression Layers - gating mechanism
  - iTracker/GazeCapture - dual-eye + face architecture
  - Coordinate Attention - spatial attention
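
A sketch of that recipe; `DECAY_STEPS` and the stand-in model are assumptions (the real network is the gated-inception backbone above):

```python
import tensorflow as tf

DECAY_STEPS = 50_000  # hypothetical: total optimizer steps for the run

# Cosine decay from 1e-3 down to 1e-5 (alpha sets the floor as a fraction of the initial LR)
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=DECAY_STEPS,
    alpha=1e-2,  # final LR = 1e-3 * 1e-2 = 1e-5
)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Flatten(),                       # stand-in for the real backbone
    tf.keras.layers.Dense(2, activation="sigmoid"),  # normalized (x, y) output
])
model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse")
```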
## ⚠️ Limitations
- Trained on synthetic data; fine-tuning on real gaze data (GazeCapture, ETH-XGaze) will significantly improve accuracy
- Screen coordinate output assumes front-facing phone camera centered above screen
- Requires separate face/eye detection (use MediaPipe Face Mesh for production)
- Lazy eye support is based on simulated strabismus β clinical validation needed
## 📄 License
MIT License: free for commercial and non-commercial use.