---
license: apache-2.0
tags:
- age-estimation
- gender-classification
- face-analysis
- vision-transformer
- dinov3
- coral-ordinal-regression
pipeline_tag: image-classification
---
# FaceAge ClientScan
> **Face-only age estimation on LAGENDA 84k: MAE 3.555. The state-of-the-art task-specific model, MiVOLO v2 on face+body, reaches MAE 3.65.**

Age and gender estimation from face crops, using a **DINOv3-ViT-L** backbone with CORAL ordinal regression.
## Performance (LAGENDA 84k benchmark)
| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|-------|-------|--------|--------|-------------|
| **FaceAge ClientScan (ours)** | **face-only** | **3.555** | **75.5%** | **97.75%** |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face+body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |
**Key result**: FaceAge ClientScan achieves **MAE = 3.555** using only the face crop (no body information needed), outperforming MiVOLO v2's paper result of 3.650, which requires both face and body bounding boxes.
### Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)
| Age Group | n | MiVOLO v2 best | **FaceAge ClientScan** | Delta |
|-----------|--:|---------------:|-------------------:|------:|
| 0–12 | 15,369 | 1.677 | **1.548** | ✅ −0.129 |
| 13–17 | 3,930 | 3.365 | **2.845** | ✅ −0.520 |
| 18–25 | 9,975 | 2.989 | **2.877** | ✅ −0.112 |
| 26–35 | 10,303 | **3.348** | 3.775 | ❌ +0.427 |
| 36–50 | 19,234 | 4.484 | **4.195** | ✅ −0.289 |
| 51–65 | 16,350 | 4.794 | **4.329** | ✅ −0.465 |
| 66+ | 9,031 | 6.310 | **5.013** | ✅ −1.297 |
| **Overall** | **84,192** | 3.859 | **3.555** | ✅ −0.304 |
FaceAge ClientScan wins **6/7 age groups**. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.
### Results on additional age datasets

The benchmark protocol follows this paper: [Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures](https://arxiv.org/pdf/2602.07815)
| Dataset | MAE (ours, ONNX) |
|---------|-----------|
| UTK | 5.225 |
| IMDB | 5.119 |
| MORPH | 4.235 |
| AFAD | 3.520 |
| CACD | 5.314 |
| FG-NET | 4.550 |
| APPA | 5.172 |
| AgeDB | 5.933 |
| **Avg** | **4.884** |
## Architecture
```
Face [B, 3, 224, 224]  (+ 10% proportional bbox padding)
        ↓
DINOv3-ViT-L/16 (307M params, pretrained on LVD-1.68B)
        ↓ pooler_output
[B, 1024]
        ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
 ├── age_head:    Linear(512, 100) → CORAL   → age ∈ [0, 100]
 └── gender_head: Linear(512, 2)   → softmax → {female, male}
```
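For orientation, a minimal PyTorch sketch of the head described above; `FaceAgeHead` and its attribute names are illustrative assumptions, not the checkpoint's actual module names:

```python
import torch
import torch.nn as nn

class FaceAgeHead(nn.Module):
    """Illustrative head over the DINOv3 pooler_output (names are hypothetical)."""
    def __init__(self, dim: int = 1024, hidden: int = 512, n_ages: int = 100):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(0.1),
        )
        self.age_head = nn.Linear(hidden, n_ages)  # CORAL logits, k = 0..99
        self.gender_head = nn.Linear(hidden, 2)    # {female, male}

    def forward(self, pooled: torch.Tensor):
        h = self.trunk(pooled)                     # [B, 1024] -> [B, 512]
        return self.age_head(h), self.gender_head(h)
```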
**CORAL ordinal regression**: age = Σ σ(logit_k) for k = 0..99. Exploits the ordinal structure of ages for better calibration than standard cross-entropy.
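In code the decode is one sigmoid and one sum; a minimal NumPy sketch (the raw-ONNX example below uses exactly this logic):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """age = sum_k sigmoid(logit_k) for k = 0..99, giving age in [0, 100]."""
    return float((1.0 / (1.0 + np.exp(-age_logits))).sum())
```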
**Important**: use 10% proportional padding when cropping the face bbox before inference; this matches the training setup and is required to reproduce MAE = 3.555.
## Face crop helper (required for MAE=3.555)
Apply **10% proportional padding** before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.
```python
import numpy as np

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop a face bbox with proportional padding (pad=0.10 → 10% each side)."""
h, w = image_rgb.shape[:2]
pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
return image_rgb[y0:y1, x0:x1]
```
## Batched inference (PyTorch, recommended for benchmarks)
```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel
# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)
BATCH_SIZE = 32 # increase if you have enough RAM
NUM_WORKERS = 8 # parallel image loading
processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()
def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
h, w = image_rgb.shape[:2]
pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
return image_rgb[y0:y1, x0:x1]
class FaceDataset(Dataset):
def __init__(self, df, root, processor):
self.df = df.reset_index(drop=True)
self.root = root
self.processor = processor
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
row = self.df.iloc[idx]
img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
pixel_values = self.processor(images=Image.fromarray(face),
return_tensors="pt")["pixel_values"][0]
return pixel_values, row.img_name
df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"
dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
pin_memory=False)
results = {}  # img_name -> {"age": float, "gender": str}
with torch.no_grad():
for pixel_values, img_names in tqdm(loader, desc="Inference"):
outputs = model(pixel_values=pixel_values)
ages = outputs.age_output.tolist()
genders = outputs.gender_class_idx.tolist()
for name, age, g in zip(img_names, ages, genders):
results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
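To turn `results` into benchmark numbers, a minimal sketch of the MAE and CS@5 computation; it assumes one annotated face per image (the `results` dict above keys on `img_name`) and the `age` column already loaded from the CSV:

```python
import numpy as np

preds = np.array([results[name]["age"] for name in df.img_name])
labels = df.age.to_numpy(dtype=np.float32)
err = np.abs(preds - labels)

print(f"MAE: {err.mean():.3f}")           # mean absolute error in years
print(f"CS@5: {(err <= 5).mean():.2%}")   # share of predictions within 5 years
```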
> **Tip**: for even faster CPU inference use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.
## Usage (PyTorch, single image)
```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel
processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()
# 1. Load full image and apply 10% padded crop
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox; crop_face defined above
# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
outputs = model(**inputs)
age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f} Gender: {gender} ({conf:.0%})")
```
## Usage (ONNX, no PyTorch needed)
> **Standalone inference script**: [github.com/TrungThanhTran/faceage-ClientScan](https://github.com/TrungThanhTran/faceage-ClientScan)
> Includes `infer_onnx.py` with auto-download, plus single-image and LAGENDA-benchmark modes.
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt
# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320
# LAGENDA MAE benchmark
python infer_onnx.py \
--lagenda_dir /path/to/lagenda \
--annotation_csv lagenda_test.csv \
--batch_size 256
```
Or use the Python API directly:
```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image
model = FaceAgeModel() # auto-downloads ONNX from HuggingFace
img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out) # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```
Or raw ONNX (manual):
```python
import numpy as np
import onnxruntime as ort
from PIL import Image
sess = ort.InferenceSession("faceage_dino_fp32.onnx",
providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
def preprocess(img_rgb: np.ndarray) -> np.ndarray:
"""HxWx3 uint8 RGB β†’ [1,3,224,224] float32, ImageNet normalised."""
pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
arr = np.asarray(pil, dtype=np.float32) / 255.0
arr = (arr - MEAN) / STD
return arr.transpose(2, 0, 1)[np.newaxis]
# 1. Load image, apply 10% padded crop
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320) # your bbox here
# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age = float((1 / (1 + np.exp(-age_logits[0]))).sum()) # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f} Gender: {gender}")
```
## Reproducing MAE=3.555
```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt
python infer_onnx.py \
--lagenda_dir /path/to/lagenda \
--annotation_csv lagenda_test.csv \
--batch_size 256
```
## Training
Multi-phase fine-tuning on DINOv3-ViT-L:
| Phase | Backbone | LR | Key change |
|-------|----------|----|-----------|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |
Training data: our in-house collection of 4M images.
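A hedged sketch of the per-phase unfreezing from the table above; the `model.backbone.encoder.layer` path and the 24-block count are assumptions about a ViT-style implementation, not the checkpoint's verified attribute names:

```python
import torch

def set_trainable(model: torch.nn.Module, n_unfrozen_blocks: int) -> None:
    """Freeze the backbone, then unfreeze the last n transformer blocks.

    Heads stay trainable throughout; `model.backbone.encoder.layer`
    is a hypothetical attribute path.
    """
    for p in model.backbone.parameters():
        p.requires_grad = False
    blocks = model.backbone.encoder.layer           # hypothetical path
    for blk in blocks[len(blocks) - n_unfrozen_blocks:]:
        for p in blk.parameters():
            p.requires_grad = True

# Phase schedule from the table above:
# set_trainable(model, 0)    # phase 1: heads only,   lr=1e-3
# set_trainable(model, 4)    # phase 2: top 4 blocks, lr=1e-4
# set_trainable(model, 24)   # phases 3-4: all blocks, lr=3e-5 then 3e-6
```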
## Citation
```bibtex
@misc{faceage-clientscan-2026,
title = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
author = {Trung Thanh Tran},
year = {2026},
url = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```
Related work:
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv 2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020