---
license: apache-2.0
tags:
- age-estimation
- gender-classification
- face-analysis
- vision-transformer
- dinov3
- coral-ordinal-regression
pipeline_tag: image-classification
---

# FaceAge ClientScan

> **Face-only age estimation: MAE 3.555 on the LAGENDA 84k benchmark. The state-of-the-art task-specific model, MiVOLO v2, reaches MAE 3.65 using face + body input.**

Age and gender estimation from face crops using a **DINOv3-ViT-L** backbone with CORAL ordinal regression.

## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|-------|-------|-------|--------|--------------|
| **FaceAge ClientScan (ours)** | **face-only** | **3.555** | **75.5%** | **97.75%** |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face + body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |

**Key result**: FaceAge ClientScan achieves **MAE=3.555** using only the face crop, with no body information needed, outperforming MiVOLO v2's paper result of 3.650, which requires both face and body bounding boxes.

### Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)

| Age Group | n | MiVOLO v2 best | **FaceAge ClientScan** | Delta |
|-----------|--:|---------------:|-----------------------:|------:|
| 0–12 | 15,369 | 1.677 | **1.548** | ✅ −0.129 |
| 13–17 | 3,930 | 3.365 | **2.845** | ✅ −0.520 |
| 18–25 | 9,975 | 2.989 | **2.877** | ✅ −0.112 |
| 26–35 | 10,303 | **3.348** | 3.775 | ❌ +0.427 |
| 36–50 | 19,234 | 4.484 | **4.195** | ✅ −0.289 |
| 51–65 | 16,350 | 4.794 | **4.329** | ✅ −0.465 |
| 66+ | 9,031 | 6.310 | **5.013** | ✅ −1.297 |
| **Overall** | **84,192** | 3.859 | **3.555** | ✅ −0.304 |

FaceAge ClientScan wins **6/7 age groups**. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.

### Results on several age datasets

The benchmark protocol follows this paper: [Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures](https://arxiv.org/pdf/2602.07815)

| Dataset | Ours ONNX MAE |
|---------|---------------|
| UTK | 5.225 |
| IMDB | 5.119 |
| MORPH | 4.235 |
| AFAD | 3.520 |
| CACD | 5.314 |
| FG-NET | 4.550 |
| APPA | 5.172 |
| AgeDB | 5.933 |
| **Avg** | **4.884** |

## Architecture

```
Face [B, 3, 224, 224] (+ 10% proportional bbox padding)
        ↓
DINOv3-ViT-L/16 (307M params, pretrained on LVD-1.68B)
        ↓ pooler_output
[B, 1024]
        ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
 ├── age_head: Linear(512, 100) → CORAL → age ∈ [0, 100]
 └── gender_head: Linear(512, 2) → softmax → {female, male}
```
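
A minimal PyTorch sketch of the trunk and the two heads in the diagram; the class and attribute names here are illustrative, not the packaged model's actual modules.

```python
import torch
import torch.nn as nn

class FaceAgeHeads(nn.Module):
    """Illustrative sketch of the shared trunk + two heads (names are hypothetical)."""
    def __init__(self, dim: int = 1024, hidden: int = 512, num_ages: int = 100):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(0.1),
        )
        self.age_head = nn.Linear(hidden, num_ages)  # 100 CORAL threshold logits
        self.gender_head = nn.Linear(hidden, 2)      # {female, male} logits

    def forward(self, pooled: torch.Tensor):
        h = self.trunk(pooled)                       # [B, 1024] -> [B, 512]
        return self.age_head(h), self.gender_head(h)
```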

**CORAL ordinal regression**: age = Σ σ(logit_k) for k = 0..99. Exploits the ordinal structure of ages for better calibration than standard cross-entropy.
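
A minimal sketch of the decode step (the same formula as the CORAL decode line in the raw-ONNX example below):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """Predicted age = Σ σ(logit_k): sum of the 100 per-threshold sigmoid probabilities."""
    probs = 1.0 / (1.0 + np.exp(-age_logits))  # σ(logit_k), k = 0..99
    return float(probs.sum())
```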

**Important**: use 10% proportional padding when cropping the face bbox before inference; this matches the training setup and is required to reproduce MAE=3.555.

## Face crop helper (required for MAE=3.555)

Apply **10% proportional padding** before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.

```python
import numpy as np
from PIL import Image

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop a face bbox with proportional padding. pad=0.10 → 10% on each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```

## Batched inference (PyTorch, recommended for benchmarks)

```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name


df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                    pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
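
To turn `results` into benchmark numbers, here is a small scoring sketch, assuming `df` is the filtered annotation DataFrame from the loop above with a ground-truth `age` column:

```python
import numpy as np

# Score predictions against LAGENDA labels (df and results from the loop above).
preds = np.array([results[name]["age"] for name in df.img_name])
labels = df.age.to_numpy(dtype=np.float32)
errors = np.abs(preds - labels)

mae = errors.mean()               # mean absolute error in years (lower is better)
cs5 = (errors <= 5).mean() * 100  # CS@5: % of predictions within 5 years
print(f"MAE={mae:.3f}  CS@5={cs5:.1f}%")
```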

> **Tip**: for even faster CPU inference use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.

## Single-image inference (PyTorch)

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop
#    (crop_face is the helper defined in the section above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```

## Usage (ONNX, no PyTorch needed)

> **Standalone inference script**: [github.com/TrungThanhTran/faceage-ClientScan](https://github.com/TrungThanhTran/faceage-ClientScan)
> Includes `infer_onnx.py` with auto-download, plus single-image and LAGENDA benchmark modes.

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

Or use the Python API directly:

```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()  # auto-downloads the ONNX weights from HuggingFace

img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```

Or run the raw ONNX session manually:

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("faceage_dino_fp32.onnx",
                            providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1, 3, 224, 224] float32, ImageNet-normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]

# 1. Load image, apply 10% padded crop (crop_face from the helper section above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age = float((1 / (1 + np.exp(-age_logits[0]))).sum())  # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```

## Reproducing MAE=3.555

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

## Training

Multi-phase fine-tuning of DINOv3-ViT-L (a sketch of the unfreezing schedule follows the table):

| Phase | Backbone | LR | Key change |
|-------|----------|----|------------|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |

Training data: our own collection of 4M images.

## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```

Related work:
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv:2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020