---
license: apache-2.0
tags:
- age-estimation
- gender-classification
- face-analysis
- vision-transformer
- dinov3
- coral-ordinal-regression
pipeline_tag: image-classification
---

# FaceAge ClientScan

> **Face-only age estimation: MAE 3.555 on the LAGENDA 84k benchmark. The state-of-the-art task-specific model, MiVOLO v2, reaches MAE 3.65 using face + body input.**

Age and gender estimation from face crops using a **DINOv3-ViT-L** backbone with CORAL ordinal regression.

## Performance (LAGENDA 84k benchmark)

| Model | Input | MAE ↓ | CS@5 ↑ | Gender Acc ↑ |
|-------|-------|-------|--------|--------------|
| **FaceAge ClientScan (ours)** | **face-only** | **3.555** | **75.5%** | **97.75%** |
| MiVOLO v2 (paper) | face + body | 3.650 | 74.48% | 97.99% |
| MiVOLO v1 (paper) | face + body | 3.990 | 71.27% | 97.36% |
| MiVOLO v2 (measured, face + body) | face + body | 3.859 | 76.5% | 96.96% |
| MiVOLO v2 (measured, face-only) | face only | 4.224 | 69.7% | 96.05% |

**Key result**: FaceAge ClientScan achieves **MAE=3.555** using only the face crop, with no body information needed, outperforming MiVOLO v2's paper result of 3.650, which requires both face and body bounding boxes.

### Per age-group MAE (FaceAge ClientScan vs MiVOLO v2 best)

| Age Group | n | MiVOLO v2 best | **FaceAge ClientScan** | Delta |
|-----------|--:|---------------:|-----------------------:|------:|
| 0–12 | 15,369 | 1.677 | **1.548** | ✅ −0.129 |
| 13–17 | 3,930 | 3.365 | **2.845** | ✅ −0.520 |
| 18–25 | 9,975 | 2.989 | **2.877** | ✅ −0.112 |
| 26–35 | 10,303 | **3.348** | 3.775 | ❌ +0.427 |
| 36–50 | 19,234 | 4.484 | **4.195** | ✅ −0.289 |
| 51–65 | 16,350 | 4.794 | **4.329** | ✅ −0.465 |
| 66+ | 9,031 | 6.310 | **5.013** | ✅ −1.297 |
| **Overall** | **84,192** | 3.859 | **3.555** | ✅ −0.304 |

FaceAge ClientScan wins **6/7 age groups**. The only group where MiVOLO v2 leads is 26–35, where body context likely helps.

### Results on several age datasets

The benchmark protocol follows this paper: [Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures](https://arxiv.org/pdf/2602.07815)

| Dataset | Ours ONNX MAE |
|---------|---------------|
| UTK | 5.225 |
| IMDB | 5.119 |
| MORPH | 4.235 |
| AFAD | 3.520 |
| CACD | 5.314 |
| FG-NET | 4.550 |
| APPA | 5.172 |
| AgeDB | 5.933 |
| **Avg** | **4.884** |

## Architecture

```
Face [B, 3, 224, 224] (+ 10% proportional bbox padding)
        ↓
DINOv3-ViT-L/16 (307M params, pretrained on LVD-1.68B)
        ↓ pooler_output
[B, 1024]
        ↓ LayerNorm → Linear(1024→512) → GELU → Dropout(0.1)
[B, 512]
 ├── age_head: Linear(512, 100) → CORAL → age ∈ [0, 100]
 └── gender_head: Linear(512, 2) → softmax → {female, male}
```
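
A minimal PyTorch sketch of the trunk and the two heads in the diagram; the class and attribute names here are illustrative, not the packaged model's actual modules.

```python
import torch
import torch.nn as nn

class FaceAgeHeads(nn.Module):
    """Illustrative sketch of the shared trunk + two heads (names are hypothetical)."""
    def __init__(self, dim: int = 1024, hidden: int = 512, num_ages: int = 100):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Dropout(0.1),
        )
        self.age_head = nn.Linear(hidden, num_ages)  # 100 CORAL threshold logits
        self.gender_head = nn.Linear(hidden, 2)      # {female, male} logits

    def forward(self, pooled: torch.Tensor):
        h = self.trunk(pooled)                       # [B, 1024] -> [B, 512]
        return self.age_head(h), self.gender_head(h)
```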

**CORAL ordinal regression**: age = Σ σ(logit_k) for k = 0..99. Exploits the ordinal structure of ages for better calibration than standard cross-entropy.
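
A minimal sketch of the decode step (the same formula as the CORAL decode line in the raw-ONNX example below):

```python
import numpy as np

def coral_decode(age_logits: np.ndarray) -> float:
    """Predicted age = Σ σ(logit_k): sum of the 100 per-threshold sigmoid probabilities."""
    probs = 1.0 / (1.0 + np.exp(-age_logits))  # σ(logit_k), k = 0..99
    return float(probs.sum())
```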

**Important**: use 10% proportional padding when cropping the face bbox before inference; this matches the training setup and is required to reproduce MAE=3.555.

## Face crop helper (required for MAE=3.555)

Apply **10% proportional padding** before passing the crop to the model. This is critical: without it, MAE degrades to ~3.758.

```python
import numpy as np
from PIL import Image

def crop_face(image_rgb: np.ndarray,
              x0: float, y0: float, x1: float, y1: float,
              pad: float = 0.10) -> np.ndarray:
    """Crop a face bbox with proportional padding. pad=0.10 → 10% on each side."""
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]
```

## Batched inference (PyTorch, recommended for benchmarks)

```python
import torch
import numpy as np
import pandas as pd
from PIL import Image
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import AutoImageProcessor, AutoModel

# Limit threads: on big servers PyTorch over-subscribes cores
torch.set_num_threads(8)

BATCH_SIZE = 32   # increase if you have enough RAM
NUM_WORKERS = 8   # parallel image loading

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()


def crop_face(image_rgb, x0, y0, x1, y1, pad=0.10):
    h, w = image_rgb.shape[:2]
    pw, ph = (x1 - x0) * pad, (y1 - y0) * pad
    x0 = max(0, int(x0 - pw)); y0 = max(0, int(y0 - ph))
    x1 = min(w, int(x1 + pw)); y1 = min(h, int(y1 + ph))
    return image_rgb[y0:y1, x0:x1]


class FaceDataset(Dataset):
    def __init__(self, df, root, processor):
        self.df = df.reset_index(drop=True)
        self.root = root
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_rgb = np.array(Image.open(self.root + row.img_name).convert("RGB"))
        face = crop_face(img_rgb, row.face_x0, row.face_y0, row.face_x1, row.face_y1)
        pixel_values = self.processor(images=Image.fromarray(face),
                                      return_tensors="pt")["pixel_values"][0]
        return pixel_values, row.img_name


df = pd.read_csv("lagenda_annotation.csv")
df = df[df.age != -1].reset_index(drop=True)
ROOT = "/path/to/lagenda/"

dataset = FaceDataset(df, ROOT, processor)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS,
                    pin_memory=False)

results = {}
with torch.no_grad():
    for pixel_values, img_names in tqdm(loader, desc="Inference"):
        outputs = model(pixel_values=pixel_values)
        ages = outputs.age_output.tolist()
        genders = outputs.gender_class_idx.tolist()
        for name, age, g in zip(img_names, ages, genders):
            results[name] = {"age": age, "gender": "male" if g == 1 else "female"}
```
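
To turn `results` into benchmark numbers, here is a small scoring sketch, assuming `df` is the filtered annotation DataFrame from the loop above with a ground-truth `age` column:

```python
import numpy as np

# Score predictions against LAGENDA labels (df and results from the loop above).
preds = np.array([results[name]["age"] for name in df.img_name])
labels = df.age.to_numpy(dtype=np.float32)
errors = np.abs(preds - labels)

mae = errors.mean()               # mean absolute error in years (lower is better)
cs5 = (errors <= 5).mean() * 100  # CS@5: % of predictions within 5 years
print(f"MAE={mae:.3f}  CS@5={cs5:.1f}%")
```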

> **Tip**: for even faster CPU inference use the ONNX version (`infer_onnx.py`), which is ~3× faster than PyTorch on CPU.

## Single-image inference (PyTorch)

```python
import torch
import numpy as np
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("TrungTran/faceage_ClientScan")
model = AutoModel.from_pretrained("TrungTran/faceage_ClientScan", trust_remote_code=True)
model.eval()

# 1. Load full image and apply 10% padded crop
#    (crop_face is the helper defined in the section above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run model
inputs = processor(images=Image.fromarray(face), return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

age = outputs.age_output.item()
gender = "male" if outputs.gender_class_idx.item() == 1 else "female"
conf = outputs.gender_probs[0, outputs.gender_class_idx.item()].item()
print(f"Age: {age:.1f}  Gender: {gender} ({conf:.0%})")
```

## Usage (ONNX, no PyTorch needed)

> **Standalone inference script**: [github.com/TrungThanhTran/faceage-ClientScan](https://github.com/TrungThanhTran/faceage-ClientScan)
> Includes `infer_onnx.py` with auto-download, plus single-image and LAGENDA benchmark modes.

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

# Single image
python infer_onnx.py --image photo.jpg --bbox 120 80 300 320

# LAGENDA MAE benchmark
python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

Or use the Python API directly:

```python
from infer_onnx import FaceAgeModel, crop_face
import numpy as np
from PIL import Image

model = FaceAgeModel()  # auto-downloads the ONNX weights from HuggingFace

img = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img, x0=120, y0=80, x1=300, y1=320)
out = model.predict(face)
print(out)  # {'age': 34.2, 'gender': 'male', 'gender_conf': 0.981}
```

Or run the raw ONNX session manually:

```python
import numpy as np
import onnxruntime as ort
from PIL import Image

sess = ort.InferenceSession("faceage_dino_fp32.onnx",
                            providers=["CPUExecutionProvider"])
in_name = sess.get_inputs()[0].name

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(img_rgb: np.ndarray) -> np.ndarray:
    """HxWx3 uint8 RGB → [1, 3, 224, 224] float32, ImageNet-normalised."""
    pil = Image.fromarray(img_rgb).resize((224, 224), Image.BICUBIC)
    arr = np.asarray(pil, dtype=np.float32) / 255.0
    arr = (arr - MEAN) / STD
    return arr.transpose(2, 0, 1)[np.newaxis]

# 1. Load image, apply 10% padded crop (crop_face from the helper section above)
img_rgb = np.array(Image.open("photo.jpg").convert("RGB"))
face = crop_face(img_rgb, x0=120, y0=80, x1=300, y1=320)  # your bbox here

# 2. Run ONNX
age_logits, gender_logits = sess.run(None, {in_name: preprocess(face)})
age = float((1 / (1 + np.exp(-age_logits[0]))).sum())  # CORAL decode
gender = "male" if gender_logits[0].argmax() == 1 else "female"
print(f"Age: {age:.1f}  Gender: {gender}")
```

## Reproducing MAE=3.555

```bash
git clone https://github.com/TrungThanhTran/faceage-ClientScan.git
cd faceage-ClientScan
pip install -r requirements.txt

python infer_onnx.py \
    --lagenda_dir /path/to/lagenda \
    --annotation_csv lagenda_test.csv \
    --batch_size 256
```

## Training

Multi-phase fine-tuning of DINOv3-ViT-L (a sketch of the unfreezing schedule follows the table):

| Phase | Backbone | LR | Key change |
|-------|----------|----|------------|
| 1 | Frozen (all 24 blocks) | 1e-3 | Head training only |
| 2 | Top 4 blocks unfrozen | 1e-4 | Partial fine-tuning |
| 3 | All blocks unfrozen | 3e-5 | Full fine-tuning |
| 4 | All blocks | 3e-6 | Age-group reweighting, best epoch MAE=3.555 |

Training data: our own collection of 4M images.

## Citation

```bibtex
@misc{faceage-clientscan-2026,
  title  = {FaceAge ClientScan: Face-Only Age \& Gender Estimation},
  author = {Trung Thanh Tran},
  year   = {2026},
  url    = {https://huggingface.co/TrungTran/faceage_ClientScan}
}
```

Related work:
- DINOv3: Meta AI, "DINOv3: Scaling Up Vision Foundation Models", 2025
- MiVOLO: Kuprashevich & Tolstykh, arXiv:2307.04616
- LAGENDA: Bhuiyan et al., 2023
- CORAL: Cao et al., Pattern Recognition Letters 2020