dmusingu/lapvqa-diffvqa-native

@@ -14,10 +14,8 @@ Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapv
 ## Description
-DiffVQA models trained **end-to-end** (encoder + task head jointly fine-tuned), providing
-a strong upper bound compared to the frozen-encoder variant in
-[`lapvqa-diffvqa`](https://huggingface.co/dmusingu/lapvqa-diffvqa).
-MAE-ViT-L/16 is the primary encoder studied in this native setting.
 ## Results (test set, MAE-ViT-L/16)
@@ -25,12 +23,22 @@ MAE-ViT-L/16 is the primary encoder studied in this native setting.
 |---|---|---|---|
 | 0.472 | 0.573 | 0.288 | 0.938 |
-## Files
-| File | Encoder backbone |
-|---|---|
-| `clip-vit-l14_best.pt` | CLIP ViT-L/14 (fine-tuned) |
-| `coca_best.pt` | CoCa (fine-tuned) |
-| `florence2_best.pt` | Florence-2 (fine-tuned) |
-| `mae-vit-l16_best.pt` | MAE ViT-L/16 (fine-tuned) |
-| `siglip_best.pt` | SigLIP (fine-tuned) |

 ## Description
+DiffVQA models trained **end-to-end** (encoder and task head fine-tuned jointly). Each `.pt` file
+is a plain state dict for `DiffVQAHead`. MAE-ViT-L/16 is the primary encoder studied in this setting.
 ## Results (test set, MAE-ViT-L/16)
 |---|---|---|---|
 | 0.472 | 0.573 | 0.288 | 0.938 |
+| File | Encoder | vis_dim |
+|---|---|---|
+| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
+| `coca_best.pt` | CoCa | 768 |
+| `florence2_best.pt` | Florence-2 | 1024 |
+| `mae-vit-l16_best.pt` | MAE ViT-L/16 | 1024 |
+| `siglip_best.pt` | SigLIP | 1152 |
+## Loading
+```python
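+import torch
+from lapvqa.diffvqa.model import DiffVQAHead
+
+# The checkpoint is a plain state dict, so weights_only=True (PyTorch >= 1.13) is sufficient.
+ckpt = torch.load("mae-vit-l16_best.pt", map_location="cpu", weights_only=True)
+# vis_dim must match the table entry for the chosen checkpoint (1024 for MAE ViT-L/16).
+head = DiffVQAHead(vis_dim=1024)
+head.load_state_dict(ckpt)
+head.eval()  # inference mode: disables dropout and batch-norm updates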
+```
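
Because each checkpoint needs the `vis_dim` listed in the table, a small lookup can keep the two in sync. The following is a minimal sketch and not part of the commit: `VIS_DIM_BY_CHECKPOINT` and `vis_dim_for` are hypothetical names introduced here for illustration, and the values are copied from the file table above.

```python
from pathlib import Path

# vis_dim per checkpoint file, copied from the table in the model card.
# This mapping and the helper below are illustrative, not part of lapvqa.
VIS_DIM_BY_CHECKPOINT = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "mae-vit-l16_best.pt": 1024,
    "siglip_best.pt": 1152,
}


def vis_dim_for(checkpoint_path):
    """Return the vis_dim listed for a checkpoint, keyed by its file name."""
    name = Path(checkpoint_path).name
    try:
        return VIS_DIM_BY_CHECKPOINT[name]
    except KeyError:
        known = ", ".join(sorted(VIS_DIM_BY_CHECKPOINT))
        raise ValueError(
            f"unknown checkpoint {name!r}; expected one of: {known}"
        ) from None


print(vis_dim_for("checkpoints/siglip_best.pt"))  # 1152
```

With such a helper, the head could presumably be built as `DiffVQAHead(vis_dim=vis_dim_for(path))` before calling `load_state_dict`, so a mismatched dimension fails early with a readable error instead of a shape mismatch.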