---
tags:
- chest-xray
- radiology
- visual-question-answering
- differential-vqa
- mimic-cxr
license: apache-2.0
---
# LAPVQA — Differential VQA (Frozen Off-the-shelf Encoders)
Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).
## Description
Task heads for **Differential VQA**: given a *prior* (reference) and a *current* chest X-ray,
answer questions about the radiological changes between them. The heads are trained on
MIMIC-Diff-VQA, one per encoder, on top of five **frozen** encoders. Each `.pt` file is a
plain state dict of `DiffVQAHead`.
## Architecture — `DiffVQAHead`
```
vis_proj : Linear(vis_dim → 512) # shared for both images
frame_emb : Embedding(2, 512) # 0=reference, 1=current
memory : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)] → [B, 2N, 512]
tok_emb : Embedding(50257, 512)
pos_emb : Embedding(200, 512)
decoder : 6 × TransformerDecoderLayer (pre-norm)
lm_head : Linear(512 → 50257, bias=False)
```
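The `memory` row in the layout above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: the `MemoryBuilder` class is hypothetical, and it assumes `vis_proj` is a default `nn.Linear` (with bias) and that the frame embedding is simply added to every patch token of its image.

```python
import torch
import torch.nn as nn

D_MODEL = 512  # model width used throughout the layout above


class MemoryBuilder(nn.Module):
    """Hypothetical helper: builds the decoder memory from two images."""

    def __init__(self, vis_dim: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, D_MODEL)  # shared by both images
        self.frame_emb = nn.Embedding(2, D_MODEL)    # 0=reference, 1=current

    def forward(self, ref_vis: torch.Tensor, curr_vis: torch.Tensor) -> torch.Tensor:
        # ref_vis, curr_vis: [B, N, vis_dim] patch tokens from the frozen encoder
        ref = self.vis_proj(ref_vis) + self.frame_emb.weight[0]
        curr = self.vis_proj(curr_vis) + self.frame_emb.weight[1]
        # Reference tokens come first, then current tokens.
        return torch.cat([ref, curr], dim=1)  # [B, 2N, 512]


builder = MemoryBuilder(vis_dim=768)
ref_vis = torch.randn(2, 5, 768)
curr_vis = torch.randn(2, 5, 768)
memory = builder(ref_vis, curr_vis)
print(memory.shape)  # torch.Size([2, 10, 512])
```

Because the projection is shared, the frame embedding is the only signal that tells the decoder which of the two images a memory token came from.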
| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `siglip_best.pt` | SigLIP | 1152 |
| `owlv2_best.pt` | OWLv2 | 1024 |
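
Since `vis_dim` has to match the checkpoint, a small lookup can keep the two in sync. The sketch below only restates the table above; the `vis_dim_for` helper and the `weights/` directory are hypothetical and not part of the LAPVQA package.

```python
from pathlib import Path

# Feature width per checkpoint, copied from the table above.
VIS_DIM = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "siglip_best.pt": 1152,
    "owlv2_best.pt": 1024,
}


def vis_dim_for(checkpoint_path: str) -> int:
    """Return the encoder feature width for a checkpoint file name."""
    name = Path(checkpoint_path).name
    if name not in VIS_DIM:
        raise ValueError(f"unknown checkpoint {name!r}; expected one of {sorted(VIS_DIM)}")
    return VIS_DIM[name]


print(vis_dim_for("weights/siglip_best.pt"))  # 1152
```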
## Results (test set)
| Encoder | BLEU-1 | BLEU-4 | ROUGE-1 | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.184 | 0.128 | 0.336 | 0.322 |
| CoCa | 0.196 | 0.138 | 0.320 | 0.317 |
| Florence-2 | 0.191 | 0.138 | 0.319 | 0.318 |
| SigLIP | 0.186 | 0.131 | 0.322 | 0.313 |
## Loading
```python
import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead

# Each checkpoint is a plain state dict, so it loads straight into the head.
ckpt = torch.load("coca_best.pt", map_location="cpu")
head = DiffVQAHead(vis_dim=768)  # adjust vis_dim per encoder (see the table above)
head.load_state_dict(ckpt)
head.eval()

# GPT-2 BPE tokenizer; its end-of-text token serves as both BOS and EOS.
enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# curr_vis, ref_vis: [B, N, vis_dim], patch tokens from the frozen encoder
# question_ids:      [B, Q], token ids of the questions
answers = head.generate(
    curr_vis=curr_vis,
    ref_vis=ref_vis,
    prompt_ids=question_ids,
    bos_id=bos_id,
    eos_id=eos_id,
    max_new_tokens=128,
)
decoded = [enc.decode(ids) for ids in answers]
```
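
Because BOS and EOS share one token id, the generated sequences may still carry that id at the start or end. Whether `generate` already strips it is not stated here, so the following is only a defensive sketch: `trim_at_eos` is a hypothetical helper, and it assumes each answer is a flat sequence of token ids.

```python
from typing import List, Sequence

GPT2_EOT_ID = 50256  # id of <|endoftext|> in the GPT-2 vocabulary (enc.eot_token)


def trim_at_eos(ids: Sequence[int], eos_id: int = GPT2_EOT_ID) -> List[int]:
    """Drop a leading BOS (same id as EOS here) and cut at the first EOS after it."""
    tokens = [int(t) for t in ids]
    if tokens and tokens[0] == eos_id:
        tokens = tokens[1:]
    if eos_id in tokens:
        tokens = tokens[: tokens.index(eos_id)]
    return tokens


print(trim_at_eos([50256, 11, 22, 33, 50256, 50256]))  # [11, 22, 33]
```

With such a helper, the final step would read `decoded = [enc.decode(trim_at_eos(ids)) for ids in answers]`.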