---
tags:
- chest-xray
- radiology
- visual-question-answering
- differential-vqa
- mimic-cxr
license: apache-2.0
---

# LAPVQA — Differential VQA (Frozen Off-the-shelf Encoders)

Part of the [LAPVQA collection](https://huggingface.co/collections/dmusingu/lapvqa).

## Description

Task heads for **Differential VQA**: given a *prior* (reference) and a *current*
chest X-ray, the head answers questions about the radiological changes between them.
One head is trained on MIMIC-Diff-VQA for each of five **frozen** off-the-shelf
encoders. Each `.pt` file is a plain state dict of `DiffVQAHead`.

## Architecture — `DiffVQAHead`

```
vis_proj   : Linear(vis_dim → 512)   # shared for both images
frame_emb  : Embedding(2, 512)       # 0=reference, 1=current
memory     : [ref_proj + frame_emb(0) ; curr_proj + frame_emb(1)]  → [B, 2N, 512]
tok_emb    : Embedding(50257, 512)
pos_emb    : Embedding(200, 512)
decoder    : 6 × TransformerDecoderLayer (pre-norm)
lm_head    : Linear(512 → 50257, bias=False)
```
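
The memory construction in the listing above can be sketched in PyTorch as follows. This is a minimal illustration, not the actual `DiffVQAHead` code: the class name `MemoryBuilder`, the batch size, and the patch count `N = 196` are assumptions chosen for the example. Only the layer names and shapes come from the listing.

```python
import torch
import torch.nn as nn


class MemoryBuilder(nn.Module):
    """Hypothetical helper that mirrors the vis_proj / frame_emb / memory lines."""

    def __init__(self, vis_dim: int, d_model: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)  # shared for both images
        self.frame_emb = nn.Embedding(2, d_model)    # 0=reference, 1=current

    def forward(self, ref_vis: torch.Tensor, curr_vis: torch.Tensor) -> torch.Tensor:
        # Project each image's patch tokens, then tag them with a frame embedding.
        ref = self.vis_proj(ref_vis) + self.frame_emb.weight[0]
        curr = self.vis_proj(curr_vis) + self.frame_emb.weight[1]
        # Concatenate along the token axis: [B, N, 512] + [B, N, 512] -> [B, 2N, 512]
        return torch.cat([ref, curr], dim=1)


# Assumed example sizes: B=2 image pairs, N=196 patch tokens, vis_dim=768.
builder = MemoryBuilder(vis_dim=768)
ref_vis = torch.randn(2, 196, 768)
curr_vis = torch.randn(2, 196, 768)
memory = builder(ref_vis, curr_vis)
print(memory.shape)  # torch.Size([2, 392, 512])
```

Because `vis_proj` is shared, the frame embedding is the only signal that tells the decoder which tokens belong to the reference image and which to the current one.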

| File | Encoder | vis_dim |
|---|---|---|
| `clip-vit-l14_best.pt` | CLIP ViT-L/14 | 1024 |
| `coca_best.pt` | CoCa | 768 |
| `florence2_best.pt` | Florence-2 | 1024 |
| `siglip_best.pt` | SigLIP | 1152 |
| `owlv2_best.pt` | OWLv2 | 1024 |
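
Since `vis_dim` must match the encoder that produced the checkpoint, one way to avoid mismatches is a small lookup keyed by file name. The sketch below copies the values from the table above. The helper name `vis_dim_for` is hypothetical and not part of the `lapvqa` package.

```python
from pathlib import Path

# File name -> vis_dim, copied from the table above.
VIS_DIMS = {
    "clip-vit-l14_best.pt": 1024,
    "coca_best.pt": 768,
    "florence2_best.pt": 1024,
    "siglip_best.pt": 1152,
    "owlv2_best.pt": 1024,
}


def vis_dim_for(checkpoint_path: str) -> int:
    """Return the encoder feature width for a checkpoint (hypothetical helper)."""
    name = Path(checkpoint_path).name
    if name not in VIS_DIMS:
        raise ValueError(f"unknown checkpoint file: {name}")
    return VIS_DIMS[name]


print(vis_dim_for("checkpoints/coca_best.pt"))  # 768
```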

## Results (test set)

| Encoder | BLEU-1 | BLEU-4 | ROUGE-1 | RadGraph-s |
|---|---|---|---|---|
| CLIP ViT-L/14 | 0.184 | 0.128 | 0.336 | 0.322 |
| CoCa | 0.196 | 0.138 | 0.320 | 0.317 |
| Florence-2 | 0.191 | 0.138 | 0.319 | 0.318 |
| SigLIP | 0.186 | 0.131 | 0.322 | 0.313 |

## Loading

```python
import torch
import tiktoken
from lapvqa.diffvqa.model import DiffVQAHead

# The checkpoint is a plain state dict, so tensors-only loading is sufficient.
ckpt = torch.load("coca_best.pt", map_location="cpu", weights_only=True)
head = DiffVQAHead(vis_dim=768)   # adjust vis_dim per encoder
head.load_state_dict(ckpt)
head.eval()

enc = tiktoken.get_encoding("gpt2")
bos_id = eos_id = enc.eot_token

# curr_vis, ref_vis: [B, N, vis_dim] patch tokens from the frozen encoder (computed by you)
# question_ids: [B, Q] GPT-2 token ids of the question (also supplied by you)
answers = head.generate(
    curr_vis    = curr_vis,
    ref_vis     = ref_vis,
    prompt_ids  = question_ids,   # [B, Q]
    bos_id      = bos_id,
    eos_id      = eos_id,
    max_new_tokens = 128,
)
decoded = [enc.decode(ids) for ids in answers]
```
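
If it is unclear which encoder a checkpoint belongs to, `vis_dim` can probably be read from the state dict itself, because an `nn.Linear` weight has shape `[out_features, in_features]`. The sketch below assumes the projection is stored under the key `vis_proj.weight`, which is inferred from the architecture listing and not confirmed against the real checkpoints. The stand-in object only mimics a tensor's `.shape` so the example runs without a checkpoint file.

```python
from types import SimpleNamespace


def infer_vis_dim(state_dict, key="vis_proj.weight"):
    """Read the input width of the visual projection from a state dict.

    The key name is an assumption based on the architecture listing.
    """
    out_features, in_features = state_dict[key].shape
    return in_features


# Stand-in for a loaded state dict: only the .shape attribute is used here.
fake_ckpt = {"vis_proj.weight": SimpleNamespace(shape=(512, 768))}
print(infer_vis_dim(fake_ckpt))  # 768
```

With a real checkpoint, passing the dict returned by `torch.load` in place of `fake_ckpt` should work the same way, provided the key name matches.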