---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` utilizes an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
   - **ResNet50**: Captures fine-grained, low-level texture and structural details.
   - *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A Multi-Layer Perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the Vision, Text, and Composition features, dynamically weighting them based on the presence/quality of the modalities (e.g., handles panels with no text gracefully).

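The gating idea can be sketched in a few lines of PyTorch. This is an illustrative simplification, not the repository's implementation: the class name `AdaptiveFusionGate`, the shapes, and the gating network are all assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Hypothetical sketch of gated multi-modal fusion: learn per-modality
    weights from the concatenated features plus a presence mask, so absent
    modalities (e.g., no text) are zeroed out rather than adding noise."""
    def __init__(self, dim=512, n_modalities=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * n_modalities + n_modalities, n_modalities),
            nn.Softmax(dim=-1),
        )

    def forward(self, vision, text, comp, modality_mask):
        # vision/text/comp: (B, dim); modality_mask: (B, 3) with 1 = present
        feats = torch.stack([vision, text, comp], dim=1)          # (B, 3, dim)
        gate_in = torch.cat([vision, text, comp, modality_mask], dim=-1)
        weights = self.gate(gate_in) * modality_mask              # zero absent modalities
        weights = weights / weights.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)         # (B, dim)

# Toy usage: the second panel has no text, yet still yields a 512-d embedding
fuse = AdaptiveFusionGate()
v, t, c = torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
fused = fuse(v, t, c, mask)
print(fused.shape)  # torch.Size([2, 512])
```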
## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).

### Objectives

The encoder was trained from scratch (with frozen base backbones) using three simultaneous objectives:

1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.

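The first objective can be sketched as a page-level InfoNCE loss. This is a minimal illustration only: the `page_infonce` helper, the temperature value, and the batching strategy are assumptions, not the training code from the repository.

```python
import torch
import torch.nn.functional as F

def page_infonce(embeddings, page_ids, temperature=0.07):
    """Illustrative page-level InfoNCE: panels sharing a page_id are
    positives; panels from other pages in the batch are negatives."""
    z = F.normalize(embeddings, dim=-1)                      # (N, D)
    sim = z @ z.t() / temperature                            # (N, N)
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = (page_ids.unsqueeze(0) == page_ids.unsqueeze(1)) & ~eye
    # Exclude self-similarity from the softmax denominator
    sim = sim.masked_fill(eye, float('-inf'))
    log_prob = sim - sim.logsumexp(dim=-1, keepdim=True)
    # Average log-probability over each anchor's positives
    loss = -log_prob.masked_fill(~pos, 0.0).sum(dim=-1) / pos.sum(dim=-1).clamp_min(1)
    return loss.mean()

# Toy batch: 4 panels drawn from 2 pages
emb = torch.randn(4, 512)
pages = torch.tensor([0, 0, 1, 1])
print(page_infonce(emb, pages))  # scalar loss
```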
## Usage

You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Composition (e.g., Aspect Ratio, Area, Center X, Center Y)
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")  # Output: torch.Size([1, 512])
```
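The example above passes zeros for `comp_feats`; in practice these encode panel geometry. The canonical 7-feature layout is defined in `stage3_panel_features_framework.py` — the sketch below shows one plausible construction (the `make_comp_feats` helper and the exact feature order are assumptions for illustration only):

```python
import torch

def make_comp_feats(bbox, page_w, page_h):
    """Hypothetical 7-d geometry vector for one panel. The canonical
    feature order lives in stage3_panel_features_framework.py."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    feats = [
        w / h,                         # aspect ratio
        x1 / page_w, y1 / page_h,      # normalized top-left corner
        x2 / page_w, y2 / page_h,      # normalized bottom-right corner
        (w * h) / (page_w * page_h),   # relative area
        ((x1 + x2) / 2) / page_w,      # normalized center x
    ]
    return torch.tensor(feats).view(1, 1, 7)  # (B=1, N=1, 7)

comp_feats = make_comp_feats((50, 40, 450, 340), page_w=1000, page_h=1500)
print(comp_feats.shape)  # torch.Size([1, 1, 7])
```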

## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (like a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** Can be used to find visually or semantically similar panels across a large database using Cosine Similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
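Retrieval with these embeddings reduces to a dot product once the vectors are L2-normalized. A minimal sketch against a hypothetical in-memory database of precomputed panel embeddings:

```python
import torch
import torch.nn.functional as F

# Assumed: a precomputed database of panel embeddings, shape (num_panels, 512)
db = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(1, 512), dim=-1)

# Cosine similarity is a dot product on L2-normalized vectors
scores = (query @ db.t()).squeeze(0)   # (num_panels,)
topk = torch.topk(scores, k=5)
print(topk.indices.tolist())           # indices of the 5 most similar panels
```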

## Citation

If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).