---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder that produces dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it produces a single **512-dimensional vector** per panel. These embeddings are intended for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` uses an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
   - **ResNet50**: Captures fine-grained, low-level texture and structural details.
   - *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A Multi-Layer Perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the Vision, Text, and Composition features, dynamically weighting them based on the presence/quality of the modalities (e.g., handles panels with no text gracefully).
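
The gating idea in step 4 can be illustrated with a small, framework-free sketch. This is not the repository's implementation: the function names `masked_softmax` and `fuse` are hypothetical, and the per-modality scores, which the real model would produce with learned layers, are supplied by hand here. The sketch only shows how a modality mask can remove a missing modality (such as absent text) from the weighted combination.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over modality scores that ignores absent modalities (mask == 0)."""
    active = [s for s, m in zip(scores, mask) if m > 0]
    if not active:
        raise ValueError("at least one modality must be present")
    peak = max(active)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m > 0 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores, mask):
    """Weighted sum of same-sized modality vectors using the masked gate weights."""
    weights = masked_softmax(scores, mask)
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Toy 2-d features for [vision, text, composition]; the text modality is missing,
# so its (large) score must not influence the result.
vision, text, comp = [1.0, 1.0], [9.0, 9.0], [3.0, 3.0]
fused = fuse([vision, text, comp], scores=[0.0, 5.0, 0.0], mask=[1.0, 0.0, 1.0])
print(fused)
```

Because the masked modality receives a weight of exactly zero, the remaining weights still sum to one, which is one simple way a panel without text can be handled without special-casing downstream code.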

## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).

### Objectives
The encoder's new layers (adapters, fusion, and gating) were trained from scratch on top of frozen pre-trained base backbones, using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This guards against collapsed representations and helps individual panels retain their distinct sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.
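
As a rough illustration of the first objective, the sketch below computes a page-level InfoNCE loss in plain Python. It is an assumption-laden toy, not the training code: the function name `page_info_nce`, the temperature of `0.1`, the use of cosine similarity, and the way multiple same-page positives are summed in the numerator are all choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def page_info_nce(embeddings, page_ids, temperature=0.1):
    """Mean InfoNCE loss where panels sharing a page id are positives."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives, everything = 0.0, 0.0
        for j, other in enumerate(embeddings):
            if i == j:
                continue  # an anchor is never compared with itself
            weight = math.exp(cosine(anchor, other) / temperature)
            everything += weight
            if page_ids[j] == page_ids[i]:
                positives += weight
        if positives > 0:  # skip anchors with no same-page partner
            losses.append(-math.log(positives / everything))
    return sum(losses) / len(losses) if losses else 0.0


# Four toy panel embeddings: the first two point one way, the last two another.
panels = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
aligned = page_info_nce(panels, page_ids=[0, 0, 1, 1])     # pages match clusters
misaligned = page_info_nce(panels, page_ids=[0, 1, 0, 1])  # pages cut across them
print(aligned < misaligned)
```

The loss is low when same-page panels are already close in embedding space and high when page membership cuts across the clusters, which is the pressure the objective applies during training.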

## Usage

You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo and making src/version2 importable (e.g., via PYTHONPATH)
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device) # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

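# Composition: 7 geometric features per panel (e.g., aspect ratio, area, center x, center y).
# Zeros are a placeholder here; use real panel geometry in practice.
comp_feats = torch.zeros(1, 1, 7, device=device)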

# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}") # Output: torch.Size([1, 512])
```
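
The example above passes an all-zero placeholder for `comp_feats`. The sketch below shows how the kinds of geometric descriptors named in the Compositional Branch could be derived from a panel bounding box. It is illustrative only: the helper `composition_features` and its field names are hypothetical, and the exact 7-value layout and ordering expected by the checkpoint is defined in the repository's Stage 3 script and is not reproduced here.

```python
def composition_features(box, page_width, page_height):
    """Geometric descriptors for one panel.

    box is (x0, y0, x1, y1) in page pixels, with (x0, y0) the top-left corner.
    Positions and sizes are normalized by the page dimensions.
    """
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        raise ValueError("box must have positive width and height")
    return {
        "aspect_ratio": width / height,
        "relative_area": (width * height) / (page_width * page_height),
        "center_x": (x0 + x1) / 2 / page_width,
        "center_y": (y0 + y1) / 2 / page_height,
        "width_norm": width / page_width,
        "height_norm": height / page_height,
    }


# A panel covering the top-left quarter of a 100 x 200 pixel page.
feats = composition_features((0, 0, 50, 100), page_width=100, page_height=200)
print(feats)
```

Before feeding values like these to the model, check the feature order and normalization used in the repository, since a mismatch would silently degrade the compositional signal.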

## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (like a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** The embeddings can be used to find visually or semantically similar panels across a large database using cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
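
The retrieval use case can be sketched as a brute-force cosine-similarity search over stored embeddings. The helper names `cosine_similarity` and `top_k` are hypothetical, and the 2-d vectors stand in for real 512-d panel embeddings; a large database would more likely use a vectorized or approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, database, k=3):
    """Return (similarity, index) pairs for the k most similar stored vectors."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(database)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy database of three "panel embeddings" and one query.
database = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
results = top_k([1.0, 0.0], database, k=2)
print([idx for _, idx in results])
```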

## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).