---
language: en
license: mit
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` uses an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
   - **ResNet50**: Captures fine-grained, low-level texture and structural details.
   - *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A multi-layer perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the vision, text, and composition features, dynamically weighting them based on the presence and quality of each modality (e.g., it handles panels with no text gracefully).

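The gating step above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual implementation: the class name, layer sizes, and input dimensions (768-d vision, 384-d text, 64-d composition) are all assumed for the example.

```python
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Illustrative gated fusion of vision, text, and composition features.

    Each modality is projected to a shared dimension; a learned gate,
    conditioned on the features and a modality-presence mask, produces
    per-modality weights that sum to 1. Absent modalities get zero weight.
    """
    def __init__(self, vis_dim=768, txt_dim=384, comp_dim=64, out_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, out_dim)
        self.proj_t = nn.Linear(txt_dim, out_dim)
        self.proj_c = nn.Linear(comp_dim, out_dim)
        self.gate = nn.Sequential(
            nn.Linear(3 * out_dim + 3, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, vis, txt, comp, modality_mask):
        # Project each modality into the shared output space
        v, t, c = self.proj_v(vis), self.proj_t(txt), self.proj_c(comp)
        stacked = torch.stack([v, t, c], dim=1)              # (B, 3, out_dim)
        logits = self.gate(torch.cat([v, t, c, modality_mask], dim=-1))
        # Mask out absent modalities (e.g. a text-free panel) before softmax
        logits = logits.masked_fill(modality_mask == 0, float('-inf'))
        weights = torch.softmax(logits, dim=-1)              # (B, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (B, out_dim)

gate = GatedFusionSketch()
fused = gate(torch.randn(2, 768), torch.randn(2, 384), torch.randn(2, 64),
             torch.tensor([[1., 1., 1.], [1., 0., 1.]]))  # second panel has no text
print(fused.shape)  # torch.Size([2, 512])
```

Masking the gate logits to `-inf` before the softmax is what lets a missing modality contribute exactly zero weight rather than a small residual one.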
## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).

### Objectives
The encoder's fusion and adapter layers were trained from scratch (the base backbones remained frozen) using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.

## Usage

You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize the model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Composition: 7-d geometry vector per panel (e.g., aspect ratio, relative area,
# normalized bounding box coordinates); zeros here as a placeholder
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality mask [Vision, Text, Composition]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate the embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")  # Output: torch.Size([1, 512])
```

## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (such as a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** Can be used to find visually or semantically similar panels across a large database using cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.

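For the retrieval use case, cosine similarity over unit-normalized embeddings reduces to a single matrix product. The sketch below uses random placeholder data in place of real panel embeddings produced by the model:

```python
import torch
import torch.nn.functional as F

# Placeholder database of 512-d panel embeddings (stand-ins for model outputs)
db = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(1, 512), dim=-1)

# On unit vectors, cosine similarity is just a dot product
scores = query @ db.T                  # (1, 10000) similarities in [-1, 1]
top = torch.topk(scores, k=5, dim=-1)
print(top.indices)  # indices of the 5 most similar panels
```

For large collections, the same normalized vectors drop straight into an approximate nearest-neighbor index (e.g., FAISS with inner-product search).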
## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).