---
language: en
license: mit
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` uses an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
   - **ResNet50**: Captures fine-grained, low-level texture and structural details.
   - *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A multi-layer perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the vision, text, and composition features, dynamically weighting them based on the presence and quality of each modality (e.g., it handles panels with no text gracefully).

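The gating step above can be sketched in PyTorch. This is a minimal illustration, not the repository's actual implementation: the class name, layer sizes, and input dimensions (768-d vision, 384-d text, 64-d composition) are all assumed for the example.

```python
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Illustrative gated fusion of vision, text, and composition features.

    Each modality is projected to a shared dimension; a learned gate,
    conditioned on the features and a modality-presence mask, produces
    per-modality weights that sum to 1. Absent modalities get zero weight.
    """
    def __init__(self, vis_dim=768, txt_dim=384, comp_dim=64, out_dim=512):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, out_dim)
        self.proj_t = nn.Linear(txt_dim, out_dim)
        self.proj_c = nn.Linear(comp_dim, out_dim)
        self.gate = nn.Sequential(
            nn.Linear(3 * out_dim + 3, 128), nn.ReLU(), nn.Linear(128, 3)
        )

    def forward(self, vis, txt, comp, modality_mask):
        # Project each modality into the shared output space
        v, t, c = self.proj_v(vis), self.proj_t(txt), self.proj_c(comp)
        stacked = torch.stack([v, t, c], dim=1)              # (B, 3, out_dim)
        logits = self.gate(torch.cat([v, t, c, modality_mask], dim=-1))
        # Mask out absent modalities (e.g. a text-free panel) before softmax
        logits = logits.masked_fill(modality_mask == 0, float('-inf'))
        weights = torch.softmax(logits, dim=-1)              # (B, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)  # (B, out_dim)

gate = GatedFusionSketch()
fused = gate(torch.randn(2, 768), torch.randn(2, 384), torch.randn(2, 64),
             torch.tensor([[1., 1., 1.], [1., 0., 1.]]))  # second panel has no text
print(fused.shape)  # torch.Size([2, 512])
```

Masking the gate logits to `-inf` before the softmax is what lets a missing modality contribute exactly zero weight rather than a small residual one.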
## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).

### Objectives
The encoder's fusion and adapter layers were trained from scratch (the base backbones remained frozen) using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.

## Usage

You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize the model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Composition: 7-d geometry vector per panel (e.g., aspect ratio, relative area,
# normalized bounding box coordinates); zeros here as a placeholder
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality mask [Vision, Text, Composition]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate the embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")  # Output: torch.Size([1, 512])
```

## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (such as a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** Can be used to find visually or semantically similar panels across a large database using cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.

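For the retrieval use case, cosine similarity over unit-normalized embeddings reduces to a single matrix product. The sketch below uses random placeholder data in place of real panel embeddings produced by the model:

```python
import torch
import torch.nn.functional as F

# Placeholder database of 512-d panel embeddings (stand-ins for model outputs)
db = F.normalize(torch.randn(10_000, 512), dim=-1)
query = F.normalize(torch.randn(1, 512), dim=-1)

# On unit vectors, cosine similarity is just a dot product
scores = query @ db.T                  # (1, 10000) similarities in [-1, 1]
top = torch.topk(scores, k=5, dim=-1)
print(top.indices)  # indices of the 5 most similar panels
```

For large collections, the same normalized vectors drop straight into an approximate nearest-neighbor index (e.g., FAISS with inner-product search).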
## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).