---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` utilizes an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
   - **ResNet50**: Captures fine-grained, low-level texture and structural details.
   - *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A Multi-Layer Perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the Vision, Text, and Composition features, dynamically weighting them based on the presence/quality of the modalities (e.g., handles panels with no text gracefully).

## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).
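As an aside on the Adaptive Fusion Gate listed under Model Architecture, the sketch below shows one common way such a gate can be built: a masked softmax over per-modality logits, so that a missing modality (for example, a panel with no text) receives zero weight. This is an illustration only. The class name `GatedFusionSketch`, its layer layout, and the tensor shapes are assumptions and are not taken from the repository's implementation.

```python
import torch
import torch.nn as nn


class GatedFusionSketch(nn.Module):
    """Hypothetical stand-in for an adaptive fusion gate (not the repo's code)."""

    def __init__(self, dim: int = 512, num_modalities: int = 3):
        super().__init__()
        # One gate logit per modality, predicted from the concatenated features.
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats: torch.Tensor, modality_mask: torch.Tensor):
        # feats: (B, M, D) stacked [vision, text, composition] features
        # modality_mask: (B, M), 1.0 where the modality is present, else 0.0
        # Assumes at least one modality is present per row (otherwise NaN).
        present = modality_mask.bool()
        # Zero out missing modalities so they cannot influence the gate logits.
        feats = feats * modality_mask.unsqueeze(-1)
        logits = self.gate(feats.flatten(start_dim=1))        # (B, M)
        # Missing modalities get -inf, i.e. exactly zero weight after softmax.
        logits = logits.masked_fill(~present, float("-inf"))
        weights = torch.softmax(logits, dim=-1)               # (B, M)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)    # (B, D)
        return fused, weights


torch.manual_seed(0)
fusion = GatedFusionSketch(dim=512)
feats = torch.randn(2, 3, 512)
# Second panel has no text: its text weight is forced to zero.
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
fused, weights = fusion(feats, mask)
print(fused.shape)  # torch.Size([2, 512])
```

In this sketch the remaining weights are renormalized over the modalities that are present, which is one plausible reading of "dynamically weighting them based on the presence/quality of the modalities".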

### Objectives

The encoder's adapter and fusion layers were trained from scratch on top of frozen base backbones, using three simultaneous objectives:

1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.

## Usage

You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo for the framework class
# (src/version2 must be on the Python path)
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)        # (B=1, N=1, L)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)   # (B=1, N=1, L)

# Composition (e.g., Aspect Ratio, Area, Center X, Center Y)
# Placeholder zeros here: 7 composition features per panel
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")
# Output: torch.Size([1, 512])
```

## Intended Use & Limitations

- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (like a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** The embeddings can be used to find visually or semantically similar panels across a large database using cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.

## Citation

If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).
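
## Example: Retrieval with Cosine Similarity

To illustrate the retrieval use case mentioned under Intended Use & Limitations, here is a minimal sketch of a brute-force cosine-similarity search over a matrix of panel embeddings. The helper `top_k_similar` is a hypothetical name introduced for this example, and the random vectors stand in for real 512-d embeddings produced by the model. For large collections an approximate nearest-neighbor index would likely be preferable.

```python
import numpy as np


def top_k_similar(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Return (row indices, scores) of the k rows of `index` most similar to `query`."""
    # L2-normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    db = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = db @ q                    # (N,) cosine similarity to each panel
    order = np.argsort(-scores)[:k]    # highest similarity first
    return order, scores[order]


# Placeholder data: 1000 random vectors in place of real panel embeddings.
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 512)).astype(np.float32)

# A slightly perturbed copy of panel 42 acts as the query.
query = index[42] + 0.01 * rng.normal(size=512).astype(np.float32)

ids, sims = top_k_similar(query, index, k=3)
print(ids[0])  # 42
```

Normalizing once and storing the normalized matrix avoids repeating the division for every query.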