AI & ML interests
VLM, Geo Reasoning, Visual Retrieval
TLV R&D VLMs for Image Retrieval and Visual Reasoning
Vision-Language Retrieval Models
| Model Name | Model Type | Base Model | Training Set | Owner | Link | Frozen Parameters |
|---|---|---|---|---|---|---|
| ImiClip | CLIP | openai/clip-vit-base-patch32 | DM | Etzion | TLVLM/ImiClip | Vision Encoder |
| ImiClip_v2 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v2 | Vision Encoder |
| ImiClip_v3 | CLIP | openai/clip-vit-base-patch32 | DM + RSICD | Etzion | TLVLM/ImiClip_v3 | ❌ |
| ImiGlip | SigLIP | google/siglip-so400m-patch14-384 | DM | Etzion | TLVLM/ImiGlip | Vision Encoder |
| ImiGlip_V2 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V2 | Vision Encoder |
| ImiGlip_V3 | SigLIP | google/siglip-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip_V3 | ❌ |
| ImiGlip2 | SigLIP2 | google/siglip2-so400m-patch14-384 | DM + RSICD | Etzion | TLVLM/ImiGlip2 | Both Encoders + Logits |
| ImiGlip2n | SigLIP2 | google/siglip2-so400m-patch16-naflex | DM + RSICD | Etzion | TLVLM/ImiGlip2n | Both Encoders + Logits |
Runtime
| Model Type | Base Model | Time per Single Text (s) | Time per Single Image (s) | Time per 10,000 Texts (s) | Time per 10,000 Images (s) |
|---|---|---|---|---|---|
| CLIP | openai/clip-vit-base-patch32 | 0.0129 | 0.0101 | 129.4 | 100.8 |
| SigLIP (1+2) | google/siglip-so400m-patch14-384 | 0.0578 | 0.0189 | 577.5 | 188.9 |
| SigLIP2n | google/siglip2-so400m-patch16-naflex | 0.0257 | 0.0189 | 257.0 | 188.6 |
Important notes:
- All times are reported in seconds.
- All measurements were run on an NVIDIA A40 GPU.
- Average text length: 633±93 characters.
- Average image size: $536^2$ pixels.
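The per-item timings above can be used for a rough estimate of how long it takes to encode a larger corpus. The following is a minimal sketch, assuming encoding time scales linearly with the number of items (batching, I/O, and different text lengths or image sizes are ignored). The dictionary values are copied from the Runtime table; the names `PER_ITEM_SECONDS` and `estimate_seconds` are hypothetical and not part of any released code.

```python
# Per-item latencies in seconds, copied from the Runtime table (NVIDIA A40).
PER_ITEM_SECONDS = {
    "CLIP": {"text": 0.0129, "image": 0.0101},
    "SigLIP (1+2)": {"text": 0.0578, "image": 0.0189},
    "SigLIP2n": {"text": 0.0257, "image": 0.0189},
}


def estimate_seconds(model_type: str, n_texts: int = 0, n_images: int = 0) -> float:
    """Linearly extrapolate total encoding time from the per-item latencies.

    Assumption: cost grows proportionally with the number of items, which is
    only an approximation of real throughput.
    """
    timing = PER_ITEM_SECONDS[model_type]
    return n_texts * timing["text"] + n_images * timing["image"]


if __name__ == "__main__":
    # Example workload (illustrative numbers): 1,000 text queries
    # against a gallery of 100,000 images.
    for name in PER_ITEM_SECONDS:
        total = estimate_seconds(name, n_texts=1_000, n_images=100_000)
        print(f"{name}: ~{total / 60:.1f} min")
```

Under this assumption the estimate for 10,000 items closely reproduces the last two columns of the table, so it is best read as a back-of-the-envelope figure for hardware comparable to the A40.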
Collections
The models are grouped into the following collections:
- CLIP-based fine-tuned models: TLVLM/clips
- SigLIP-based fine-tuned models: TLVLM/siglips
- SigLIP 2-based fine-tuned models: TLVLM/siglips2