---
license: apache-2.0
tags:
- vision
widget:
- src: >-
    https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg
  candidate_labels: bee in the sky, bee on the flower
  example_title: Bee
library_name: transformers
pipeline_tag: zero-shot-image-classification
---

# SigLIP 2 Base

[SigLIP 2](https://huggingface.co/papers/2502.14786) extends the pretraining objective of
[SigLIP](https://huggingface.co/papers/2303.15343) with prior, independently developed techniques
into a unified recipe for improved semantic understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and
image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline

# load the zero-shot image classification pipeline
ckpt = "google/siglip2-base-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# image URL and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference (the pipeline accepts a URL, a local path, or a PIL image)
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
```
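
If you prefer to call the model directly instead of going through the pipeline, the following is a minimal sketch of the same zero-shot setup. It assumes the generic `AutoModel`/`AutoProcessor` interfaces and carries over the SigLIP-style `padding="max_length", max_length=64` text padding; adjust these if they do not match this checkpoint's processor defaults.

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load model and processor (same checkpoint as above)
ckpt = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(ckpt).eval()
processor = AutoProcessor.from_pretrained(ckpt)

# prepare the image and candidate labels
image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
candidate_labels = ["2 cats", "a plane", "a remote"]
# padding="max_length"/max_length=64 follows the SigLIP recipe (assumption for this checkpoint)
inputs = processor(text=candidate_labels, images=image,
                   padding="max_length", max_length=64, return_tensors="pt")

# forward pass; SigLIP scores image-text pairs with a sigmoid rather than a softmax
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.sigmoid(outputs.logits_per_image)  # shape: (num_images, num_labels)

for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{p:.2%} that the image is '{label}'")
```

Because each image-text pair is scored independently with a sigmoid, the probabilities do not need to sum to 1 across the candidate labels.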

You can encode an image using the Vision Tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-base-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
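
For image-text retrieval, you can encode text with the text tower in the same way and rank captions by cosine similarity. The snippet below is a minimal sketch that reuses `model`, `processor`, and `image_embeddings` from the example above; the candidate captions and the `padding="max_length", max_length=64` setting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# candidate captions to rank against the image (illustrative examples)
texts = ["a photo of a dog", "a photo of a bench", "a photo of an airplane"]
text_inputs = processor(text=texts, padding="max_length", max_length=64,
                        return_tensors="pt").to(model.device)

# encode the captions with the text tower
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

# cosine similarity between the image and each caption (higher = better match)
similarity = F.normalize(image_embeddings, dim=-1) @ F.normalize(text_embeddings, dim=-1).T
print(similarity)
```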

For more code examples, we refer to the [siglip documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

## Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP (an illustrative sketch of the base sigmoid loss that these build on follows the list):

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability

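For intuition, here is a minimal re-implementation sketch of the pairwise sigmoid loss from the original SigLIP paper, which SigLIP 2 keeps as its base objective. This is illustrative only, not the training code used for this release; the batch size and embedding dimension are placeholders, and the temperature/bias initializations follow the values reported in the SigLIP paper.

```python
import math
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, t_prime, b):
    """Pairwise sigmoid loss: matched image-text pairs sit on the diagonal."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t_prime.exp() + b   # (n, n) image-text logits
    labels = 2 * torch.eye(logits.size(0)) - 1         # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy example with placeholder shapes
n, d = 8, 768                                          # assumed batch size / embedding dim
img_emb, txt_emb = torch.randn(n, d), torch.randn(n, d)
t_prime = torch.tensor(math.log(10.0))                 # log-temperature init from the paper
b = torch.tensor(-10.0)                                # bias init from the paper
print(siglip_sigmoid_loss(img_emb, txt_emb, t_prime, b))
```
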
### Training data

SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

![Evaluation Table](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/sg2-blog/eval_table.png)

### BibTeX entry and citation info

```bibtex
@misc{tschannen2025siglip2multilingualvisionlanguage,
      title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
      author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
      year={2025},
      eprint={2502.14786},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.14786},
}
```