| --- |
| license: apache-2.0 |
| tags: |
| - vision |
| - image-classification |
| datasets: |
| - imagenet-1k |
| widget: |
| - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg |
| example_title: Tiger |
| - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/teapot.jpg |
| example_title: Teapot |
| - src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg |
| example_title: Palace |
| --- |
| |
| # Convolutional Vision Transformer (CvT) |
|
|
| CvT-21 model pre-trained on ImageNet-1k at resolution 224x224. It was introduced in the paper [CvT: Introducing Convolutions to Vision Transformers](https://arxiv.org/abs/2103.15808) by Wu et al. and first released in [this repository](https://github.com/microsoft/CvT). |
|
|
| Disclaimer: The team releasing CvT did not write a model card for this model so this model card has been written by the Hugging Face team. |
|
|
| ## Usage |
|
|
| Here is how to use this model to classify an image of the COCO 2017 dataset into one of the 1,000 ImageNet classes: |
|
|
| ```python |
| from transformers import AutoFeatureExtractor, CvtForImageClassification |
| from PIL import Image |
| import requests |
| |
| url = 'http://images.cocodataset.org/val2017/000000039769.jpg' |
| image = Image.open(requests.get(url, stream=True).raw) |
| |
| feature_extractor = AutoFeatureExtractor.from_pretrained('microsoft/cvt-21') |
| model = CvtForImageClassification.from_pretrained('microsoft/cvt-21') |
| |
| inputs = feature_extractor(images=image, return_tensors="pt") |
| outputs = model(**inputs) |
| logits = outputs.logits |
| # model predicts one of the 1000 ImageNet classes |
| predicted_class_idx = logits.argmax(-1).item() |
| print("Predicted class:", model.config.id2label[predicted_class_idx]) |
| ``` |
| ``` |