---
license: mit
language:
- en
base_model:
- facebook/dinov2-base
- facebook/dinov2-small
tags:
- computer_vision
---

# Near, far: Patch-ordering enhances vision foundation models' scene understanding

Welcome to the Hugging Face repository for **NeCo**, an adapted vision encoder that captures the fine-grained details and structural information essential for keypoint matching, semantic segmentation, and other dense prediction tasks. This repository hosts pretrained NeCo checkpoints for easy integration into your projects.

The paper presenting this work:
**"Near, far: Patch-ordering enhances vision foundation models' scene understanding"**
*[Valentinos Pariza](https://vpariza.github.io), [Mohammadreza Salehi](https://smsd75.github.io), [Gertjan J. Burghouts](https://gertjanburghouts.github.io), [Francesco Locatello](https://www.francescolocatello.com/), [Yuki M. Asano](https://yukimasano.github.io)*

🌐 **[Project Page](https://vpariza.github.io/NeCo/)**
⌨️ **[GitHub Repository](https://github.com/vpariza/NeCo)**
📄 **[Read the Paper on arXiv](https://arxiv.org/abs/2408.11054)**

## Model Details

### Model Description

NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures the fine-grained details and structural information that are crucial for dense downstream tasks such as semantic segmentation.

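To make the idea concrete, below is a minimal, illustrative sketch of a patch-neighbor consistency objective. It is not the exact NeCo loss (the paper enforces agreement on the *ordering* of patch neighbors; see the paper and repository for the actual objective); this simplified version just makes the student match the teacher's patch-neighbor similarity distribution:

```python
import torch
import torch.nn.functional as F

def neighbor_consistency_loss(student_patches, teacher_patches, temperature=0.1):
    """Illustrative only: align how student and teacher relate patch neighbors.

    student_patches, teacher_patches: (B, N, D) patch embeddings of one image.
    NeCo itself compares *orderings* of neighbor similarities; here we
    approximate that with a cross-entropy between similarity distributions.
    """
    s = F.normalize(student_patches, dim=-1)
    t = F.normalize(teacher_patches, dim=-1)
    sim_s = s @ s.transpose(1, 2) / temperature  # (B, N, N) student similarities
    sim_t = t @ t.transpose(1, 2) / temperature  # (B, N, N) teacher similarities
    target = F.softmax(sim_t, dim=-1).detach()   # teacher defines the target neighbors
    log_pred = F.log_softmax(sim_s, dim=-1)
    return -(target * log_pred).sum(dim=-1).mean()
```
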
- **Model type:** Vision encoder (DINO, DINOv2, ...)
- **License:** MIT
- **Finetuned from model:** DINOv2, DINOv2 with registers, DINO, ...

## How to Get Started with the Model

To use NeCo models on downstream dense prediction tasks, you only need `torch` and `timm` installed (`pip install torch timm`). Depending on which checkpoint you use, load it as follows:

The models can be downloaded from our [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main).

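If you prefer to fetch a checkpoint programmatically, here is a small sketch using `huggingface_hub` (the filename is a placeholder; pick an actual checkpoint from the repo listing):

```python
from huggingface_hub import hf_hub_download

# Download a checkpoint from the NeCo repo and get its local path.
# Replace the filename with one of the checkpoints listed in the repo.
path_to_checkpoint = hf_hub_download(
    repo_id="FunAILab/NeCo",
    filename="<checkpoint filename from the repo>",
)
```
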
#### Models post-trained from DINOv2 (DINOv2 architecture)

##### NeCo on DINOv2
```python
import torch

# Use 'dinov2_vitb14' for the base model, as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
##### NeCo on DINOv2 with Registers
```python
import torch

# Use 'dinov2_vitb14_reg' for the base model, as described in:
# https://github.com/facebookresearch/dinov2
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
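Once loaded, the backbone yields dense patch features for downstream tasks. A minimal sketch, assuming the ViT-S/14 model from above (DINOv2's torch.hub models expose `get_intermediate_layers`):

```python
import torch

model.eval()
# Input sides must be multiples of the 14-pixel patch size, e.g. 518 = 37 * 14.
x = torch.randn(1, 3, 518, 518)
with torch.no_grad():
    # reshape=True returns a (B, C, H/14, W/14) feature map instead of a token sequence.
    feats = model.get_intermediate_layers(x, n=1, reshape=True)[0]
print(feats.shape)  # torch.Size([1, 384, 37, 37]) for ViT-S/14
```

These dense features can then feed, e.g., a linear segmentation head or nearest-neighbor keypoint matching.
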
#### Models post-trained from DINO or similar (DINO architecture)
##### timm ViT-Small and ViT-Base architectures
```python
import torch
from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224

# Change to vit_base_patch16_224() if you want to use our larger model
model = vit_small_patch16_224()
path_to_checkpoint = "<your path to downloaded ckpt>"
state_dict = torch.load(path_to_checkpoint, map_location='cpu')
model.load_state_dict(state_dict, strict=False)
```
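As with the DINOv2 variants, patch tokens can be extracted for dense prediction. A sketch assuming a recent `timm` version, where `forward_features` returns the normalized token sequence with the [CLS] token first:

```python
import torch

model.eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    tokens = model.forward_features(x)  # (1, 197, 384) for ViT-S/16: [CLS] + 14x14 patches
patch_tokens = tokens[:, 1:, :]                             # drop the [CLS] token
fmap = patch_tokens.transpose(1, 2).reshape(1, -1, 14, 14)  # (1, C, 14, 14) feature map
```
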
**Note:** To load model weights directly from a Hugging Face URL, run:
```python
import torch
state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>", map_location='cpu')
```
and then load them into the model with `model.load_state_dict(state_dict, strict=False)` as above.

## Training Details

### Training Data

* We have post-trained our models on the **COCO dataset**.

### Training Procedure

Please see our [repository](https://github.com/vpariza/NeCo) and our paper for more details.

## Environmental Impact

- **Hardware Type:** NVIDIA A100 GPU
- **Hours used:** 18 (per model)
- **Compute Provider:** Helma (NHR@FAU, Germany) and Snellius (The Netherlands)
- **Compute Region:** Europe (Germany & the Netherlands)

## Citation

**BibTeX:**
```bibtex
@inproceedings{
pariza2025near,
title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=Qro97zWC29}
}
```