nielsr HF Staff committed on
Commit 2cf6f9c · verified · 1 Parent(s): 4bc248b

Improve model card: Add pipeline tag, library name, abstract, results, and GitHub link


This PR significantly enhances the model card by:

- Adding `pipeline_tag: video-text-to-text` to the metadata, which categorizes the model on the Hub so it can be discovered under the appropriate pipeline (https://huggingface.co/models?pipeline_tag=video-text-to-text).
- Including `library_name: transformers` in the metadata, indicating compatibility with the 🤗 Transformers library and enabling an automated "How to use" widget on the model page (see the sketch after this list).
- Adding a direct link to the GitHub repository: https://github.com/BASHLab/RAVEN.
- Incorporating the paper's abstract for better context.
- Adding the "Main Results", "AVS-QA Dataset", and "Model Zoo" sections from the official GitHub README, providing crucial information about performance, data, and available variants.
- Renaming the "Inference" section to "Sample Usage" for clarity and consistency with Hugging Face conventions, while retaining the original code snippet.
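
For reference, the two metadata fields above can be checked programmatically once this PR is merged. A minimal sketch using `huggingface_hub`; the `BASH-Lab/RAVEN-AV-7B` repo id is taken from the Model Zoo table in the diff below and is only an assumption about which repo this card belongs to:

```python
# Minimal sketch: read the model-card metadata this PR adds.
# Assumption: BASH-Lab/RAVEN-AV-7B (from the Model Zoo table) is the target repo.
from huggingface_hub import ModelCard

card = ModelCard.load("BASH-Lab/RAVEN-AV-7B")
print(card.data.pipeline_tag)   # expected "video-text-to-text" once merged
print(card.data.library_name)   # expected "transformers" once merged
print(card.data.license)        # "apache-2.0" (already present)
```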

Please review and merge this PR if these changes are appropriate.
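
The renamed "Sample Usage" section keeps the original shell-based download steps (STEP 1 and STEP 2 in the diff). A minimal Python sketch of those two steps, under the same assumption that `BASH-Lab/RAVEN-AV-7B` from the Model Zoo table is the checkpoint a reviewer would fetch:

```python
# Sketch of STEP 1-2 from the "Sample Usage" section using huggingface_hub.
# Assumption: the RAVEN repo id comes from the Model Zoo table, not this PR.
from huggingface_hub import snapshot_download

# STEP 1: vision encoder used by RAVEN
siglip_dir = snapshot_download("google/siglip-so400m-patch14-384")

# STEP 2: RAVEN checkpoint
raven_dir = snapshot_download("BASH-Lab/RAVEN-AV-7B")

print(siglip_dir, raven_dir)
```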

Files changed (1)
README.md +31 -2
README.md CHANGED
@@ -1,6 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---
+
 <p align="center">
 <img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
 <p>
@@ -16,13 +19,33 @@ license: apache-2.0
 <a href="https://bashlab.github.io/raven_project/" style="color:#825987">
 https://bashlab.github.io/raven_project/
 </a>
+&bull; Code:
+<a href="https://github.com/BASHLab/RAVEN" style="color:#825987">
+https://github.com/BASHLab/RAVEN
+</a>
 </h5>
 <p align="center">
 <img src="./assets/raven_architecture.png" width="800" />
 <p>
 
 ---
-
+
+## Abstract
+Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at this https URL .
+
+---
+## 🚀 Main Results
+##### Comparison of **RAVEN** and prior MLLMs on *exocentric* open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.
+<p><img src="./assets/main_result_exo.png" width="800"></p>
+
+##### Comparison of **RAVEN** with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. **RAVEN** outperforms across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.
+<p><img src="./assets/main_result_ego.png" width="800"></p>
+
+---
+## 📁 **AVS-QA** Dataset
+Train and test split of **AVS-QA** is provided [here](./avs-qa-dataset/).<br>
+More details [here](./avs-qa-dataset/README.md).
+
 ## 🛠️ Requirements and Installation
 Basic Dependencies:
 * Python >= 3.8
@@ -40,7 +63,13 @@ apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
 ```
 ---
 
-## 🤖 Inference
+## 🍀 Model Zoo
+| Model Name | Modal Type |
+|:----------------|:------------:|
+| [RAVEN-7B-AV](https://huggingface.co/BASH-Lab/RAVEN-AV-7B)| AV |
+| RAVEN-7B-AVS| AVS |
+
+## 🤖 Sample Usage
 - **STEP 1:** Download $\texttt{siglip-so400m-patch14-384}$ from here [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
 - **STEP 2:** Download **RAVEN** checkpoint
 ```bash