nielsr HF Staff committed on
Commit 2cf6f9c · verified · 1 Parent(s): 4bc248b

Improve model card: Add pipeline tag, library name, abstract, results, and GitHub link


This PR significantly enhances the model card by:

- Adding `pipeline_tag: video-text-to-text` to the metadata, which categorizes the model on the Hub so it can be discovered under the appropriate pipeline (https://huggingface.co/models?pipeline_tag=video-text-to-text).
- Including `library_name: transformers` in the metadata, indicating compatibility with the 🤗 Transformers library and enabling an automated "How to use" widget on the model page (see the sketch after this list).
- Adding a direct link to the GitHub repository: https://github.com/BASHLab/RAVEN.
- Incorporating the paper's abstract for better context.
- Adding the "Main Results", "AVS-QA Dataset", and "Model Zoo" sections from the official GitHub README, providing crucial information about performance, data, and available variants.
- Renaming the "Inference" section to "Sample Usage" for clarity and consistency with Hugging Face conventions, while retaining the original code snippet.
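
For reference, the two metadata fields above can be checked programmatically once this PR is merged. A minimal sketch using `huggingface_hub`; the `BASH-Lab/RAVEN-AV-7B` repo id is taken from the Model Zoo table in the diff below and is only an assumption about which repo this card belongs to:

```python
# Minimal sketch: read the model-card metadata this PR adds.
# Assumption: BASH-Lab/RAVEN-AV-7B (from the Model Zoo table) is the target repo.
from huggingface_hub import ModelCard

card = ModelCard.load("BASH-Lab/RAVEN-AV-7B")
print(card.data.pipeline_tag)   # expected "video-text-to-text" once merged
print(card.data.library_name)   # expected "transformers" once merged
print(card.data.license)        # "apache-2.0" (already present)
```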

Please review and merge this PR if these changes are appropriate.
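
The renamed "Sample Usage" section keeps the original shell-based download steps (STEP 1 and STEP 2 in the diff). A minimal Python sketch of those two steps, under the same assumption that `BASH-Lab/RAVEN-AV-7B` from the Model Zoo table is the checkpoint a reviewer would fetch:

```python
# Sketch of STEP 1-2 from the "Sample Usage" section using huggingface_hub.
# Assumption: the RAVEN repo id comes from the Model Zoo table, not this PR.
from huggingface_hub import snapshot_download

# STEP 1: vision encoder used by RAVEN
siglip_dir = snapshot_download("google/siglip-so400m-patch14-384")

# STEP 2: RAVEN checkpoint
raven_dir = snapshot_download("BASH-Lab/RAVEN-AV-7B")

print(siglip_dir, raven_dir)
```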

Files changed (1)
README.md +31 -2
README.md CHANGED
@@ -1,6 +1,9 @@
 ---
 license: apache-2.0
+pipeline_tag: video-text-to-text
+library_name: transformers
 ---
+
 <p align="center">
 <img src="./assets/raven_logo.png" width="100" style="margin-bottom: 0.2;"/>
 <p>
@@ -16,13 +19,33 @@ license: apache-2.0
 <a href="https://bashlab.github.io/raven_project/" style="color:#825987">
 https://bashlab.github.io/raven_project/
 </a>
+&bull; Code:
+<a href="https://github.com/BASHLab/RAVEN" style="color:#825987">
+https://github.com/BASHLab/RAVEN
+</a>
 </h5>
 <p align="center">
 <img src="./assets/raven_architecture.png" width="800" />
 <p>
 
 ---
-
+
+## Abstract
+Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5% and 8.0% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23%. Our code and dataset are available at this https URL .
+
+---
+## 🚀 Main Results
+##### Comparison of **RAVEN** and prior MLLMs on *exocentric* open-ended video QA (MSVD-QA, MSRVTT-QA, ActivityNet-QA) and audio-visual QA (AVSD, MUSIC-QA) benchmarks. Best and second-best scores are in $\textbf{Bold}$ and $\underline{\text{underline}}$. $^*$ indicates scores reproduced by us.
+<p><img src="./assets/main_result_exo.png" width="800"></p>
+
+##### Comparison of **RAVEN** with MLLMs on the EgoThink (Reasoning) and AVS-QA benchmarks. **RAVEN** outperforms across metrics and excels in reasoning. $\textbf{Bold}$ and $\underline{\text{underline}}$ indicate the best and second-best scores.
+<p><img src="./assets/main_result_ego.png" width="800"></p>
+
+---
+## 📁 **AVS-QA** Dataset
+Train and test split of **AVS-QA** is provided [here](./avs-qa-dataset/).<br>
+More details [here](./avs-qa-dataset/README.md).
+
 ## 🛠️ Requirements and Installation
 Basic Dependencies:
 * Python >= 3.8
@@ -40,7 +63,13 @@ apt-get update && apt-get install ffmpeg libsm6 libxext6 -y
 ```
 ---
 
-## 🤖 Inference
+## 🍀 Model Zoo
+| Model Name | Modal Type |
+|:----------------|:------------:|
+| [RAVEN-7B-AV](https://huggingface.co/BASH-Lab/RAVEN-AV-7B)| AV |
+| RAVEN-7B-AVS| AVS |
+
+## 🤖 Sample Usage
 - **STEP 1:** Download $\texttt{siglip-so400m-patch14-384}$ from here [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
 - **STEP 2:** Download **RAVEN** checkpoint
 ```bash