| Model | Open-Ended VQA (% Human Rating) | Multiple-Choice VQA (% Accuracy) | Multiple-Choice VQA + Hints (% Accuracy) | Multiple-Choice VQA + Attributions (% Accuracy) | Reference-Based Auto-Eval (Judge Accuracy vs. Human Ratings) | Reference-Free Auto-Eval (Judge Accuracy vs. Human Ratings) | Auto-Eval (% Auto-Rater Rating) | Auto-Eval + Hints (% Auto-Rater Rating) | Auto-Eval + Attributions (% Auto-Rater Rating) |
|---|---|---|---|---|---|---|---|---|---|
| Humans | 82 | 78 | | | | | | | |
| Gemini Pro 1.5 | 40 | 38 | 66 | 72 | 87 | 52 | 53 | 62 | 29 |
| Gemini Pro Vision | 30 | 41 | 62 | 75 | | | 38 | 34 | 47 |
| GPT4 | 34 | 45 | 69 | 82 | 86 | 51 | 38 | 61 | 25 |
| LLaVA-1.6-34B | 15 | 24 | 30 | 76 | | | 43 | 21 | 16 |
| LLaVA-1.5-7B | 13 | 17 | 29 | 70 | | | 35 | 19 | 30 |
| InstructBLIP | 13 | 20 | 28 | | | | | | |
| Gemini Pro 1.5 Caption → Gemini Pro 1.5 | 23 | | | | | | | | |
| Human (Oracle) Caption → Gemini Pro 1.5 | 50 | | | | | | | | |
| Claude 3.5 Sonnet | | | | | | | 46 | 45 | 39 |
| GPT4o | | | | | | | 55 | 83 | 50 |
| Qwen-VL-Max | | | | | | | 35 | 53 | 26 |
| Molmo-7B | | | | | | | 34 | 42 | 36 |