How (not) to ensemble LVLMs for VQA


Authors

Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, Thomas Mensink

Abstract

This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method for combining different models to obtain increased performance. In recent work on Encyclopedic-VQA, the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models that include the caption as extra context, to models augmented with Lens-based retrieval of Wikipedia pages. Intuitively, these models are highly complementary, which should make them ideal candidates for ensembling. Indeed, an oracle experiment shows potential gains from 48.8% accuracy (the best single model) all the way up to 67% (the best possible ensemble). So it should be a trivial exercise to create an ensemble with substantial real gains. Or is it?
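The oracle experiment mentioned above measures an upper bound: a question counts as solved if any model in the pool answers it correctly. Below is a minimal sketch, not taken from the paper, of how such an upper bound is typically computed; the model names and per-question correctness flags are purely illustrative.

```python
# Illustrative sketch (assumptions, not the paper's code): per-question
# correctness flags for three hypothetical models on the same questions.
predictions = {
    "vanilla_lvlm":      [1, 0, 0, 1, 0],
    "lvlm_with_caption": [0, 1, 0, 1, 0],
    "lvlm_with_lens":    [0, 0, 1, 1, 0],
}

def accuracy(correct_flags):
    """Fraction of questions answered correctly."""
    return sum(correct_flags) / len(correct_flags)

def oracle_accuracy(preds):
    """Oracle ensemble upper bound: a question is solved if any model gets it right."""
    per_question = zip(*preds.values())
    return accuracy([int(any(flags)) for flags in per_question])

best_single = max(accuracy(flags) for flags in predictions.values())
print(f"best single model: {best_single:.1%}")
print(f"oracle ensemble:   {oracle_accuracy(predictions):.1%}")
```

The gap between the best single model and the oracle number only shows what a perfect per-question selector could achieve; realizing any of that gain requires an actual selection or combination strategy, which is the question the paper investigates.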
