Image captioning for effective use of language models in knowledge-based visual question answering
Date
2023-02
Author(s)
Salaberria Saizar, Ander
Azkune Galparsoro, Gorka
Expert Systems with Applications 212 (2023), Article ID 118669
Abstract
Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be compensated by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
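
The caption-then-answer pipeline described above can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers library; the captioning and language model checkpoints named here (nlpconnect/vit-gpt2-image-captioning, google/flan-t5-base) and the prompt format are illustrative assumptions, not the exact components or prompts used in the paper.

# Minimal sketch of the caption-then-answer pipeline (illustrative models, not the paper's setup).
from transformers import pipeline

# Off-the-shelf image captioner: verbalizes the image contents as text.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Pretrained text-only language model used to answer the question.
qa_model = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_question(image_path: str, question: str) -> str:
    # 1) Verbalize the image with an automatic caption.
    caption = captioner(image_path)[0]["generated_text"]
    # 2) Feed the caption and the question to the language model, which can
    #    draw on the implicit world knowledge acquired during pretraining.
    prompt = f"Context: {caption} Question: {question} Answer:"
    return qa_model(prompt, max_new_tokens=10)[0]["generated_text"]

print(answer_question("example.jpg", "What country is this dish typically from?"))

Because the question is answered from text alone, the language model's pretraining knowledge can supply the external facts that OK-VQA requires, even when the caption only partially describes the image.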