dc.contributor.author | Salaberria Saizar, Ander | |
dc.contributor.author | Azkune Galparsoro, Gorka | |
dc.contributor.author | López de Lacalle Lecuona, Oier | |
dc.contributor.author | Soroa Echave, Aitor | |
dc.contributor.author | Agirre Bengoa, Eneko | |
dc.date.accessioned | 2023-02-15T18:00:45Z | |
dc.date.available | 2023-02-15T18:00:45Z | |
dc.date.issued | 2023-02 | |
dc.identifier.citation | Expert Systems with Applications 212 (2023), Article ID 118669 | es_ES |
dc.identifier.issn | 0957-4174 | |
dc.identifier.issn | 1873-6793 | |
dc.identifier.uri | http://hdl.handle.net/10810/59873 | |
dc.description.abstract | Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks. | es_ES |
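The abstract describes a caption-then-answer pipeline: an off-the-shelf captioner verbalizes the image, and a text-only pretrained language model answers the question from that caption. Below is a minimal Python sketch of that idea, assuming the Hugging Face transformers library; the model names and the prompt format are illustrative placeholders, not the authors' exact setup.

    # Sketch of the caption-then-answer pipeline described in the abstract.
    # Model choices and prompt format are assumptions for illustration only.
    from transformers import pipeline

    # Off-the-shelf image captioner verbalizes the image contents.
    captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

    # Text-only pretrained language model answers from the caption alone.
    qa_model = pipeline("text2text-generation", model="t5-base")

    def answer(image_path: str, question: str) -> str:
        # Verbalize the image, then let the language model reason over text only.
        caption = captioner(image_path)[0]["generated_text"]
        prompt = f"question: {question} context: {caption}"
        return qa_model(prompt)[0]["generated_text"]

    # Example: a knowledge-based question the caption alone may not state explicitly.
    print(answer("photo.jpg", "What country is this landmark in?"))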
dc.description.sponsorship | Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is partially supported by the Ministry of Science and Innovation of the Spanish Government (DeepKnowledge project PID2021-127777OB-C21) and the Basque Government (IXA excellence research group IT1570-22). | es_ES |
dc.language.iso | eng | es_ES |
dc.publisher | Elsevier | es_ES |
dc.relation | info:eu-repo/grantAgreement/MICINN/PID2021-127777OB-C21 | es_ES |
dc.rights | info:eu-repo/semantics/openAccess | es_ES |
dc.rights.uri | http://creativecommons.org/licenses/by/3.0/es/ | * |
dc.subject | visual question answering | es_ES |
dc.subject | image captioning | es_ES |
dc.subject | language models | es_ES |
dc.subject | deep learning | es_ES |
dc.title | Image captioning for effective use of language models in knowledge-based visual question answering | es_ES |
dc.type | info:eu-repo/semantics/article | es_ES |
dc.rights.holder | © 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). | es_ES |
dc.rights.holder | Attribution 3.0 Spain | * |
dc.relation.publisherversion | https://www.sciencedirect.com/science/article/pii/S0957417422017055 | es_ES |
dc.identifier.doi | 10.1016/j.eswa.2022.118669 | |
dc.departamentoes | Ciencia de la computación e inteligencia artificial | es_ES |
dc.departamentoes | Lenguajes y sistemas informáticos | es_ES |
dc.departamentoeu | Hizkuntza eta sistema informatikoak | es_ES |
dc.departamentoeu | Konputazio zientziak eta adimen artifiziala | es_ES |