Show simple item record

dc.contributor.author: Salaberria Saizar, Ander
dc.contributor.author: Azkune Galparsoro, Gorka
dc.contributor.author: López de Lacalle Lecuona, Oier
dc.contributor.author: Soroa Echave, Aitor
dc.contributor.author: Agirre Bengoa, Eneko
dc.date.accessioned: 2023-02-15T18:00:45Z
dc.date.available: 2023-02-15T18:00:45Z
dc.date.issued: 2023-02
dc.identifier.citation: Expert Systems with Applications 212 : (2023) // Article ID 118669
dc.identifier.issn: 0957-4174
dc.identifier.issn: 1873-6793
dc.identifier.uri: http://hdl.handle.net/10810/59873
dc.description.abstract: Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. More specifically, we verbalize the image contents and allow language models to better leverage their implicit knowledge to solve knowledge-intensive tasks. Focusing on a visual question answering task which requires external knowledge (OK-VQA), our contributions are: (i) a text-only model that outperforms pretrained multimodal (image-text) models with a comparable number of parameters; (ii) confirmation that our text-only method is especially effective for tasks requiring external knowledge, as it is less effective on a standard VQA task (VQA 2.0); and (iii) our method attains state-of-the-art results when the size of the language model is increased. We also significantly outperform current multimodal systems, even when they are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.
dc.description.sponsorship: Ander is funded by a PhD grant from the Basque Government (PRE_2021_2_0143). This work is partially supported by the Ministry of Science and Innovation of the Spanish Government (DeepKnowledge project PID2021-127777OB-C21) and by the Basque Government (IXA excellence research group IT1570-22).
dc.language.iso: eng
dc.publisher: Elsevier
dc.relation: info:eu-repo/grantAgreement/MICINN/PID2021-127777OB-C21
dc.rights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by/3.0/es/
dc.subject: visual question answering
dc.subject: image captioning
dc.subject: language models
dc.subject: deep learning
dc.title: Image captioning for effective use of language models in knowledge-based visual question answering
dc.type: info:eu-repo/semantics/article
dc.rights.holder: © 2023 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
dc.rights.holder: Attribution 3.0 Spain
dc.relation.publisherversion: https://www.sciencedirect.com/science/article/pii/S0957417422017055?via%3Dihub
dc.identifier.doi: 10.1016/j.eswa.2022.118669
dc.departamentoes: Ciencia de la computación e inteligencia artificial
dc.departamentoes: Lenguajes y sistemas informáticos
dc.departamentoeu: Hizkuntza eta sistema informatikoak
dc.departamentoeu: Konputazio zientziak eta adimen artifiziala
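
The abstract above describes a caption-then-answer pipeline: an off-the-shelf model verbalizes the image, and a text-only language model answers the knowledge-intensive question from that caption. The sketch below illustrates this idea under assumed components; the model names, prompt format, and image path are generic stand-ins, not the specific models or prompts used in the paper.

# Minimal sketch of the caption-then-answer pipeline described in the abstract.
# Model names and the prompt layout are illustrative assumptions, not the paper's setup.
from transformers import pipeline

# 1) Verbalize the image with an off-the-shelf captioning model.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
caption = captioner("example_image.jpg")[0]["generated_text"]  # path is a placeholder

# 2) Let a text-only language model answer the question, conditioning on the
#    caption instead of the raw image, so it can draw on its implicit world knowledge.
reader = pipeline("text2text-generation", model="google/flan-t5-base")
prompt = f"Context: {caption}\nQuestion: Which country does this dog breed come from?\nAnswer:"
answer = reader(prompt, max_new_tokens=10)[0]["generated_text"]
print(answer)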

