dc.contributor.advisor | Azkune Galparsoro, Gorka | |
dc.contributor.advisor | Soroa Echave, Aitor | |
dc.contributor.author | Miranda Martija, Imanol | |
dc.contributor.other | Máster Universitario en Ingeniería Computacional y Sistemas Inteligentes | |
dc.contributor.other | Konputazio Ingeniaritza eta Sistema Adimentsuak Unibertsitate Masterra | |
dc.date.accessioned | 2024-07-09T07:16:33Z | |
dc.date.available | 2024-07-09T07:16:33Z | |
dc.date.issued | 2024-07-09 | |
dc.identifier.uri | http://hdl.handle.net/10810/68845 | |
dc.description.abstract | [EN] The emergence of Transformer architectures, pretrained models and multimodal data problems has generated new challenges to solve. One of the most popular in recent years is the visual-linguistic task of Visual Question Answering (VQA), in which, given an image and a question, the appropriate answer must be found. Several variants of this task have emerged, one of them being the Outside Knowledge Visual Question Answering (OK-VQA) task, on which our research focuses. This task adds the complexity that the answer to the question does not appear explicitly in the image, so an external source of knowledge is needed to answer it. After analyzing the different proposals, the Caption Based Model (CBM) that serves as the basis for our development is presented.
After the problem has been introduced, the contributions of this work are presented, divided into two groups. On the one hand, we propose a multilabel leveraging technique that can be applied to multilabel tasks with optimal and suboptimal solutions, improving model learning. This technique introduces a balance between exploration and exploitation by means of a frequency distribution based on the proportion in which the solutions appear in the ground truth.
On the other hand, different image verbalization approaches are analyzed and compared. First, an object detector is used to obtain the objects and attributes that appear in an image. Thus, in addition to providing the CBM model with the image caption (which represents general image information), we also provide object and attribute information (which represents image details), improving the balance between general and detailed information. Second, due to memory limitations, several reranking systems based on sentence similarity and object bounding box area are presented. These systems seek to improve the quality of the information passed to the model with respect to the question.
After several experiments, we conclude that the new multilabel leveraging technique improves model learning by maintaining the number of optimal solutions while increasing the number of suboptimal solutions generated. Providing more information to the model also improves the results, both by adding attributes to the objects and by increasing the number of objects. The reranking system based on object bounding box area achieves the best results, reinforcing the idea that the questions focus on objects clearly represented in the image. | es_ES |
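The multilabel leveraging technique described in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name and data layout are hypothetical, and we only assume that each question carries several ground-truth answers (as in OK-VQA), whose empirical frequencies define a sampling distribution that balances exploitation of frequent (optimal) answers with exploration of rarer (suboptimal) ones.

```python
import random
from collections import Counter

def sample_target(answers, rng=random):
    """Sample a training target from the empirical answer distribution.

    `answers` is the list of ground-truth answers given by annotators
    (e.g. several per question in OK-VQA). Frequent (optimal) answers
    are sampled more often (exploitation), while rarer but still valid
    (suboptimal) answers are occasionally chosen (exploration).
    """
    counts = Counter(answers)                       # answer -> frequency
    labels = list(counts)
    weights = [counts[a] / len(answers) for a in labels]
    return rng.choices(labels, weights=weights, k=1)[0]
```

Training against a sampled target rather than always the single most frequent answer is one plausible way to keep optimal answers dominant while still exposing the model to suboptimal ones.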
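As a hedged illustration of the bounding-box-area reranking, the sketch below ranks detected objects by box area and verbalizes the survivors for the CBM input. The function names and the `(label, attributes, box)` layout are assumptions for illustration, not the thesis implementation.

```python
def rerank_by_area(detections, top_k=5):
    """Keep the top_k detections with the largest bounding boxes.

    Each detection is (label, attributes, (x1, y1, x2, y2)). Larger
    boxes tend to correspond to objects clearly represented in the
    image, which the questions usually focus on, so those are
    verbalized first and the rest are dropped to fit the input budget.
    """
    def area(det):
        _, _, (x1, y1, x2, y2) = det
        return (x2 - x1) * (y2 - y1)
    return sorted(detections, key=area, reverse=True)[:top_k]

def verbalize(detections):
    """Turn reranked detections into a text string appended to the caption."""
    return ", ".join(f"{' '.join(attrs)} {label}".strip()
                     for label, attrs, _ in detections)
```

The same interface could host the sentence-similarity variant by swapping the `area` key for a question-to-object similarity score.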
dc.description.abstract | [EUS] The emergence of problems involving Transformer architectures, pretrained models and multimodal data has created new challenges to solve. One of the best known in recent years is the Visual Question Answering (VQA) task, in which, given an image and a question, the appropriate answer must be found. Several variants of this task have emerged, one of them being Outside Knowledge Visual Question Answering (OK-VQA), on which our research is based. This task increases the complexity because the answer to the question does not appear explicitly in the image, and an external knowledge source is needed to answer it. After analyzing the different proposals for solving the task, the Caption Based Model (CBM) that serves as the basis for our development is presented.
Once the problem has been explained, the contributions made in this work are presented in two groups. On the one hand, a multilabel leveraging technique, applicable to multilabel tasks with optimal and suboptimal solutions, which improves model learning. This technique uses the concept of a balance between exploration and exploitation by means of a frequency distribution based on the proportion in which the solutions appear.
On the other hand, different approaches to verbalizing images are analyzed and compared. First, an object detector is used to obtain the objects and attributes that appear in an image. Thus, in addition to giving the CBM model the image caption (which represents general image information), we also provide object and attribute information (which represents image details). In this way, the balance between general and detailed information is improved. Second, due to memory constraints, several ranking systems based on sentence similarity and the area of object bounding boxes are presented. These systems aim to improve the quality of the information passed to the model.
After several experiments, we conclude that the new multilabel leveraging technique improves model learning, maintaining the number of optimal solutions and increasing the number of suboptimal solutions generated. Likewise, providing more information to the model improves the results, both by adding attributes to the objects and by increasing the number of objects. The ranking system based on the area of object bounding boxes achieves the best results, reinforcing the idea that the questions focus on objects clearly represented in the image. | es_ES |
dc.language.iso | eng | es_ES |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.subject | transformers | es_ES |
dc.subject | multimodal | es_ES |
dc.subject | OK-VQA | es_ES |
dc.subject | CBM | es_ES |
dc.subject | multilabel leverage technique | es_ES |
dc.subject | object detection | es_ES |
dc.subject | reranking system | es_ES |
dc.title | Answering questions about images that require outside knowledge | es_ES |
dc.title.alternative | Kanpo-ezagutza behar duten irudien gaineko galderak erantzuten | es_ES |
dc.type | info:eu-repo/semantics/masterThesis | |
dc.date.updated | 2023-06-16T12:21:40Z | |
dc.language.rfc3066 | es | |
dc.rights.holder | © 2023, the author | |
dc.identifier.gaurregister | 132741-810752-10 | es_ES |
dc.identifier.gaurassign | 149761-810752 | es_ES |