dc.contributor.advisor | López de Lacalle Lecuona, Oier | |
dc.contributor.advisor | Soroa Echave, Aitor | |
dc.contributor.author | Etxaniz Aragoneses, Julen | |
dc.date.accessioned | 2023-06-30T15:07:02Z | |
dc.date.available | 2023-06-30T15:07:02Z | |
dc.date.issued | 2023-06-30 | |
dc.identifier.uri | http://hdl.handle.net/10810/61827 | |
dc.description.abstract | Humans can learn to understand and process the distribution of space, and one of the initial tasks of Artificial Intelligence has been to teach machines the relationships between space and the objects that appear in it. Humans naturally combine visual and textual information to acquire compositional and spatial relationships among objects, and when reading a text, we are able to mentally depict the spatial relationships described in it. Thus, the visual differences between images depicting "a person sits and a dog stands" and "a person stands and a dog sits" are obvious to humans but remain unclear to automatic systems.
In this project, we propose to evaluate grounded neural language models that can perform compositional and spatial reasoning. Neural language models (LMs) have shown impressive capabilities on many NLP tasks but, despite their success, they have been criticized for their lack of meaning. Vision-and-language models (VLMs), trained jointly on text and image data, have been offered as a response to such criticisms, but recent work has shown that these models struggle to ground spatial concepts properly. In this project, we evaluate state-of-the-art pre-trained and fine-tuned VLMs to assess how well they ground compositional and spatial reasoning. We also propose a variety of methods to create synthetic datasets specifically focused on compositional reasoning.
We accomplished all the objectives of this work. First, we improved the state of the art in compositional reasoning. Next, we performed zero-shot experiments on spatial reasoning. Finally, we explored three alternatives for synthetic dataset creation: text-to-image generation, image captioning, and image retrieval. Code is released at https://github.com/juletx/spatial-reasoning and models are released at https://huggingface.co/juletxara. | es_ES |
dc.language.iso | eng | es_ES |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.subject | artificial intelligence | es_ES |
dc.subject | deep learning | es_ES |
dc.subject | natural language processing | es_ES |
dc.subject | computer vision | es_ES |
dc.subject | grounding | es_ES |
dc.subject | visual reasoning | es_ES |
dc.subject | compositional reasoning | es_ES |
dc.subject | spatial reasoning | es_ES |
dc.title | Grounding Language Models for Compositional and Spatial Reasoning | es_ES |
dc.type | info:eu-repo/semantics/masterThesis | |
dc.date.updated | 2022-10-17T08:00:16Z | |
dc.language.rfc3066 | en | |
dc.rights.holder | © 2022, the author | |
dc.contributor.degree | Máster Universitario en Análisis y Procesamiento del Lenguaje | |
dc.contributor.degree | Hizkuntzaren Azterketa eta Prozesamendua Unibertsitate Masterra | |
dc.identifier.gaurregister | 128439-870161-01 | es_ES |
dc.identifier.gaurassign | 141662-870161 | es_ES |