Show simple item record

dc.contributor.advisor: Agirre Bengoa, Eneko
dc.contributor.advisor: Azkune Galparsoro, Gorka
dc.contributor.author: Domínguez Becerril, Carlos
dc.date.accessioned: 2023-06-30T15:15:25Z
dc.date.available: 2023-06-30T15:15:25Z
dc.date.issued: 2023-06-30
dc.identifier.uri: http://hdl.handle.net/10810/61836
dc.description.abstract: Nowadays, the open-domain Question Answering (QA) problem is usually tackled with a neural architecture built from two main components: the information retriever, whose task is to find the documents most relevant to the question, and the reader, which, given the question and the retrieved documents as context, generates the appropriate answer. In this project we investigate open-domain QA, with a special focus on the first component of the architecture, the information retriever. We train several dense neural retrievers in an unsupervised manner by generating questions from documents with Large Language Models (LLMs). Most current LLMs provide several checkpoints with different numbers of parameters; we use those checkpoints to generate questions, train a dense neural retriever for each LLM checkpoint, and finally compare whether the generated questions influence the performance of the resulting systems. The developed system must be able to search for the necessary information in external sources, usually organized as text documents. For that purpose, the BEIR benchmark [BEIR] is used in a zero-shot manner to measure performance. As a result of this exploration, we found that using LLMs to generate questions is helpful for training information retrieval systems, as it achieves performance similar to that of supervised systems. More concretely, we found that: (i) the more parameters the LLM has, the better the performance; (ii) using sampling to generate questions can further increase performance; and (iii) generating more questions with a smaller language model is not worthwhile, since a checkpoint with more parameters does a better job. Moreover, we found that using different prompts and/or domain adaptation on a specific dataset can improve performance slightly. [es_ES]
dc.language.iso: eng [es_ES]
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: artificial intelligence [es_ES]
dc.subject: deep learning [es_ES]
dc.subject: natural language processing [es_ES]
dc.subject: dense passage retrieval [es_ES]
dc.subject: question generation [es_ES]
dc.subject: unsupervised training [es_ES]
dc.title: Unsupervised information retrieval using large language models [es_ES]
dc.type: info:eu-repo/semantics/masterThesis
dc.date.updated: 2023-02-09T11:17:24Z
dc.language.rfc3066: es
dc.rights.holder: © 2023, the author
dc.contributor.degree: Máster Universitario en Análisis y Procesamiento del Lenguaje
dc.contributor.degree: Hizkuntzaren Azterketa eta Prozesamendua Unibertsitate Masterra
dc.identifier.gaurregister: 128881-879042-05 [es_ES]
dc.identifier.gaurassign: 147928-879042 [es_ES]
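
The abstract above describes a concrete pipeline: prompt an LLM checkpoint to write a question for each document, then train a dense bi-encoder retriever on the resulting (question, passage) pairs with in-batch negatives. What follows is a minimal sketch of that idea, assuming the Hugging Face transformers and sentence-transformers libraries; the model names, prompt wording, and hyperparameters are illustrative assumptions, not the exact setup used in the thesis.

# Minimal sketch: synthetic question generation + unsupervised retriever training.
# All model names, the prompt, and the hyperparameters are illustrative assumptions.
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import SentenceTransformer, InputExample, losses

qg_name = "google/flan-t5-base"  # any seq2seq LLM checkpoint; larger tends to help
tokenizer = AutoTokenizer.from_pretrained(qg_name)
qg_model = AutoModelForSeq2SeqLM.from_pretrained(qg_name)

def generate_question(passage: str) -> str:
    """Prompt the LLM for a question that the passage answers."""
    prompt = f"Write a question answered by the following passage:\n{passage}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # Sampling (rather than greedy decoding) yields more diverse questions.
    output = qg_model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)

passages = ["Dense retrievers embed queries and documents in a shared vector space."]
train_pairs = [InputExample(texts=[generate_question(p), p]) for p in passages]

# Dense bi-encoder trained with in-batch negatives (MultipleNegativesRankingLoss).
retriever = SentenceTransformer("distilbert-base-uncased")
loader = DataLoader(train_pairs, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(retriever)
retriever.fit(train_objectives=[(loader, loss)], epochs=1,
              output_path="output/my-retriever")  # hypothetical save path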
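
The zero-shot evaluation mentioned in the abstract can be run with the beir package, which provides dataset loaders and an evaluator. The sketch below follows that library's standard loading and evaluation flow; the dataset choice (SciFact) and the retriever path reuse assumptions from the previous sketch and are purely illustrative.

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Fetch one BEIR dataset (SciFact chosen arbitrarily for the example).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap the (hypothetical) trained retriever for exact dense search.
model = DRES(models.SentenceBERT("output/my-retriever"), batch_size=64)
retriever = EvaluateRetrieval(model, score_function="cos_sim")

# Zero-shot: retrieve on the unseen dataset and report nDCG/MAP/recall/precision.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)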

