Unsupervised information retrieval using large language models
Date: 2023-06-30
Author: Domínguez Becerril, Carlos
Abstract
Nowadays, the open-domain Question Answering (QA) problem is usually tackled with a neural architecture that has two main components: the information retriever, whose task is to search for the documents most relevant to the question, and the reader, which, given the question and the retrieved documents that serve as context, generates the appropriate answer.
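To make the two-component architecture concrete, the following is a minimal retrieve-then-read sketch. The model names and the tiny in-memory corpus are illustrative assumptions, not the system developed in this project.

```python
# Retrieve-then-read sketch: a dense retriever ranks documents, a reader
# answers from the top-ranked document. Models and corpus are assumptions.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

# Toy document collection standing in for the external knowledge source.
documents = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum and first released in 1991.",
]

# 1) Retriever: a dense bi-encoder embeds the question and every document,
#    then ranks documents by embedding similarity.
retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
doc_embeddings = retriever.encode(documents, convert_to_tensor=True)

question = "Who created the Python programming language?"
q_embedding = retriever.encode(question, convert_to_tensor=True)
hits = util.semantic_search(q_embedding, doc_embeddings, top_k=1)[0]
context = documents[hits[0]["corpus_id"]]

# 2) Reader: given the question and the retrieved context, produce the answer.
reader = pipeline("question-answering",
                  model="distilbert-base-cased-distilled-squad")
answer = reader(question=question, context=context)
print(answer["answer"])
```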
In this project, we propose to investigate open-domain QA, with a special focus on the first component of the architecture, that is, the information retrieval component. We want to train several dense neural retrievers in an unsupervised manner by generating questions from the documents using Large Language Models (LLMs). Currently, most LLMs provide several checkpoints with different numbers of parameters; we want to use those checkpoints to generate questions, train a dense neural retriever for each LLM checkpoint, and finally compare whether the generated questions have any influence on the performance of the systems. The developed system must be able to search for the necessary information in external sources, usually organized as text documents. For that purpose, the BEIR benchmark \cite{BEIR} will be used in a zero-shot manner to test the performance.
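The unsupervised training loop described above can be sketched as follows: an LLM checkpoint generates one synthetic question per document, and the resulting (question, document) pairs train a dense bi-encoder with an in-batch-negatives loss. The checkpoint name, prompt, base encoder, and training settings are assumptions for illustration only.

```python
# Sketch: synthetic question generation + dense retriever training.
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

documents = [
    "The Amazon rainforest produces a large share of the world's oxygen.",
    "The Great Wall of China was built over several dynasties.",
]

# 1) Question generation with a (small) causal LLM checkpoint (assumed).
checkpoint = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
generator = AutoModelForCausalLM.from_pretrained(checkpoint)

pairs = []
for doc in documents:
    prompt = f"Document: {doc}\nWrite a question answered by the document:\nQuestion:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output = generator.generate(**inputs, max_new_tokens=30, do_sample=False)
    question = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True).strip()
    pairs.append(InputExample(texts=[question, doc]))

# 2) Train a dense bi-encoder on the synthetic (question, document) pairs
#    with in-batch negatives (MultipleNegativesRankingLoss).
retriever = SentenceTransformer("distilbert-base-uncased")
loader = DataLoader(pairs, shuffle=True, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(retriever)
retriever.fit(train_objectives=[(loader, loss)], epochs=1)
```

The trained retriever can then be evaluated zero-shot on the BEIR benchmark datasets, with no dataset-specific fine-tuning.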
As a result of this exploration, we found that using LLMs to generate questions can be helpful for training information retrieval systems, as it achieves performance similar to that of supervised systems. More concretely, we found that: (i) the more parameters the LLM has, the better the performance we obtain; (ii) using sampling to generate questions can further increase the performance; and (iii) generating more questions with a smaller language model is not worth it, as a checkpoint with more parameters can do a better job. Moreover, we found that using different prompts and/or domain adaptation on a specific dataset can slightly improve the performance.
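Finding (ii) refers to the decoding strategy used during question generation: sampling draws from the model's token distribution instead of always taking the most likely token, which yields more varied synthetic questions. A minimal sketch with Hugging Face generate(); the checkpoint, prompt, and sampling parameters are assumptions, not the project's actual configuration.

```python
# Greedy decoding vs. sampling for synthetic question generation.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "EleutherAI/gpt-neo-125M"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = ("Document: The Nile is the longest river in Africa.\n"
          "Write a question answered by the document:\nQuestion:")
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: deterministic, one question per document.
greedy = model.generate(**inputs, max_new_tokens=30, do_sample=False)

# Nucleus sampling: stochastic, so repeated draws give diverse questions.
sampled = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                         top_p=0.9, temperature=0.8, num_return_sequences=3)

for seq in sampled:
    print(tokenizer.decode(seq[inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True).strip())
```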