Unsupervised methods to predict example difficulty in word sense annotation

Aceta Moreno, Cristina

Date

2018-06

Author

Aceta Moreno, Cristina

URI

http://hdl.handle.net/10810/27867

Abstract

[EU] Word Sense Disambiguation (WSD) is one of the greatest challenges in Natural Language Processing (NLP). It has been shown that, to train WSD systems that are as successful as possible, it is very important to remove difficult examples (word contexts) from the training data, since doing so improves results considerably. In this work, we first analyze supervised models and then propose two unsupervised measures. In the supervised models, data from an annotated corpus are used to define the difficulty of examples. The two unsupervised measures we propose, in turn, measure the difficulty of the target word (by analyzing its WordNet data) and the difficulty of the word's occurrence (that is, of its context or example). Combining the two, a model to characterize the difficulty of examples is also proposed.

[EN] Word Sense Disambiguation (WSD) is one of the major challenges in Natural Language Processing (NLP). Removing difficult examples (words in context) from the training set has been shown to improve the performance of WSD systems. In this work, we first analyze supervised models that, given annotated data, characterize the difficulty of examples. We then propose two unsupervised measures that characterize the difficulty of target words (by analyzing their WordNet data) and of occurrences (context sentences), respectively. Combining them, we also present a model able to characterize the difficulty of examples.

Collections

Máster Universitario en Análisis y Procesamiento del Lenguaje

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Spain