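Multilingual end-to-end information extraction and neural cross-lingual term alignment (original title: Muturretik muturrerako informazio erauzketa eleaniztuna eta hizkuntzen arteko terminoen lerrokatze neuronala)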

Susperregi Indakoetxea, Mari

View

MAL-Mari_Susperregi.pdf (1.945Mb)

Date

2020-11-26

URI

http://hdl.handle.net/10810/48625

Abstract

The aim of this work is to extract terms from bilingual text pairs: given aligned sentences in Basque and Spanish, the system obtains the significant term pairs that appear in them, for purposes such as building glossaries, enriching dictionaries, or supporting NLP tasks. The corpus was obtained with Erauzterm and Itzulterm, earlier tools built for a similar purpose; using that corpus together with neural network techniques, the goal was to bring Itzulterm up to date with current technology. The system is built on transformer technology, which is based on the sequence-to-sequence approach. Rather than the linguistic and statistical methods used so far, in which the terms of each language are extracted first and then aligned in a pipeline, all steps are carried out simultaneously, which turns the problem into an end-to-end task and reduces error propagation. The system was evaluated with the BLEU metric and with TEB (Termino-Erauzle Balidazioa, term-extractor validation), a metric created specifically for this work that takes into account some features that BLEU ignores and that are important for terminology extraction. The system developed for this master's thesis obtains a BLEU value of 0.78 and improves on the evaluation baseline by 50 BLEU points. Consequently, the work shows that the terminology extraction task can be carried out with neural network technologies.
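As a rough illustration of the end-to-end framing described in the abstract, the sketch below shows one way an aligned sentence pair and its term pairs could be serialized for a sequence-to-sequence model, and how predicted pairs could be compared with reference pairs. Everything in it is an assumption made for illustration: the separators, the function names, and the example sentences are hypothetical and not taken from the thesis, and the pair-level precision, recall, and F1 shown here are a generic check, not the TEB metric.

```python
from typing import List, Tuple

TermPair = Tuple[str, str]  # (Basque term, Spanish term)

# Hypothetical separators; the thesis does not specify a serialization format.
SENT_SEP = " ||| "  # between the Basque and the Spanish sentence
PAIR_SEP = " ; "    # between term pairs in the target sequence
TERM_SEP = " = "    # between the two terms of one pair


def encode_source(eu_sentence: str, es_sentence: str) -> str:
    """Join an aligned sentence pair into one source sequence."""
    return f"{eu_sentence}{SENT_SEP}{es_sentence}"


def encode_target(pairs: List[TermPair]) -> str:
    """Serialize term pairs into one target sequence."""
    return PAIR_SEP.join(f"{eu}{TERM_SEP}{es}" for eu, es in pairs)


def decode_target(text: str) -> List[TermPair]:
    """Parse a generated target sequence back into term pairs."""
    pairs = []
    for chunk in text.split(";"):
        if chunk.count("=") != 1:
            continue  # skip malformed chunks produced by the model
        eu, es = (part.strip() for part in chunk.split("="))
        if eu and es:
            pairs.append((eu, es))
    return pairs


def pair_prf(predicted: List[TermPair],
             reference: List[TermPair]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over term pairs (not TEB)."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example data, for illustration only.
source = encode_source(
    "Sare neuronalak informazio erauzketan erabiltzen dira.",
    "Las redes neuronales se usan en la extracción de información.",
)
reference_pairs = [
    ("sare neuronal", "red neuronal"),
    ("informazio erauzketa", "extracción de información"),
]
target = encode_target(reference_pairs)
```

In this framing a single model call maps `source` to `target`, so extraction and alignment are not separate pipeline stages; `decode_target` recovers the term pairs from the generated string so they can be scored or added to a glossary.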

Collections

Máster Universitario en Análisis y Procesamiento del Lenguaje

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-ShareAlike 3.0 Spain