Towards Orthographic and Grammatical Clinical Text Correction: a First Approach
Date
2020-11-26
Author
Lima López, Salvador
Abstract
Grammatical Error Correction (GEC) is a subfield of Natural Language Processing that aims to automatically correct texts containing spelling, punctuation, or grammar errors. So far, it has mainly focused on texts produced by second-language learners, mostly in English. This Master's Thesis describes a first approach to Grammatical Error Correction for Spanish health records. This area has been little explored so far, neither for Spanish in general nor for the clinical domain specifically. To this end, the IMEC corpus (Informes Médicos en Español Corregidos), a new manually corrected parallel collection of Electronic Health Records, is introduced. The corpus has been automatically annotated using ERRANT, a toolkit specialized in the automatic annotation of GEC parallel corpora, which was adapted to Spanish for this task. Furthermore, several experiments using neural networks and data augmentation are presented and compared with a baseline system also created specifically for this task.