Automating the anonymisation of textual corpora

García Sardiña, Laura

dc.contributor.advisor	Del Pozo, Arantza
dc.contributor.advisor	Aldezabal Roteta, Izaskun
dc.contributor.author	García Sardiña, Laura
dc.date.accessioned	2018-10-05T11:52:57Z
dc.date.available	2018-10-05T11:52:57Z
dc.date.issued	2018-11-04
dc.date.submitted	2018-09-25
dc.identifier.uri	http://hdl.handle.net/10810/28987
dc.description.abstract	[EU] Gaur egun, testu berriak etengabe sortzen doaz sare sozialetako mezu, osasun-txosten, dokumentu o zial eta halakoen ondorioz. Hala ere, testuok informazio pertsonala baldin badute, ezin dira ikerkuntzarako edota beste helburutarako baliatu, baldin eta aldez aurretik ez badira anonimizatzen. Anonimizatze hori automatikoki egitea erronka handia da eta askotan hutsetik anotatutako domeinukako datuak behar dira, ez baita arrunta helburutzat ditugun testuinguruetarako anotatutako corpusak izatea. Hala, tesi honek bi helburu ditu: (i) Gaztelaniazko elkarrizketa espontaneoz osatutako corpus anonimizatu berri bat konpilatu eta eskura jartzea, eta (ii) sortutako baliabide hau ustiatzea informazio sentiberaren identi kazio-teknikak aztertzeko, helburu gisa dugun domeinuan testu etiketaturik izan gabe. Hala, lehenengo helburuari lotuta, ES-Port izeneko corpusa sortu dugu. Telekomunikazio-ekoizle batek ahoz laguntza teknikoa ematen duenean sortu diren 1170 elkarrizketa espontaneoek osatzen dute corpusa. Ordezkatze-tekniken bidez anonimizatu da, eta ondorioz emaitza testu irakurgarri eta naturala izan da. Hamaika anonimizazio-kategoria landu dira, eta baita hizkuntzakoak eta hizkuntzatik kanpokoak diren beste zenbait anonimizazio-fenomeno ere, hala nola, kode-aldaketa, barrea, errepikapena, ahoskatze okerrak, eta abar. Bigarren helburuari lotuta, berriz, anonimizatu beharreko informazio sentibera identi katzeko, gordailuan oinarritutako Ikasketa Aktiboa erabili da, honek helburutzat baitu ahalik eta testu anotatu gutxienarekin sailkatzaile ahalik eta onena lortzea. Horretaz gain, emaitzak hobetzeko, eta abiapuntuko hautaketarako eta galderen hautaketarako estrategiak aztertzeko, Ezagutza Transferentzian oinarritutako teknikak ustiatu dira, aldez aurretik anotatuta zegoen corpus txiki bat oinarri hartuta. Emaitzek adierazi dute, lan honetan aukeratutako metodoak egokienak izan direla abiapuntuko hautaketa egiteko eta kontsulta-estrategia gisa iturri eta helburu sailkapenen zalantzak konbinatzeak Ikasketa Aktiboa hobetzen duela, ikaskuntza-kurba bizkorragoak eta sailkapen-errendimendu handiagoak lortuz iterazio gutxiagotan.	es_ES
dc.description.abstract	[EN] A huge amount of new textual data are created day by day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without undergoing proper anonymisation. Automating such task is challenging and often requires labelling in-domain data from scratch since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new corpus in Spanish with annotated anonymised spontaneous dialogue data, and (ii) to exploit the newly provided resource to investigate techniques for automating the sensitive data identification task, in a setting where initially no annotated data from the target domain are available. Following such aims, first, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, which implies the result is a readable natural text, and it contains annotations of eleven different anonymisation categories, as well as some linguistic and extra-linguistic phenomena annotations like code-switching, laughter, repetitions, mispronunciations, and so on. Next, the compiled corpus is used to investigate automatic sensitive data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier having to annotate as little data as possible. In order to improve such setting, Knowledge Transfer techniques from another small available anonymisation annotated corpus are explored for seed selection and query selection strategies. Results show that using the proposed seed selection methods obtain the best seeds on which to initialise the base learner's training and that combining source and target classifiers' uncertainties as query strategy improves the Active Learning process, deriving in steeper learning curves and reaching top classifier performance in fewer iterations.	es_ES
dc.language.iso	eng	es_ES
dc.rights	info:eu-repo/semantics/openAccess	es_ES
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/3.0/es/	*
dc.subject	automating
dc.subject	anonymisation
dc.subject	textual corpora
dc.title	Automating the anonymisation of textual corpora	es_ES
dc.type	info:eu-repo/semantics/masterThesis	es_ES
dc.rights.holder	Atribución-NoComercial-CompartirIgual 3.0 España	*
dc.departamentoes	Lengua Vasca y Comunicación	es_ES
dc.departamentoeu	Euskal Hizkuntza eta Komunikazioa	es_ES

Files in this item

Name:: license_rdf
Size:: 811bytes
Format:: application/rdf+xml

View/Open

Name:: TFM_Laura_Garcia.pdf
Size:: 1.080Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Máster Universitario en Análisis y Procesamiento del Lenguaje

Show simple item record

Except where otherwise noted, this item's license is described as Atribución-NoComercial-CompartirIgual 3.0 España