
dc.contributor.author: Varona Fernández, Amparo
dc.contributor.author: Peñagarikano Badiola, Mikel
dc.contributor.author: Bordel García, German
dc.contributor.author: Rodríguez Fuentes, Luis Javier
dc.date.accessioned: 2024-03-26T17:27:10Z
dc.date.available: 2024-03-26T17:27:10Z
dc.date.issued: 2024-02-27
dc.identifier.citation: Applied Sciences 14(5) : (2024) // Article ID 1951 [es_ES]
dc.identifier.issn: 2076-3417
dc.identifier.uri: http://hdl.handle.net/10810/66474
dc.description.abstract: The development of speech technology requires large amounts of data to estimate the underlying models. Even when relying on large multilingual pre-trained models, some amount of task-specific data on the target language is needed to fine-tune those models and obtain competitive performance. In this paper, we present a bilingual Basque–Spanish dataset extracted from parliamentary sessions. The dataset is designed to develop and evaluate automatic speech recognition (ASR) systems but can be easily repurposed for other speech-processing tasks (such as speaker or language recognition). The paper first compares the two target languages, emphasizing their similarities at the acoustic-phonetic level, which sets the basis for sharing data and compensating for the relatively small amount of spoken resources available for Basque. Then, Basque Parliament plenary sessions are characterized in terms of organization, topics, speaker turns and the use of the two languages. The paper continues with the description of the data collection procedure (involving both speech and text), the audio formats and conversions along with the creation and postprocessing of text transcriptions and session minutes. Then, it describes the semi-supervised iterative procedure used to cut, rank and select the training segments and the manual supervision process employed to produce the test set. Finally, ASR experiments are presented using state-of-the-art technology to validate the dataset and to set a reference for future works. The datasets, along with models and recipes to reproduce the experiments reported in the paper, are released through Hugging Face. [es_ES]
dc.description.sponsorship: This work was partially funded by the Spanish Ministry of Science and Innovation (OPEN-SPEECH project, PID2019-106424RB-I00) and by the Basque Government under the general support program to research groups (IT-1704-22). [es_ES]
dc.language.iso: eng [es_ES]
dc.publisher: MDPI [es_ES]
dc.relation: info:eu-repo/grantAgreement/MICINN/PID2019-106424RB-I00 [es_ES]
dc.rights: info:eu-repo/semantics/openAccess [es_ES]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/es/
dc.subject: multilingual speech [es_ES]
dc.subject: Basque [es_ES]
dc.subject: Spanish [es_ES]
dc.subject: spoken language resources [es_ES]
dc.subject: low-resource languages [es_ES]
dc.subject: semisupervised learning [es_ES]
dc.subject: automatic speech recognition [es_ES]
dc.title: A Bilingual Basque–Spanish Dataset of Parliamentary Sessions for the Development and Evaluation of Speech Technology [es_ES]
dc.type: info:eu-repo/semantics/article [es_ES]
dc.date.updated: 2024-03-12T16:38:17Z
dc.rights.holder: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). [es_ES]
dc.relation.publisherversion: https://www.mdpi.com/2076-3417/14/5/1951 [es_ES]
dc.identifier.doi: 10.3390/app14051951
dc.departamentoes: Electricidad y electrónica
dc.departamentoeu: Elektrizitatea eta elektronika



© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).