Evaluation of STT technologies performance and database design for Spanish dysarthric speech
Abstract
[EN] Automatic Speech Recognition (ASR) systems have become everyday tools worldwide. Their use has spread over recent years and they have been integrated into Environmental Control Systems (ECS) and Speech Generating Devices (SGD), among other applications. These systems can be especially beneficial for people with physical disabilities, who could control different devices with voice commands, thus reducing the physical effort required. However, people with functional diversity often present difficulties in speech articulation as well. One of the most common speech articulation problems is dysarthria, a disorder of the nervous system that causes weakness in the muscles used for speech. Existing commercial ASR systems are unable to correctly understand dysarthric speech, so people with this condition cannot benefit from this technology. Some research tackling this issue has been conducted, but an optimal solution has not yet been reached. Moreover, nearly all existing research on the matter is in English; no previous study has approached the problem in other languages. In addition, ASR systems require large speech databases, of which currently very few exist; most are in English and were not designed for this purpose. Some commercial ASR systems offer a customization interface where users can train a base model with their own speech data and thereby improve recognition accuracy. In this thesis, we evaluated the performance of the commercial ASR system Microsoft Azure Speech to Text. First, we reviewed the current state of the art. Then, we created a pilot database in Spanish and recorded it with 3 heterogeneous speakers with dysarthria and 1 typical speaker as a reference. Lastly, we trained the system and conducted different experiments to measure its accuracy. Results show that, overall, the customized models outperform the base models of the system.
However, the results were not homogeneous: they vary depending on the speaker. Even though recognition accuracy improved considerably, the results were still far from those obtained for typical speech.
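As an illustration of the kind of accuracy measurement described above, the standard metric for ASR evaluation is the Word Error Rate (WER), computed as the word-level edit distance between a reference transcription and the recognizer's hypothesis, divided by the number of reference words. The abstract does not specify the metric used, so this is a generic sketch; the Spanish command strings in the example are hypothetical, not taken from the thesis database.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical voice command: 2 substitutions + 2 deletions over 5 words -> WER 0.8
print(wer("enciende la luz del salón", "enciende las luces"))
```

Lower WER means better recognition; a customized model "outperforming" a base model, as reported in the results, would correspond to a lower WER on the same test utterances.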