Oesophageal speech: enrichment and evaluations
Laburpena
After a laryngectomy (i.e. removal of the larynx) a patient can no more speak in a healthy laryngeal voice. Therefore, they need to adopt alternative methods of speaking such as oesophageal speech. In this method, speech is produced using swallowed air and the vibrations of the pharyngo-oesophageal segment, which introduces several undesired artefacts and an abnormal fundamental frequency. This makes oesophageal speech processing difficult compared to healthy speech, both auditory processing and signal processing. The aim of this thesis is to find solutions to make oesophageal speech signals easier to process, and to evaluate these solutions by exploring a wide range of evaluation metrics.First, some preliminary studies were performed to compare oesophageal speech and healthy speech. This revealed significantly lower intelligibility and higher listening effort for oesophageal speech compared to healthy speech. Intelligibility scores were comparable for familiar and non-familiar listeners of oesophageal speech. However, listeners familiar with oesophageal speech reported less effort compared to non-familiar listeners. In another experiment, oesophageal speech was reported to have more listening effort compared to healthy speech even though its intelligibility was comparable to healthy speech. On investigating neural correlates of listening effort (i.e. alpha power) using electroencephalography, a higher alpha power was observed for oesophageal speech compared to healthy speech, indicating higher listening effort. Additionally, participants with poorer cognitive abilities (i.e. working memory capacity) showed higher alpha power.Next, using several algorithms (preexisting as well as novel approaches), oesophageal speech was transformed with the aim of making it more intelligible and less effortful. The novel approach consisted of a deep neural network based voice conversion system where the source was oesophageal speech and the target was synthetic speech matched in duration with the source oesophageal speech. This helped in eliminating the source-target alignment process which is particularly prone to errors for disordered speech such as oesophageal speech. Both speaker dependent and speaker independent versions of this system were implemented. The outputs of the speaker dependent system had better short term objective intelligibility scores, automatic speech recognition performance and listener preference scores compared to unprocessed oesophageal speech. The speaker independent system had improvement in short term objective intelligibility scores but not in automatic speech recognition performance. Some other signal transformations were also performed to enhance oesophageal speech. These included removal of undesired artefacts and methods to improve fundamental frequency. Out of these methods, only removal of undesired silences had success to some degree (1.44 \% points improvement in automatic speech recognition performance), and that too only for low intelligibility oesophageal speech.Lastly, the output of these transformations were evaluated and compared with previous systems using an ensemble of evaluation metrics such as short term objective intelligibility, automatic speech recognition, subjective listening tests and neural measures obtained using electroencephalography. Results reveal that the proposed neural network based system outperformed previous systems in improving the objective intelligibility and automatic speech recognition performance of oesophageal speech. In the case of subjective evaluations, the results were mixed - some positive improvement in preference scores and no improvement in speech intelligibility and listening effort scores. Overall, the results demonstrate several possibilities and new paths to enrich oesophageal speech using modern machine learning algorithms. The outcomes would be beneficial to the disordered speech community.