Analysis of the Sensitivity of the End-Of-Turn Detection Task to Errors Generated by the Automatic Speech Recognition Process

Montenegro Portillo, César; Santana Hermida, Roberto; Lozano Alonso, José Antonio

Date

2021-04

Engineering Applications of Artificial Intelligence 100 (2021), Article ID 104189

URI

http://hdl.handle.net/10810/51456

Abstract

An End-Of-Turn Detection Module (EOTD-M) is an essential component of automatic Spoken Dialogue Systems. The capability of correctly detecting whether a user's utterance has ended or not improves the accuracy in interpreting the meaning of the message and decreases the latency in the answer. Usually, in dialogue systems, an EOTD-M is coupled with an Automatic Speech Recognition Module (ASR-M) to transmit complete utterances to the Natural Language Understanding unit. Mistakes in the ASR-M transcription can have a strong effect on the performance of the EOTD-M. The actual extent of this effect depends on the particular combination of ASR-M transcription errors and the sentence featurization techniques implemented as part of the EOTD-M. In this paper we investigate this important relationship for an EOTD-M based on semantic information and particular characteristics of the speakers (speech profiles). We introduce an Automatic Speech Recognition Simulator (ASR-SIM) that models different types of semantic mistakes in the ASR-M transcription as well as different speech profiles. We use the simulator to evaluate the sensitivity to ASR-M mistakes of a Long Short-Term Memory network classifier trained in EOTD with different featurization techniques. Our experiments reveal the different ways in which the performance of the model is influenced by the ASR-M errors. We corroborate that not only is the ASR-SIM useful to estimate the performance of an EOTD-M in customized noisy scenarios, but it can also be used to generate training datasets with the expected error rates of real working conditions, which leads to better performance.
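
To make the idea of injecting transcription errors at a chosen rate more concrete, the following is a minimal sketch of a generic word-level noise model. It is not the ASR-SIM described in the article, which models semantic mistakes and speech profiles; the function name `simulate_asr_errors`, its parameters, and the toy vocabulary are all hypothetical and chosen only for illustration.

```python
import random


def simulate_asr_errors(words, sub_rate, del_rate, ins_rate, vocabulary, rng):
    """Return a noisy copy of `words` (hypothetical helper, not the paper's ASR-SIM).

    Each word is deleted with probability `del_rate`, otherwise substituted
    with probability `sub_rate`; after every kept or substituted word, a
    random vocabulary word is inserted with probability `ins_rate`.
    Assumes del_rate + sub_rate <= 1.
    """
    noisy = []
    for word in words:
        r = rng.random()
        if r < del_rate:
            # Deletion: the recognizer drops the word entirely.
            continue
        if r < del_rate + sub_rate:
            # Substitution: replace with a different vocabulary word if one exists.
            candidates = [v for v in vocabulary if v != word]
            noisy.append(rng.choice(candidates) if candidates else word)
        else:
            noisy.append(word)
        if rng.random() < ins_rate:
            # Insertion: a spurious word follows the current one.
            noisy.append(rng.choice(vocabulary))
    return noisy


# Toy data, assumed for illustration only.
vocabulary = ["i", "want", "to", "book", "a", "flight", "look", "light"]
utterance = ["i", "want", "to", "book", "a", "flight"]

rng = random.Random(0)  # fixed seed so repeated runs give the same noise
noisy_utterance = simulate_asr_errors(
    utterance, sub_rate=0.2, del_rate=0.1, ins_rate=0.05,
    vocabulary=vocabulary, rng=rng,
)
print(" ".join(noisy_utterance))
```

Under these assumptions, applying such a function to clean transcripts with rates matched to the expected working conditions would yield noisy text either for evaluating an end-of-turn classifier or for building a training set, which is the general use the abstract describes for the simulator.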

This is an open access article distributed under the terms of the Creative Commons CC-BY license