Audio Embedding-Aware Dialogue Policy Learning

López Zorrilla, Asier; Torres Barañano, María Inés; Cuayáhuitl, Heriberto

Date

2022-11-30

IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 : 525-538 (2023)

URI

http://hdl.handle.net/10810/59507

Abstract

Following the success of Natural Language Processing (NLP) transformers pretrained via self-supervised learning, similar models have recently been proposed for speech processing, such as Wav2Vec2, HuBERT and UniSpeech-SAT. An interesting yet unexplored application area for these models is Spoken Dialogue Systems, where the users’ audio signals are typically just mapped to word-level features derived from an Automatic Speech Recogniser (ASR) and then processed using NLP techniques to generate system responses. This paper reports a comprehensive comparison of dialogue policies trained using ASR-based transcriptions and extended with the aforementioned audio processing transformers in the DSTC2 task. Our dialogue policies are trained with supervised and policy-based deep reinforcement learning, and they are assessed using both automatic task completion metrics and a human evaluation. Our results reveal that using audio embeddings is more beneficial than detrimental in most of our trained dialogue policies, and that the benefits are stronger for supervised learning than for reinforcement learning.
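
The abstract does not say how the audio embeddings are combined with the ASR-based features, so the following is only an illustrative sketch and not the authors' architecture. It assumes that frame-level audio vectors (which in practice might come from a pretrained model such as Wav2Vec2) are mean-pooled into one utterance-level vector, concatenated with a text feature vector, and passed to a linear-softmax policy over system actions. All names, dimensions and values (`mean_pool`, `policy_probs`, the toy inputs) are hypothetical.

```python
import math
import random


def mean_pool(frames):
    """Average frame-level vectors into one utterance-level vector."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[i] for frame in frames) / n for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_probs(text_features, audio_frames, weights, bias):
    """Action distribution from concatenated text and audio features.

    Assumption: fusion by simple concatenation; the paper may differ.
    """
    state = text_features + mean_pool(audio_frames)  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, state)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Toy inputs (hypothetical): 3 text features, 2 audio frames of dimension 2.
rng = random.Random(0)
text_features = [1.0, 0.0, 0.5]
audio_frames = [[0.2, 0.4], [0.6, 0.0]]
n_actions = 4
state_dim = len(text_features) + len(audio_frames[0])

# Randomly initialised policy parameters; a real policy would be trained.
weights = [[rng.uniform(-1, 1) for _ in range(state_dim)] for _ in range(n_actions)]
bias = [0.0] * n_actions

probs = policy_probs(text_features, audio_frames, weights, bias)
print(probs)
```

In such a setup, a text-only baseline would correspond to dropping the pooled audio vector from `state`, which is one simple way to compare policies with and without audio embeddings.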

Collections

Articles