Acceptance of a socially assistive robot depends fundamentally on its ability to communicate with other agents in its environment. Within the ROBIN project, we investigated situational dialogue through voice interaction with a robot. This paper presents speech recognition experiments with deep neural networks, aimed at producing models that are fast (under 100 ms of latency from the network itself) yet still reliable. Although low latency is a key design goal, the final deep neural network model achieves state-of-the-art results for Romanian speech recognition, obtaining a 9.91% word error rate (WER) when combined with a language model, thus improving on previous results while also offering better runtime performance. Additionally, we explore two modules for correcting the automatic speech recognition (ASR) output (hyphen and capitalization restoration, and unknown word correction), targeting the ROBIN project's goal of dialogue in closed micro-worlds. We design a modular, API-based architecture that allows an integration engine (either on the robot or external) to chain the available modules together as needed. Finally, we test the proposed design by integrating it into the RELATE platform and making the ASR service available to web users, who can either upload a file or record new speech.
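For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function name `word_error_rate` and the example sentences are assumptions made for illustration, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenization is a plain
    whitespace split, with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example: one of four reference words is dropped.
    print(word_error_rate("robotul merge la masă", "robotul merge masă"))  # 0.25
```

Under this definition, a 9.91% WER corresponds to roughly one word-level error per ten reference words.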