Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the fine-grained motor control that touchscreens require. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or hard of hearing, have a severe stutter, or are minimally verbal. We introduce an alternative voice-based input system which relies on sound event detection using fifteen nonverbal mouth sounds like "pop," "click," or "eh." This system was designed to work regardless of one's speech abilities and allows full access to existing technology. In this paper, we describe the design of a dataset, model considerations for real-world deployment, and efforts towards model personalization. Our fully-supervised model achieves segment-level precision and recall of 88.6% and 88.4% on an internal dataset of 710 adults, while achieving 0.31 false positives per hour on aggressors such as speech. Five-shot personalization enables satisfactory performance in 84.5% of cases where the generic model fails.
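As a rough illustration of the evaluation quantities reported above, the sketch below computes segment-level precision, recall, and false positives per hour for a sound event detector. The segment representation, the overlap-based matching rule, and all function names are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# A minimal sketch (not the paper's protocol) of segment-level precision,
# recall, and false positives per hour for a sound event detector.
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, label) -- assumed format

def overlaps(a: Segment, b: Segment) -> bool:
    """Assumed matching rule: same label and any overlap in time."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

def segment_metrics(preds: List[Segment], refs: List[Segment],
                    audio_hours: float) -> Tuple[float, float, float]:
    # A predicted segment is a true positive if it matches any reference.
    tp = sum(any(overlaps(p, r) for r in refs) for p in preds)
    fp = len(preds) - tp
    # A reference segment is missed if no prediction matches it.
    fn = sum(not any(overlaps(r, p) for p in preds) for r in refs)
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if refs else 0.0
    fp_per_hour = fp / audio_hours
    return precision, recall, fp_per_hour

# Example: one correct "pop", one spurious "click" over half an hour of audio.
preds = [(1.0, 1.2, "pop"), (5.0, 5.1, "click")]
refs = [(0.9, 1.3, "pop")]
print(segment_metrics(preds, refs, audio_hours=0.5))  # (0.5, 1.0, 2.0)
```

Under this convention, the "false positives per hour on aggressors" figure would come from running the detector on audio containing only non-target sounds (e.g., ordinary speech), where every detection counts as a false positive.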