Gestures performed alongside speech are essential for voice interaction, as they convey complementary semantics for interaction purposes such as wake-up state and input modality. In this paper, we investigated voice-accompanying hand-to-face (VAHF) gestures for voice interaction. We targeted hand-to-face gestures because they relate closely to speech and yield salient acoustic features (e.g., impeding voice propagation). We conducted a user study to explore the design space of VAHF gestures: we first gathered candidate gestures and then applied a structural analysis along dimensions such as contact position and contact type, yielding a set of 8 VAHF gestures with good usability and minimal confusion. To facilitate VAHF gesture recognition, we proposed a novel cross-device sensing method that leverages heterogeneous data channels (vocal, ultrasound, and IMU) from commodity devices (earbuds, watches, and rings). Our recognition model achieved an accuracy of 97.3% for recognizing 3 gestures and 91.5% for recognizing 8 gestures (excluding the "empty" gesture), demonstrating its high applicability. Quantitative analysis further sheds light on the recognition capability of each sensor channel and of their combinations. Finally, we illustrated feasible use cases and their design principles to demonstrate the applicability of our system in various scenarios.
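To make the cross-device sensing pipeline concrete, the following is a minimal sketch of how heterogeneous channel features (vocal, ultrasound, IMU) from earbuds, a watch, and a ring could be fused and classified. It is an illustrative early-fusion example with a generic classifier and made-up feature dimensions, not the paper's actual recognition model; the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-channel feature vectors extracted from one gesture window:
# vocal (earbud microphone), ultrasound (earbud speaker/mic pair), and
# IMU (watch + ring accelerometer/gyroscope). Dimensions are illustrative only.
def fuse_features(vocal_feat, ultrasound_feat, imu_feat):
    """Concatenate heterogeneous channel features into one vector (early fusion)."""
    return np.concatenate([vocal_feat, ultrasound_feat, imu_feat])

# Toy training data: 8 VAHF gesture classes plus an "empty" (no-gesture) class.
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 64 + 32 + 24))   # fused feature vectors (synthetic)
y = rng.integers(0, 9, size=900)           # labels 0..7 = gestures, 8 = empty

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# At inference time, features from each device channel are fused, then classified.
sample = fuse_features(rng.normal(size=64), rng.normal(size=32), rng.normal(size=24))
print(clf.predict(sample.reshape(1, -1)))
```

Dropping one of the three feature segments before concatenation is a simple way to probe how much each sensor channel (or channel combination) contributes to recognition, mirroring the channel-wise quantitative analysis described in the abstract.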