Silent speech interfaces are a promising technology that enables private communication in natural language. However, previous approaches support only a small and inflexible vocabulary, which limits their expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to varying lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, it achieves an F1 score of 0.8947 using only one shot, and its performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability, guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides the essential functionality for customizable silent speech interaction with high usability and learnability.
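To illustrate how contrastive lipreading embeddings can support few-shot command customization and incremental refinement, the following is a minimal sketch. It assumes a nearest-prototype scheme over L2-normalized embeddings with running-mean updates; the `embed` function here is a hypothetical stand-in (a fixed random projection) for the paper's trained contrastive encoder, and the exact enrollment and update logic of LipLearner may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the contrastive lipreading encoder: a fixed
# random projection mapping per-clip features to a unit-norm embedding.
# The real system would use a trained video encoder instead.
PROJ = rng.standard_normal((16, 128))

def embed(clip_features: np.ndarray) -> np.ndarray:
    z = clip_features @ PROJ
    return z / np.linalg.norm(z)

class FewShotCommandClassifier:
    """Nearest-prototype classifier over contrastive embeddings.

    Each command is represented by the running mean of its enrolled
    example embeddings, so a single shot suffices to register a command,
    and later shots refine the prototype incrementally.
    """
    def __init__(self):
        self.prototypes: dict[str, np.ndarray] = {}
        self.counts: dict[str, int] = {}

    def enroll(self, command: str, clip_features: np.ndarray) -> None:
        z = embed(clip_features)
        n = self.counts.get(command, 0)
        proto = self.prototypes.get(command, np.zeros_like(z))
        # Incremental (running-mean) prototype update.
        self.prototypes[command] = (proto * n + z) / (n + 1)
        self.counts[command] = n + 1

    def classify(self, clip_features: np.ndarray) -> tuple[str, float]:
        z = embed(clip_features)
        # Cosine similarity of the query against every command prototype.
        best, best_sim = "", -1.0
        for cmd, proto in self.prototypes.items():
            sim = float(z @ (proto / np.linalg.norm(proto)))
            if sim > best_sim:
                best, best_sim = cmd, sim
        return best, best_sim

# Usage: one-shot enrollment of two commands, then classification of a
# noisy repetition of the first command.
clf = FewShotCommandClassifier()
shot_a = rng.standard_normal(16)
shot_b = rng.standard_normal(16)
clf.enroll("open camera", shot_a)
clf.enroll("send message", shot_b)
print(clf.classify(shot_a + 0.1 * rng.standard_normal(16)))
```

Because the prototype is a running mean, additional enrollments improve reliability without retraining the encoder, which is what makes on-device, user-driven customization practical in this kind of design.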