Keyword spotting (KWS) is an essential component of a smart device: it alerts the system when a user issues a command. Because these devices are typically constrained in computational and energy resources, the KWS model should have a small footprint. In our previous work, we developed lightweight dynamic filters that extract a robust feature map in noisy environments. The learnable parameters of the dynamic filter are jointly optimized with the KWS weights using the cross-entropy (CE) loss. CE loss alone, however, is not sufficient for high performance when the signal-to-noise ratio (SNR) is low. To train the network for more robust performance in noisy environments, we introduce the LOw Variant Orthogonal (LOVO) loss. The LOVO loss is composed of a triplet loss applied to the output of the dynamic filter, a spectral norm-based orthogonal loss, and an inner-class distance loss applied in the KWS model. These losses are particularly useful in encouraging the network to extract discriminative features in unseen noise environments.
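The abstract names the three components of the LOVO loss but does not give their formulas. As a rough illustration only, the sketch below shows one common way each term could be written in NumPy. Every function name, the squared-Euclidean distance, the margin, the use of the spectral norm of W W^T - I, the centroid-based inner-class distance, and the weighting scheme are assumptions made for this example, not the authors' formulation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Assumed form: hinge on squared Euclidean distances, averaged over the batch.
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

def spectral_orthogonal_loss(weight):
    # Assumed form: spectral norm (largest singular value) of W W^T - I,
    # which is zero when the rows of W are orthonormal.
    gram = weight @ weight.T
    return float(np.linalg.norm(gram - np.eye(weight.shape[0]), ord=2))

def inner_class_distance_loss(features, labels):
    # Assumed form: mean squared distance of each feature to its class centroid.
    total = 0.0
    for c in np.unique(labels):
        members = features[labels == c]
        center = members.mean(axis=0)
        total += np.sum((members - center) ** 2)
    return float(total / len(features))

def lovo_loss(triplet, orthogonal, inner, weights=(1.0, 1.0, 1.0)):
    # Hypothetical weighted sum of the three terms; the weights are placeholders.
    w_t, w_o, w_i = weights
    return w_t * triplet + w_o * orthogonal + w_i * inner
```

In training, a combined term of this kind would presumably be added to the CE loss so that the dynamic filter and the KWS model are still optimized jointly; how the terms are actually balanced is not stated in the abstract.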