Evaluating keyword spotting (KWS) systems, which detect keywords in speech, is challenging under realistic privacy constraints. A KWS system is designed to collect data only when the keyword is present, which limits the availability of hard samples that may contain false negatives and prevents direct estimation of model recall from production data. Complementary data collected from other sources, on the other hand, may not be fully representative of the real application. In this work, we propose an evaluation technique that we call AB/BA analysis. Our framework evaluates a candidate KWS model B against a baseline model A, using cross-dataset offline decoding to estimate relative recall without requiring negative examples. Moreover, we propose a formulation whose assumptions allow the relative false positive rate between models to be estimated with low variance even when the number of false positives is small. Finally, we propose leveraging machine-generated soft labels, in a technique we call Semi-Supervised AB/BA analysis, which improves analysis time, privacy, and cost. Experiments with both simulated and real data show that AB/BA analysis successfully measures recall improvement together with the trade-off in relative false positive rate.
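The abstract does not state the estimator itself, so the following is only a minimal sketch of how cross-dataset offline decoding could yield a relative recall estimate without negative examples. It assumes (these are assumptions, not details from the text) that dataset D_A holds true keyword utterances accepted by model A, that D_B holds true keyword utterances accepted by model B, and that both were drawn from the same keyword population. Under those assumptions, P(B accepts | A accepts) / P(A accepts | B accepts) equals recall_B / recall_A, because the joint acceptance probability cancels. The function and argument names are hypothetical.

```python
from typing import Sequence


def relative_recall(b_accepts_on_a_data: Sequence[bool],
                    a_accepts_on_b_data: Sequence[bool]) -> float:
    """Estimate recall_B / recall_A from cross-dataset decoding.

    b_accepts_on_a_data: decisions of model B on keyword utterances
        collected (accepted) by model A. Assumed to be true positives.
    a_accepts_on_b_data: decisions of model A on keyword utterances
        collected (accepted) by model B. Assumed to be true positives.
    """
    if not b_accepts_on_a_data or not a_accepts_on_b_data:
        raise ValueError("both datasets must be non-empty")

    # Fraction of A-collected keywords that B also accepts: P(B | A).
    p_b_given_a = sum(b_accepts_on_a_data) / len(b_accepts_on_a_data)
    # Fraction of B-collected keywords that A also accepts: P(A | B).
    p_a_given_b = sum(a_accepts_on_b_data) / len(a_accepts_on_b_data)

    if p_a_given_b == 0:
        raise ValueError("model A accepts nothing from D_B; ratio undefined")

    # P(B | A) / P(A | B) = P(B) / P(A), since P(A and B) cancels.
    return p_b_given_a / p_a_given_b


if __name__ == "__main__":
    # Toy decisions: B accepts 90 of 100 A-collected keywords,
    # A accepts 75 of 100 B-collected keywords.
    b_on_a = [True] * 90 + [False] * 10
    a_on_b = [True] * 75 + [False] * 25
    print(f"estimated recall_B / recall_A = {relative_recall(b_on_a, a_on_b):.3f}")
```

A ratio above 1 would suggest that candidate B recalls more keywords than baseline A on this population. The sketch ignores label noise and sampling variance, both of which a real analysis would need to address.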