Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem because of the limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, which causes recognition to underperform for households drawn from specific cohorts or otherwise exhibiting high confusability. In this work, we propose a graph-based semi-supervised learning approach that improves household-level SID accuracy and robustness through locally adapted graph normalization and multi-signal fusion with multi-view graphs. Unlike other work on household SID, fairness, and signal fusion, this work focuses on speaker label inference (scoring) and provides a simple way to realize household-specific adaptation and multi-signal fusion without tuning the embeddings or training a fusion network. Experiments on the VoxCeleb dataset demonstrate that our approach consistently improves performance across households with different customer cohorts and degrees of confusability.
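To make the idea of graph-based label inference within a household concrete, the following is a minimal sketch and not the method proposed in this work. It assumes standard label spreading with symmetric degree normalization (the closed form F = (I - alpha*S)^-1 Y), cosine-similarity graphs, and multi-view fusion by a weighted average of per-view affinity matrices. The function names, the `alpha` value, and the uniform view weights are all hypothetical choices made for illustration; the locally adapted normalization described above is not reproduced here.

```python
import numpy as np

def cosine_affinity(emb):
    """Non-negative cosine-similarity graph with a zeroed diagonal."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(w, 0.0)
    return w

def propagate_labels(views, labels, n_speakers, alpha=0.9, view_weights=None):
    """Infer speaker labels for one household's utterances.

    views: list of (n, d_v) arrays, one per signal (assumed; e.g. two embeddings).
    labels: length-n int array, speaker index for enrollment utterances, -1 otherwise.
    """
    labels = np.asarray(labels)
    if view_weights is None:
        # Assumption: every view contributes equally.
        view_weights = np.ones(len(views)) / len(views)
    # Fuse the views by averaging their affinity matrices.
    w = sum(wt * cosine_affinity(v) for wt, v in zip(view_weights, views))
    # Generic symmetric normalization S = D^-1/2 W D^-1/2 on the household graph.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(w.sum(axis=1), 1e-12))
    s = d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # One-hot seed matrix from the enrollment utterances.
    y = np.zeros((len(labels), n_speakers))
    labeled = labels >= 0
    y[labeled, labels[labeled]] = 1.0
    # Closed-form label spreading: solve (I - alpha * S) F = Y.
    f = np.linalg.solve(np.eye(len(labels)) - alpha * s, y)
    return f.argmax(axis=1)
```

Because the graph is built only from one household's utterances, the normalization and propagation operate on local structure, and no embedding fine-tuning or trained fusion network is involved in this sketch.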