Speaker counting is the task of estimating the number of people who are speaking simultaneously in an audio recording. For several audio processing tasks, such as speaker diarization, separation, localization and tracking, knowing the number of speakers at each timestep is a prerequisite, or at least a strong advantage, and it also enables low-latency processing. In previous work, we addressed the speaker counting problem with a multichannel convolutional recurrent neural network that produces an estimate at a short-term frame resolution. In this work, we show that, for a given frame, there is an optimal position in the input sequence that yields the best prediction accuracy. We empirically demonstrate the link between that optimal position, the length of the input sequence and the size of the convolutional filters.
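To make the notion of a frame's position in the input sequence concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-frame feature vectors stored as plain lists and zero-padding outside the recording; the names `window_for_frame`, `lookahead_frames`, `seq_len` and `pos` are hypothetical. It builds the input sequence in which a chosen frame sits at a chosen index, and reports how many future frames that choice requires, which is what ties the position to latency.

```python
from typing import List, Sequence

Frame = List[float]


def window_for_frame(features: Sequence[Frame], t: int,
                     seq_len: int, pos: int) -> List[Frame]:
    """Return the seq_len-frame window in which frame t sits at index pos.

    Frames that fall outside the recording are replaced by zero vectors
    (an assumption made for this sketch; other padding schemes are possible).
    """
    if not 0 <= pos < seq_len:
        raise ValueError("pos must satisfy 0 <= pos < seq_len")
    if not 0 <= t < len(features):
        raise ValueError("t must index an existing frame")

    n_features = len(features[0])
    start = t - pos  # index of the first frame in the window (may be negative)
    window = []
    for i in range(start, start + seq_len):
        if 0 <= i < len(features):
            window.append(list(features[i]))
        else:
            window.append([0.0] * n_features)  # zero-pad past the edges
    return window


def lookahead_frames(seq_len: int, pos: int) -> int:
    """Number of future frames needed when the target sits at index pos."""
    return seq_len - 1 - pos


if __name__ == "__main__":
    # Toy recording: 10 frames, one feature per frame.
    feats = [[float(i)] for i in range(10)]
    print(window_for_frame(feats, t=4, seq_len=5, pos=2))
    print(lookahead_frames(seq_len=5, pos=2))
```

Under these assumptions, placing the target frame at the last index (`pos = seq_len - 1`) needs no future frames, while any earlier position trades extra look-ahead for more right-hand context.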