State-of-the-art separation of desired signal components (DSCs) from a mixture is achieved using time-frequency masks or filters estimated by a deep neural network (DNN). The DSCs are typically defined at training time or, alternatively, during inference via a reference signal (RS). In the latter case, an auxiliary DNN typically extracts signal characteristics (SCs) from the RS and estimates a set of adaptive weights (AWs) for the first DNN. In both cases, the information about the DSCs is stored in the DNN weights. Current methods using audio RSs estimate time-invariant AWs. Applications in which the RS and DSCs exhibit time-variant SCs, i.e., in which they cannot be assigned to a specific class such as speech, require time-variant AWs. An example is acoustic echo cancellation with the loudspeaker signal as the RS. We propose a method to extract time-variant AWs from an RS and additionally show that current time-invariant AW methods can be employed for universal source separation. To avoid strong scaling between the estimate and the mixture, we propose to train with the dual scale-invariant signal-to-distortion ratio in a TasNet-inspired DNN. We evaluate the proposed AW systems under various acoustic conditions and show the scenario-dependent advantages of time-variant over time-invariant AWs.
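To illustrate why scaling between the estimate and the mixture can arise, the following minimal sketch computes the standard scale-invariant signal-to-distortion ratio (SI-SDR) and shows that rescaling the estimate leaves the value essentially unchanged. This is the conventional SI-SDR only; the dual variant proposed here is not defined in this abstract and is not reproduced. The function name `si_sdr`, the `eps` regularizer, and the synthetic signals are assumptions made for illustration.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Standard SI-SDR in dB for two 1-D signals (illustrative sketch)."""
    # Remove the mean so that a DC offset does not affect the projection.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the optimally scaled target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    scaled_target = alpha * target
    residual = estimate - scaled_target
    num = np.dot(scaled_target, scaled_target) + eps
    den = np.dot(residual, residual) + eps
    return 10.0 * np.log10(num / den)


# Synthetic example (assumed data): a target plus a small amount of noise.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)

base = si_sdr(estimate, target)
rescaled = si_sdr(0.01 * estimate, target)  # same estimate, 40 dB quieter

print(f"SI-SDR of estimate:          {base:.2f} dB")
print(f"SI-SDR of rescaled estimate: {rescaled:.2f} dB")
```

Because both values agree up to the small `eps` term, a network trained on this loss alone receives no penalty for producing an estimate whose level differs strongly from that of the mixture, which is the behavior the proposed training objective is meant to avoid.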