There is a strong need to deploy target speaker separation (TSS) models on mobile devices, where model size and computational complexity are limited. To better perform TSS for mobile voice communication, we first build LibriPhone, a dual-channel dataset for a specific scenario. To better mimic real-world conditions, LibriPhone is not simulated from a single-channel dataset: it is made by simultaneously replaying pairs of utterances from LibriSpeech through two professional artificial heads and recording them with the two built-in microphones of a mobile phone. We then propose a lightweight time-frequency domain separation model, LSTM-Former, which is based on the LSTM framework and trained with a scale-invariant source-to-noise ratio (SI-SNR) loss. In the experiments on LibriPhone, we explore the dual-channel LSTM-Former model and a single-channel version that uses a randomly selected single channel of LibriPhone. Experimental results show that the dual-channel LSTM-Former outperforms the single-channel LSTM-Former with a relative improvement of 25%. This work provides a feasible solution for the TSS task on mobile devices: playing back and recording multiple data sources in real application scenarios to obtain real dual-channel data can help a lightweight model achieve higher performance.
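The SI-SNR loss named above can be illustrated with a minimal sketch. It follows the commonly used definition of SI-SNR (zero-mean signals, projection of the estimate onto the target), which may differ in detail from the authors' implementation. The function names `si_snr` and `si_snr_loss` and the `eps` parameter are assumptions made for illustration. The sketch uses only the Python standard library; a real training setup would more likely compute this on tensors.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals.

    Hypothetical helper for illustration; not taken from the paper.
    """
    n = len(target)
    if n == 0 or len(estimate) != n:
        raise ValueError("estimate and target must be non-empty and equal in length")

    # Remove the mean of each signal.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target to get the scaled target component.
    dot_et = sum(a * b for a, b in zip(e, t))
    energy_t = sum(b * b for b in t)
    scale = dot_et / (energy_t + eps)
    s_target = [scale * b for b in t]

    # Whatever is left after the projection is treated as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]

    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(v * v for v in e_noise)
    return 10.0 * math.log10(signal_energy / (noise_energy + eps) + eps)


def si_snr_loss(estimate, target):
    """Negative SI-SNR, so that minimizing the loss maximizes SI-SNR."""
    return -si_snr(estimate, target)


# Toy usage: a sinusoidal "target speaker" plus a weaker interfering signal.
target = [math.sin(0.1 * i) for i in range(1000)]
interference = [math.cos(0.37 * i) for i in range(1000)]
estimate = [s + 0.1 * v for s, v in zip(target, interference)]
loss = si_snr_loss(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the value essentially unchanged, which is the property that makes the measure scale-invariant.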