We present a learning-based approach for generating binaural audio from mono audio using multi-task learning. Our formulation leverages information shared between two related tasks: binaural audio generation and flipped-audio classification. Our model extracts spatialization features from the visual and audio inputs, predicts the left and right audio channels, and judges whether the left and right channels are flipped. First, we extract visual features from the video frames using ResNet. Next, we perform binaural audio generation and flipped-audio classification with separate subnetworks conditioned on these visual features. Training optimizes an overall loss given by a weighted sum of the losses of the two tasks. We train and evaluate our model on the FAIR-Play and YouTube-ASMR datasets, and perform quantitative and qualitative evaluations that demonstrate the benefits of our approach over prior techniques.
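The weighted multi-task objective described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the loss weights `W_GEN` and `W_CLS`, the use of an MSE generation loss, and a binary cross-entropy flip-classification loss are all assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn.functional as F

# Hypothetical loss weights; the abstract only states a weighted sum is used.
W_GEN, W_CLS = 1.0, 0.5

def multitask_loss(pred_channels, true_channels, flip_logits, flip_labels):
    """Overall loss: weighted sum of the binaural generation loss
    and the flipped-audio classification loss.

    pred_channels / true_channels: predicted and ground-truth
        left/right audio (any matching tensor shape).
    flip_logits / flip_labels: raw logits and 0/1 labels indicating
        whether the left and right channels are flipped.
    """
    # Generation task: regression on the predicted channels (MSE assumed here).
    gen_loss = F.mse_loss(pred_channels, true_channels)
    # Auxiliary task: binary classification of channel flipping.
    cls_loss = F.binary_cross_entropy_with_logits(flip_logits, flip_labels)
    return W_GEN * gen_loss + W_CLS * cls_loss
```

In this kind of setup the auxiliary flip-classification loss acts as a regularizer that encourages the shared features to encode left/right spatial cues.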