We introduce PodcastMix, a dataset formalizing the task of separating background music and foreground speech in podcasts. We aim to define a benchmark suitable for training and evaluating (deep learning) source separation models. To that end, we release a large and diverse training dataset based on programmatically generated podcasts. However, current (deep learning) models can suffer from generalization issues, especially when trained on synthetic data. To address this, we release an evaluation set based on real podcasts, for which we design objective and subjective tests. Our experiments with real podcasts show that current (deep learning) models may indeed have generalization issues. Yet these models can perform competently: for example, our best baseline separates speech with a mean opinion score of 3.84 (rating "overall separation quality" from 1 to 5). The dataset and baselines are accessible online.
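To make the idea of a programmatically generated podcast concrete, the following is a minimal sketch of how a synthetic training example could be built by summing foreground speech with attenuated background music. The abstract does not describe the actual generation procedure, so the function name `mix_podcast`, the gain range, and the peak-rescaling rule are all illustrative assumptions rather than the PodcastMix recipe.

```python
import math
import random


def mix_podcast(speech, music, music_gain_db, peak=0.99):
    """Mix foreground speech with attenuated background music.

    All names and the gain-based mixing rule are illustrative assumptions,
    not the procedure used to build PodcastMix.
    """
    n = min(len(speech), len(music))  # trim both sources to a common length
    gain = 10.0 ** (music_gain_db / 20.0)  # convert dB to linear amplitude
    speech_t = list(speech[:n])
    music_t = [gain * m for m in music[:n]]
    mixture = [s + m for s, m in zip(speech_t, music_t)]

    # Rescale the mixture and both targets together so the mixture does not
    # clip and mixture == speech + music still holds sample by sample.
    max_abs = max((abs(x) for x in mixture), default=0.0)
    if max_abs > peak:
        scale = peak / max_abs
        mixture = [scale * x for x in mixture]
        speech_t = [scale * x for x in speech_t]
        music_t = [scale * x for x in music_t]
    return mixture, speech_t, music_t


# Toy sine waves standing in for real audio (1 second at 8 kHz).
sr = 8000
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr)]
music = [0.5 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
music_gain_db = rng.uniform(-20.0, -6.0)  # hypothetical gain range
mixture, speech_t, music_t = mix_podcast(speech, music, music_gain_db)
```

Returning the scaled speech and music alongside the mixture keeps the targets consistent with the input, which is what a supervised separation model would need. Real podcasts have no such ground-truth stems, which is one reason a separate evaluation set with subjective tests is useful.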