Labeling and maintaining a commercial sound effects library is a time-consuming task, exacerbated by databases that continually grow in size and undergo taxonomy updates. Moreover, sound search and taxonomy creation are complicated by non-uniform metadata, a persistent problem even with the introduction of a new industry standard, the Universal Category System. To address these problems and overcome dataset-dependent limitations that inhibit the successful training of deep learning models, we pursue representation learning to train generalized embeddings that can be used across a wide variety of sound effects libraries and that provide a taxonomy-agnostic representation of sound. We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size, outperforming established representations such as OpenL3. Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.