Content creators often use music to enhance their stories, as it can be a powerful tool to convey emotion. In this paper, our goal is to help creators find music to match the emotion of their story. We focus on text-based stories that can be auralized (e.g., books), use multiple sentences as input queries, and automatically retrieve matching music. We formalize this task as a cross-modal text-to-music retrieval problem. Both the music and text domains have existing datasets with emotion labels, but mismatched emotion vocabularies prevent us from using mood or emotion annotations directly for matching. To address this challenge, we propose and investigate several emotion embedding spaces, both manually defined (e.g., valence-arousal) and data-driven (e.g., Word2Vec and metric learning), to bridge this gap. Our experiments show that by leveraging these embedding spaces, we are able to successfully bridge the gap between modalities to facilitate cross-modal retrieval. We show that our method can leverage the well-established valence-arousal space, but that it can also achieve our goal via data-driven embedding spaces. By leveraging data-driven embeddings, our approach has the potential to generalize to other retrieval tasks that require broader or completely different vocabularies.
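As a rough illustration of the data-driven variant of this idea, the sketch below embeds the emotion vocabulary of both modalities into a shared vector space and ranks music tracks by cosine similarity to a story query's emotion embedding. The tiny hand-written embeddings, tag names, and helper functions are placeholders for a learned space (e.g., Word2Vec or metric-learning vectors), not the paper's actual model or data.

```python
import numpy as np

# Placeholder shared emotion space; values are illustrative only and could be
# read as rough (valence, arousal) coordinates or learned embedding dimensions.
EMOTION_SPACE = {
    "joyful": np.array([0.9, 0.7]),
    "tender": np.array([0.6, 0.2]),
    "tense":  np.array([-0.5, 0.8]),
    "somber": np.array([-0.7, 0.1]),
}

def embed_tags(tags):
    """Average the embeddings of a modality's emotion tags into one query vector."""
    vecs = [EMOTION_SPACE[t] for t in tags if t in EMOTION_SPACE]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(story_tags, music_library, k=3):
    """Rank music tracks by emotion similarity to the story query."""
    q = embed_tags(story_tags)
    scored = [(cosine(q, embed_tags(tags)), track)
              for track, tags in music_library.items()]
    return sorted(scored, reverse=True)[:k]

# Hypothetical music library annotated with its own mood vocabulary, mapped
# into the same shared space before comparison.
library = {
    "track_a": ["joyful"],
    "track_b": ["somber", "tender"],
    "track_c": ["tense"],
}
print(retrieve(["tender", "joyful"], library, k=2))
```

Because both modalities are compared only after being projected into the shared space, the music-side and text-side vocabularies never need to match directly.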