3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet faithfully reproducing real-world 3D scenes remains a challenging task, partly because few available datasets support audio-visual research in this direction. Most existing multi-view datasets neglect the accompanying audio. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, its quality falls far short of standard production needs. We present "Tragic Talkers", an audio-visual dataset consisting of excerpts from the drama "Romeo and Juliet", captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-person conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the actors' mouths, and dialogue transcriptions. We believe the community will benefit from this dataset, as it can assist multidisciplinary research. Possible uses of the dataset are discussed.