The ability to choose an appropriate camera view from among multiple cameras plays a vital role in the delivery of TV shows. However, it is hard to identify the statistical patterns behind view selection and to apply intelligent processing, because high-quality training data is lacking. To address this issue, we first collect a novel benchmark for this setting that covers four diverse scenarios (concerts, sports games, gala shows, and contests), where each scenario contains 6 synchronized tracks recorded by different cameras. The benchmark comprises 88 hours of raw video, from which 14 hours of edited video were produced. Based on this benchmark, we further propose a new approach, the temporal and contextual transformer, which uses clues from historical shots and from the other views to make shot transition decisions and to predict which view to use. Extensive experiments show that our method outperforms existing methods on the proposed multi-camera editing benchmark.
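To make the idea of combining temporal clues (historical shots) with contextual clues (the other candidate views) more concrete, here is a toy sketch in plain Python. It is not the model proposed in the paper: it has no learned weights, it uses bare scaled dot-product attention in place of a trained transformer, and the function names (`attend`, `select_view`), the `switch_margin` parameter, and the embedding vectors are all hypothetical and introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def select_view(history, candidates, current_view, switch_margin=0.01):
    """Toy view selection (hypothetical; not the authors' method).

    history:      embeddings of previously aired shots, oldest first
    candidates:   one embedding per camera view at the current time step
    current_view: index of the view that is currently on air
    Returns (switch, best_view, probabilities).
    """
    # Temporal clue: summarize past shots, using the latest shot as the query.
    temporal = attend(history[-1], history, history)
    # Contextual clue: each candidate view attends over all candidate views.
    contextual = [attend(c, candidates, candidates) for c in candidates]
    # Score each view by its agreement with the temporal summary.
    scale = math.sqrt(len(temporal))
    probs = softmax([dot(temporal, c) / scale for c in contextual])
    best = max(range(len(probs)), key=probs.__getitem__)
    # Transition decision: cut only if another view wins by a clear margin.
    switch = best != current_view and probs[best] - probs[current_view] > switch_margin
    return switch, best, probs


if __name__ == "__main__":
    # Made-up 2-D embeddings; a real system would use learned video features.
    history = [[1.0, 0.0], [1.0, 0.0]]
    candidates = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    switch, best, probs = select_view(history, candidates, current_view=0)
    print("switch:", switch, "best view:", best)
```

The sketch separates the two outputs mentioned in the abstract: whether to make a shot transition (`switch`) and which view to use (`best`). A trained model would presumably learn both from the edited videos rather than rely on a fixed margin.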