Active speaker detection (ASD) and virtual cinematography (VC) can significantly improve the remote user experience of a video conference by automatically panning, tilting, and zooming a video conferencing camera: users subjectively rate an expert cinematographer's video significantly higher than unedited video. We describe a new automated ASD and VC system that performs within 0.3 MOS of an expert cinematographer, based on subjective ratings on a 1-5 scale. The system uses a 4K wide-FOV camera, a depth camera, and a microphone array; it extracts features from each modality and trains an ASD with AdaBoost, yielding a classifier that is very efficient and runs in real time. The VC is likewise trained with machine learning to optimize the subjective quality of the overall experience. To avoid distracting the in-room participants and to reduce switching latency, the system has no moving parts: the VC works by cropping and zooming the 4K wide-FOV video stream. The system was tuned and evaluated with extensive crowdsourcing on a dataset of N=100 meetings, each 2-5 minutes in length.
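To make the ASD training concrete, the sketch below shows how a per-candidate active-speaker classifier might be trained with AdaBoost over concatenated multimodal features. The feature names, dimensions, and synthetic labels are illustrative assumptions, not the paper's actual feature set (which is derived from the 4K camera, depth camera, and microphone array).

```python
# Hedged sketch: active-speaker classification with AdaBoost.
# Assumption: each candidate head position yields a small multimodal
# feature vector, e.g. [lip-motion score, face confidence,
# depth (m), sound-source angle error]; real features differ.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

n_samples, n_features = 1000, 4
X = rng.normal(size=(n_samples, n_features))
# Synthetic labels: "speaking" when the visual (lip-motion) and
# audio (angle-agreement) cues jointly exceed a threshold.
y = ((X[:, 0] + X[:, 3]) > 0).astype(int)

# Boosted decision stumps are cheap to evaluate, which is one reason
# an AdaBoost ASD can score candidates in real time.
asd = AdaBoostClassifier(n_estimators=50)
asd.fit(X[:800], y[:800])
accuracy = asd.score(X[800:], y[800:])
print(f"held-out accuracy: {accuracy:.2f}")
```

At run time, the trained classifier would be applied to each candidate position every frame, and the highest-scoring candidate drives the virtual camera's crop of the 4K stream.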