3D human pose estimation in multi-view operating room (OR) videos is a valuable asset for person tracking and action recognition. However, the surgical environment makes pose estimation challenging due to sterile clothing, frequent occlusions, and limited public data. Methods specifically designed for the OR are generally based on the fusion of poses detected in multiple camera views. Typically, a 2D pose estimator such as a convolutional neural network (CNN) detects joint locations, which are then projected to 3D and fused over all camera views. However, accurate detection in 2D does not guarantee accurate localisation in 3D space. In this work, we propose to optimise directly for localisation in 3D by training 2D CNNs end-to-end with a 3D loss that is backpropagated through each camera's projection parameters. Using videos from the MVOR dataset, we show that this end-to-end approach outperforms optimisation in 2D space.
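The following is a minimal PyTorch sketch of the core idea, not the paper's exact implementation: per-view 2D joint estimates are lifted to 3D by differentiable triangulation through the known camera projection matrices, so a loss computed in 3D space flows back into the 2D estimator. All names, shapes, and the use of DLT triangulation are illustrative assumptions.

```python
# Sketch: differentiable multi-view triangulation with a 3D loss.
# Assumes calibrated cameras with fixed 3x4 projection matrices.
import torch

def triangulate_dlt(points_2d: torch.Tensor, proj_mats: torch.Tensor) -> torch.Tensor:
    """Differentiable DLT triangulation of one joint from V views.

    points_2d: (V, 2) predicted pixel coordinates per view.
    proj_mats: (V, 3, 4) camera projection matrices.
    Returns:   (3,) estimated 3D point in world coordinates.
    """
    rows = []
    for v in range(points_2d.shape[0]):
        P = proj_mats[v]
        u, w = points_2d[v, 0], points_2d[v, 1]
        # Each view adds two linear constraints on the homogeneous 3D point X:
        # u * (P[2] @ X) - (P[0] @ X) = 0 and w * (P[2] @ X) - (P[1] @ X) = 0
        rows.append(u * P[2] - P[0])
        rows.append(w * P[2] - P[1])
    A = torch.stack(rows)                  # (2V, 4)
    # The solution is the right singular vector of the smallest singular
    # value. torch.linalg.svd is differentiable, so gradients reach points_2d.
    _, _, Vh = torch.linalg.svd(A)
    X = Vh[-1]
    return X[:3] / X[3]

# Toy usage with V=3 views and J=17 joints. `pred_2d` stands in for the
# output of the 2D pose CNN (e.g. a soft-argmax over joint heatmaps).
V, J = 3, 17
proj_mats = torch.randn(V, 3, 4)                     # in practice: calibration data
pred_2d = torch.randn(V, J, 2, requires_grad=True)   # stands in for CNN output
gt_3d = torch.randn(J, 3)                            # 3D ground-truth joints

pred_3d = torch.stack([triangulate_dlt(pred_2d[:, j], proj_mats) for j in range(J)])
loss = ((pred_3d - gt_3d) ** 2).sum(-1).sqrt().mean()  # mean per-joint 3D error
loss.backward()  # 3D loss backpropagates through the projections into 2D space
```

The design point the sketch illustrates is that the 2D estimator is supervised by where its predictions land in 3D after projection and fusion, rather than by 2D keypoint error alone; a 2D detection that is slightly off in a direction the cameras cannot disambiguate is penalised more than one that triangulates well.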