Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, and state-of-the-art approaches adopt homography transformations to project multi-view features onto the ground plane. However, we find that these 2D transformations do not take the object's height into account. As a result, features along the vertical direction of the same object are likely not projected onto the same ground-plane point, which leads to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. Additionally, because different kinds of objects (e.g., humans vs. cattle) have different shapes on the ground plane, we introduce oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multi-view 2D detection and multi-view 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is competitive with state-of-the-art approaches. Code and MultiviewC are released at https://github.com/Robert-Mar/VFA.
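The voxel-to-view association described above can be illustrated with a minimal sketch for a single camera. It is not the released implementation: the helper names (`project_point`, `sample_nearest`, `aggregate_column`), the 3x4 pinhole projection matrix `P`, the nearest-neighbour lookup, and the plain averaging along the vertical line are all assumptions made for illustration, and the actual method may sample and aggregate features differently.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
FeatureMap = Sequence[Sequence[Sequence[float]]]  # indexed as [row][col][channel]


def project_point(P: Matrix, xyz: Tuple[float, float, float]) -> Optional[Tuple[float, float]]:
    """Project a 3D world point with a 3x4 camera matrix P (assumed known).

    Returns pixel coordinates (u, v), or None if the point is not in front
    of the camera.
    """
    hom = (xyz[0], xyz[1], xyz[2], 1.0)
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    if w <= 1e-9:
        return None
    return u / w, v / w


def sample_nearest(feature_map: FeatureMap, uv: Tuple[float, float]) -> Optional[Sequence[float]]:
    """Nearest-neighbour feature lookup; None if the pixel is outside the map."""
    col, row = int(round(uv[0])), int(round(uv[1]))
    if 0 <= row < len(feature_map) and 0 <= col < len(feature_map[0]):
        return feature_map[row][col]
    return None


def aggregate_column(P: Matrix, feature_map: FeatureMap,
                     x: float, y: float, heights: Sequence[float]) -> Optional[List[float]]:
    """Average the 2D features hit by the voxels stacked above ground cell (x, y).

    Averaging is a stand-in for whatever aggregation the method actually uses.
    """
    hits = []
    for z in heights:
        uv = project_point(P, (x, y, z))
        if uv is None:
            continue
        feat = sample_nearest(feature_map, uv)
        if feat is not None:
            hits.append(feat)
    if not hits:
        return None
    channels = len(hits[0])
    return [sum(h[c] for h in hits) / len(hits) for c in range(channels)]
```

Calling `aggregate_column` with `heights=[0.0]` uses only the ground-plane point, which is the only point a ground-plane homography maps correctly. Passing several heights gathers features along the whole vertical line above the ground cell. In a multi-view setting the same routine would be run per camera and the results fused.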