Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. Code and pre-trained models are available at https://github.com/atonderski/lidarclip.
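To make the supervision scheme concrete, below is a minimal sketch of the idea described in the abstract: a lidar point cloud encoder is trained so that its embedding matches the frozen CLIP image embedding of the paired camera image. This is not the authors' exact implementation; `PointCloudEncoder` is a hypothetical stand-in for the lidar backbone, the CLIP model is loaded via the open_clip package, and the cosine-based loss is only one plausible choice of alignment objective.

```python
import torch
import torch.nn.functional as F
import open_clip

# Frozen CLIP image encoder provides the target embeddings.
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad_(False)

lidar_encoder = PointCloudEncoder(embed_dim=512)  # hypothetical lidar backbone
optimizer = torch.optim.AdamW(lidar_encoder.parameters(), lr=1e-4)

def training_step(points, images):
    """points: batched lidar point clouds; images: paired, CLIP-preprocessed camera images."""
    with torch.no_grad():
        target = clip_model.encode_image(images)   # (B, 512) frozen image embeddings
    pred = lidar_encoder(points)                   # (B, 512) lidar embeddings
    # Align lidar embeddings with the paired image embeddings (assumed loss choice).
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the lidar embeddings land in the pre-existing CLIP space, they can then be compared directly against CLIP text embeddings for retrieval or zero-shot classification, exactly as one would with CLIP image features.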