Research connecting text and images has recently seen several breakthroughs, with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection between text and other visual modalities, such as lidar data, has received less attention, prohibited by the lack of text-lidar datasets. In this work, we propose LidarCLIP, a mapping from automotive point clouds to a pre-existing CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder with the image CLIP embeddings, effectively relating text and lidar data with the image domain as an intermediary. We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is generally on par with image-based retrieval, but with complementary strengths and weaknesses. By combining image and lidar features, we improve upon both single-modality methods and enable a targeted search for challenging detection scenarios under adverse sensor conditions. We also use LidarCLIP as a tool to investigate fundamental lidar capabilities through natural language. Finally, we leverage our compatibility with CLIP to explore a range of applications, such as point cloud captioning and lidar-to-image generation, without any additional training. We hope LidarCLIP can inspire future work to dive deeper into connections between text and point cloud understanding. Code and trained models are available at https://github.com/atonderski/lidarclip.
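To make the supervision idea concrete, below is a minimal sketch of training a point cloud encoder against frozen CLIP image embeddings, assuming a PyTorch setup and the OpenAI `clip` package. The `PointCloudEncoder` class and the dataloader of (point cloud, image) pairs are hypothetical placeholders for illustration; only the frozen image encoder and the embedding-matching objective follow the description above, and the actual encoder and loss used in the paper may differ.

```python
# Minimal sketch: distill frozen CLIP image embeddings into a lidar encoder.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)


class PointCloudEncoder(torch.nn.Module):
    """Hypothetical stand-in for the lidar encoder.

    Any network mapping a point cloud to a CLIP-sized vector fits here;
    this PointNet-style pooling is purely illustrative.
    """

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(4, 256), torch.nn.ReLU(), torch.nn.Linear(256, embed_dim)
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 4) with x, y, z, intensity; max-pool over the N points.
        return self.mlp(points).max(dim=1).values


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
clip_model.eval()  # the CLIP image encoder stays frozen throughout

lidar_encoder = PointCloudEncoder(embed_dim=512).to(device)
optimizer = torch.optim.AdamW(lidar_encoder.parameters(), lr=1e-4)

# `dataloader` is assumed to yield paired lidar sweeps and CLIP-preprocessed images.
for point_clouds, images in dataloader:
    with torch.no_grad():
        target = clip_model.encode_image(images.to(device)).float()

    pred = lidar_encoder(point_clouds.to(device))

    # Pull the lidar embedding toward the paired image's CLIP embedding,
    # here via negative cosine similarity (an MSE loss is another option).
    loss = 1 - F.cosine_similarity(pred, target, dim=-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, text-to-lidar retrieval amounts to ranking lidar embeddings by cosine similarity against `clip_model.encode_text(clip.tokenize(["rain at night"]))`, and image and lidar embeddings can be fused (e.g., averaged) before the comparison, since all three modalities share the same CLIP space.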