As scene information, including objectness and scene type, is important for people with visual impairment, in this work we present an efficient multi-task perception system for scene parsing and recognition. Built on a compact ResNet backbone, our network architecture has two paths with shared parameters. The semantic segmentation path integrates fast attention, with the aim of harvesting long-range contextual information in an efficient manner. Simultaneously, the scene recognition path infers the scene type by passing the semantic features through semantic-driven attention networks and combining the resulting semantic representations with the RGB-derived representations via a gated attention module. In the experiments, we verify the system's accuracy and efficiency on both public datasets and real-world scenes. The system runs on a wearable belt with an Intel RealSense LiDAR camera and an NVIDIA Jetson AGX Xavier processor, so it can accompany visually impaired people and provide assistive scene information during their navigation tasks.
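The gated attention fusion of semantic and RGB representations can be sketched as follows. This is a minimal illustrative sketch only: the scalar sigmoid gate, the element-wise convex combination, and all function and parameter names (`gated_fusion`, `w_rgb`, `w_sem`, `bias`) are assumptions for exposition, not the exact formulation used in the system.

```python
import math


def sigmoid(x):
    """Standard logistic function, used here to squash the gate score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, sem_feat, w_rgb, w_sem, bias):
    """Fuse an RGB feature vector with a semantic feature vector via a scalar gate.

    Sketch (assumed form):
        gate = sigmoid(w_rgb . rgb_feat + w_sem . sem_feat + bias)
        out  = gate * rgb_feat + (1 - gate) * sem_feat   (element-wise)

    The gate learns how much to trust each branch; gate -> 1 keeps the RGB
    representation, gate -> 0 keeps the semantic-driven representation.
    """
    score = sum(wr * r for wr, r in zip(w_rgb, rgb_feat))
    score += sum(ws * s for ws, s in zip(w_sem, sem_feat))
    gate = sigmoid(score + bias)
    fused = [gate * r + (1.0 - gate) * s for r, s in zip(rgb_feat, sem_feat)]
    return fused, gate


# Usage: with zero weights and bias, the gate is neutral (0.5) and the
# fused vector is the element-wise average of the two branches.
fused, gate = gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], 0.0)
```

In a real network the gate would typically be produced per channel by a small learned layer; a single scalar gate is used here only to keep the sketch self-contained.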