Aerial scene recognition is a fundamental research problem in interpreting high-resolution aerial imagery. Over the past few years, most studies have focused on classifying an image into a single scene category, whereas in real-world scenarios a single image more often contains multiple scenes. In this paper, we therefore investigate a more practical yet underexplored task: multi-scene recognition in single images. To this end, we create a large-scale dataset, called MultiScene, composed of 100,000 unconstrained high-resolution aerial images. Because manually labeling such images is extremely arduous, we resort to low-cost annotations from crowdsourcing platforms, e.g., OpenStreetMap (OSM). However, OSM data can be incomplete and incorrect, which introduces noise into the image labels. To address this issue, we visually inspect 14,000 images and correct their scene labels, yielding a subset of cleanly annotated images, named MultiScene-Clean. With this subset, we can develop and evaluate deep networks for multi-scene recognition using clean data. Moreover, we provide crowdsourced annotations for all images to support the study of network learning with noisy labels. We conduct experiments with extensive baseline models on both MultiScene-Clean and MultiScene to offer benchmarks for multi-scene recognition in single images and for learning from noisy labels in this task, respectively. To facilitate progress, we make our dataset and trained models available at https://github.com/Hua-YS/Multi-Scene-Recognition.