Aerial Scene Parsing: From Tile-level Scene Classification to Pixel-wise Semantic Labeling

Given an aerial image, aerial scene parsing (ASP) aims to interpret the semantic structure of the image content, e.g., by assigning a semantic label to every pixel of the image. With the popularization of data-driven methods, the past decades have witnessed promising progress on ASP with high-resolution aerial images, approaching the problem through either tile-level scene classification or segmentation-based image analysis. However, the former scheme often produces results with tile-wise boundaries, while the latter must handle the complex modeling process from pixels to semantics, which often requires large-scale, well-annotated image samples with pixel-wise semantic labels. In this paper, we address these issues in ASP, with perspectives ranging from tile-level scene classification to pixel-wise semantic labeling. Specifically, we first revisit aerial image interpretation through a literature review. We then present Million-AID, a large-scale scene classification dataset that contains one million aerial images. With the presented dataset, we also report benchmarking experiments using classical convolutional neural networks (CNNs). Finally, we perform ASP by unifying tile-level scene classification and object-based image analysis to achieve pixel-wise semantic labeling. Extensive experiments show that Million-AID is a challenging yet useful dataset, which can serve as a benchmark for evaluating newly developed algorithms. When transferring knowledge from Million-AID, fine-tuned CNN models pretrained on Million-AID consistently perform better than those pretrained on ImageNet for aerial scene classification. Moreover, our hierarchical multi-task learning method achieves state-of-the-art pixel-wise classification on the challenging GID, bridging tile-level scene classification and pixel-wise semantic labeling for aerial image interpretation.
