Remote sensing scene classification plays a key role in Earth observation by enabling the automatic identification of land use and land cover (LULC) patterns from aerial and satellite imagery. Despite recent progress with convolutional neural networks (CNNs) and vision transformers (ViTs), the task remains challenging due to variations in spatial resolution, viewpoint, orientation, and background conditions, which often reduce the generalization ability of existing models. To address these challenges, this paper proposes a lightweight architecture based on the convolutional mixer paradigm. The model alternates between spatial mixing through depthwise convolutions at multiple scales and channel mixing through pointwise operations, enabling efficient extraction of both local and contextual information while keeping the number of parameters and computations low. Extensive experiments were conducted on the AID and EuroSAT benchmarks. The proposed model achieved overall accuracy, average accuracy, and Kappa values of 74.7%, 74.57%, and 73.79 on the AID dataset, and 93.90%, 93.93%, and 93.22 on EuroSAT, respectively. These results demonstrate that the proposed approach offers a favorable trade-off between accuracy and computational efficiency compared with widely used CNN- and transformer-based models. Code will be publicly available at: https://github.com/mqalkhatib/SceneMixer
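To make the convolutional mixer idea concrete, the following is a minimal PyTorch sketch of one block that alternates multi-scale depthwise (spatial) mixing with pointwise (channel) mixing. The kernel sizes, normalization layers, residual layout, and class name are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class MultiScaleMixerBlock(nn.Module):
    """Illustrative convolutional-mixer block (hypothetical configuration):
    depthwise convolutions at several kernel sizes mix spatial information,
    then a 1x1 (pointwise) convolution mixes channels."""

    def __init__(self, dim, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise branch per kernel size (spatial mixing at multiple scales).
        self.spatial = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        self.norm1 = nn.BatchNorm2d(dim)
        # Pointwise convolution (channel mixing).
        self.channel = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm2 = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        # Spatial mixing: sum the multi-scale depthwise branches, add a residual.
        x = x + self.norm1(self.act(sum(branch(x) for branch in self.spatial)))
        # Channel mixing: pointwise convolution, add a residual.
        x = x + self.norm2(self.act(self.channel(x)))
        return x


if __name__ == "__main__":
    block = MultiScaleMixerBlock(dim=64)
    out = block(torch.randn(1, 64, 56, 56))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```

Because all spatial mixing uses grouped (depthwise) convolutions and channel mixing uses 1x1 convolutions, the parameter count grows roughly linearly in the channel dimension, which is consistent with the lightweight design goal stated above.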