What is an image, and how should its latent features be extracted? Convolutional Networks (ConvNets) treat an image as pixels organized on a rectangular grid and extract features by convolution over local regions; Vision Transformers (ViTs) treat an image as a sequence of patches and extract features by attention over a global range. In this work, we introduce a straightforward and promising paradigm for visual representation, called Context Clusters. Context Clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm. In detail, each point carries its raw feature (e.g., color) and positional information (e.g., coordinates), and a simplified clustering algorithm is employed to group the points and extract deep features hierarchically. Our CoCs are convolution- and attention-free and rely only on the clustering algorithm for spatial interaction. Owing to this simple design, CoCs offer gratifying interpretability through visualization of the clustering process. CoCs aim to provide a new perspective on images and visual representation, which may find broad application in different domains and yield profound insights. Although we do not target SOTA performance, CoCs still achieve results comparable to or even better than ConvNets or ViTs on several benchmarks. Code is available at: https://github.com/ma-xu/Context-Cluster.
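To make the "set of points" view concrete, the following is a minimal sketch of one grouping step, not the authors' implementation. It assumes NumPy, cosine similarity as the affinity measure, and cluster centers initialized by averaging points over a regular grid; the abstract specifies none of these choices, and the function names `image_to_points`, `grid_centers`, and `cluster_step` are hypothetical.

```python
import numpy as np

def image_to_points(image):
    """Flatten an H x W x C image into H*W points of [features..., x, y]."""
    h, w, _ = image.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Normalise coordinates to [-0.5, 0.5] so they sit on a scale similar to colours.
    coords = np.stack([xs / max(w - 1, 1) - 0.5, ys / max(h - 1, 1) - 0.5], axis=-1)
    return np.concatenate([image, coords], axis=-1).reshape(h * w, -1)

def grid_centers(points, h, w, grid=2):
    """Assumed initialisation: average the points in each cell of a grid x grid layout.

    Requires h and w to be divisible by grid.
    """
    d = points.shape[1]
    cells = points.reshape(grid, h // grid, grid, w // grid, d)
    return cells.mean(axis=(1, 3)).reshape(grid * grid, d)

def cluster_step(points, centers):
    """Assign each point to its most similar center, then re-average each cluster."""
    def unit(v):
        return v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)

    similarity = unit(points) @ unit(centers).T   # (N, K) cosine similarities
    assign = similarity.argmax(axis=1)            # hard assignment per point
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = points[assign == k]
        if len(members):                          # keep the old center if a cluster is empty
            new_centers[k] = members.mean(axis=0)
    return assign, new_centers

# Usage on a random 8 x 8 RGB image (illustrative data only).
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
points = image_to_points(image)                   # 64 points, 5 values each
centers = grid_centers(points, 8, 8, grid=2)      # 4 initial centers
assign, centers = cluster_step(points, centers)
```

Because every point carries its own coordinates, the grouping depends on both appearance and position without any convolution or attention. Reshaping `assign` back to the image grid gives a cluster map that can be visualized directly, which is the kind of interpretability the abstract refers to.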