Deep representations across modalities are inherently intertwined. In this paper, we systematically analyze the spectral characteristics of a range of semantic and pixel encoders. Our study uncovers a rarely explored correspondence between an encoder's feature spectrum and its functional role: semantic encoders primarily capture low-frequency components that encode abstract meaning, whereas pixel encoders additionally retain high-frequency information that conveys fine-grained detail. This finding offers a unifying perspective that ties an encoder's behavior to its underlying spectral structure. We call it the Prism Hypothesis: each data modality can be viewed as a projection of the natural world onto a shared feature spectrum, much as a prism disperses light into a spectrum. Building on this insight, we propose Unified Autoencoding (UAE), a model that harmonizes semantic structure and pixel detail via a frequency-band modulator, allowing the two to coexist in one representation. Extensive experiments on the ImageNet and MS-COCO benchmarks show that UAE unifies semantic abstraction and pixel-level fidelity in a single latent space with state-of-the-art performance.
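To make the low-frequency versus high-frequency distinction concrete, the following is a minimal toy sketch, not the analysis pipeline or the frequency-band modulator used in the paper. It assumes a one-dimensional signal as a stand-in for a feature map, and the function names (`dft`, `low_band_energy_fraction`), the cutoff of 8 cycles, and the two synthetic signals are all hypothetical choices made for illustration. It measures what fraction of a signal's non-DC spectral energy falls at or below a cutoff frequency.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def low_band_energy_fraction(signal, cutoff):
    """Fraction of non-DC spectral energy at frequencies <= cutoff (cycles per window)."""
    n = len(signal)
    spectrum = dft(signal)
    low = 0.0
    total = 0.0
    for k in range(1, n):  # skip the DC term at k = 0
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        energy = abs(spectrum[k]) ** 2
        total += energy
        if freq <= cutoff:
            low += energy
    # Treat a numerically empty spectrum as having no low-band energy.
    return low / total if total > 1e-12 else 0.0


n = 64
# Hypothetical stand-ins (assumptions, not data from the paper):
# a slowly varying "semantic-like" signal, and the same signal with a
# fast, lower-amplitude "detail" component added on top.
coarse = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
detailed = [c + 0.5 * math.sin(2 * math.pi * 20 * t / n) for t, c in enumerate(coarse)]

coarse_frac = low_band_energy_fraction(coarse, cutoff=8)
detailed_frac = low_band_energy_fraction(detailed, cutoff=8)
print(f"coarse:   {coarse_frac:.3f}")
print(f"detailed: {detailed_frac:.3f}")
```

In this toy setting the coarse signal places essentially all of its energy in the low band (a fraction close to 1.0), while the detailed signal drops to about 0.8 because the added fine-grained component carries one fifth of the total energy. A real analysis of encoder features would presumably use a 2D transform over spatial feature maps, but the band-energy idea is the same.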