Cats and humans differ in ocular anatomy. Most notably, the domestic cat (Felis catus) has vertically elongated pupils linked to ambush predation; yet how such specializations manifest in downstream visual representations remains incompletely understood. We present a unified, frozen-encoder benchmark that quantifies feline-human cross-species representational alignment in the wild, across convolutional networks, supervised Vision Transformers, windowed transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel Alignment (linear and RBF) and Representational Similarity Analysis, with additional distributional and stability tests reported in the paper. Across models, DINO ViT-B/16 attains the strongest alignment (mean CKA-RBF $\approx 0.814$, mean CKA-linear $\approx 0.745$, mean RSA $\approx 0.698$), peaking at early blocks, indicating that token-level self-supervision induces early-stage features that bridge species-specific statistics. Supervised ViTs are competitive on CKA yet show weaker geometric correspondence than DINO (e.g., ViT-B/16 RSA $\approx 0.53$ at block 8; ViT-L/16 RSA $\approx 0.47$ at block 14), revealing depth-dependent divergences between CKA similarity and RSA-measured representational geometry. CNNs remain strong baselines but fall below plain ViTs on alignment, and windowed transformers underperform plain ViTs, implicating architectural inductive biases in cross-species alignment. These results indicate that self-supervision coupled with ViT inductive biases yields representational geometries that more closely align feline and human visual systems than widely used CNNs and windowed Transformers, providing testable neuroscientific hypotheses about where and how cross-species visual computations converge. We release our code and dataset for reference and reproducibility.
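For readers unfamiliar with the three alignment scores named above, the following is a minimal NumPy sketch of linear CKA, RBF CKA, and RSA for a single layer. It is not the released code. It assumes two feature matrices whose rows are matched stimuli (one row per image, one column per feature); how feline and human inputs are actually paired is not stated here. The names `human_feats`, `cat_feats`, and `sigma_frac`, the median-based bandwidth rule, and the choice of correlation-distance RDMs compared by Spearman rank correlation are all illustrative assumptions.

```python
import numpy as np


def _center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def _cka_from_grams(K, L):
    """CKA = HSIC(K, L) / sqrt(HSIC(K, K) * HSIC(L, L)) on centered Grams."""
    Kc, Lc = _center(K), _center(L)
    hsic_kl = np.sum(Kc * Lc)
    return hsic_kl / np.sqrt(np.sum(Kc * Kc) * np.sum(Lc * Lc))


def linear_cka(X, Y):
    """Linear CKA between X (n x d1) and Y (n x d2), rows matched."""
    return _cka_from_grams(X @ X.T, Y @ Y.T)


def _rbf_gram(X, sigma_frac):
    """RBF Gram matrix; sigma^2 = sigma_frac^2 * median squared distance.

    The bandwidth rule is an assumption made for this sketch.
    """
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    med = np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-d2 / (2.0 * sigma_frac ** 2 * med))


def rbf_cka(X, Y, sigma_frac=0.5):
    """CKA with an RBF kernel applied to each representation."""
    return _cka_from_grams(_rbf_gram(X, sigma_frac), _rbf_gram(Y, sigma_frac))


def rsa(X, Y):
    """RSA score: Spearman correlation between the two RDMs.

    RDM entries are 1 - Pearson correlation between stimulus rows.
    Ranks come from a double argsort, which ignores ties (acceptable
    for continuous features, an assumption of this sketch).
    """
    iu = np.triu_indices(X.shape[0], k=1)
    a = (1.0 - np.corrcoef(X))[iu]
    b = (1.0 - np.corrcoef(Y))[iu]
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]


# Synthetic stand-ins for one layer's features under the two conditions.
rng = np.random.default_rng(0)
human_feats = rng.normal(size=(50, 32))
cat_feats = human_feats @ rng.normal(size=(32, 16)) + 0.1 * rng.normal(size=(50, 16))

print("CKA-linear:", linear_cka(human_feats, cat_feats))
print("CKA-RBF:   ", rbf_cka(human_feats, cat_feats))
print("RSA:       ", rsa(human_feats, cat_feats))
```

Applying these functions to each layer's features and averaging over layers would give per-model means of the kind reported above. Because CKA compares centered Gram matrices while RSA compares the rank order of pairwise dissimilarities, the two can disagree at a given depth.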