Can a Transformer perform $2\mathrm{D}$ object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the $2\mathrm{D}$ spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the na\"ive Vision Transformer with the fewest possible modifications and inductive biases. We find that YOLOS pre-trained only on the mid-sized ImageNet-$1k$ dataset can already achieve competitive object detection performance on COCO, \textit{e.g.}, YOLOS-Base, directly adopted from BERT-Base, achieves $42.0$ box AP. We also discuss the impacts as well as limitations of current pre-training schemes and model scaling strategies for Transformers in vision, through the lens of object detection. Code and model weights are available at \url{https://github.com/hustvl/YOLOS}.
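To make the core idea concrete, the following is a minimal, illustrative PyTorch sketch (not the authors' implementation; see \url{https://github.com/hustvl/YOLOS} for the reference code) of how a plain Vision Transformer can be turned into a detector: a fixed set of learnable [DET] tokens is appended to the patch-token sequence, and class and box predictions are read from their outputs. The class \texttt{ToyYOLOS}, the argument names, and the choice of $100$ detection tokens with ViT-Base dimensions are assumptions for illustration; positional embeddings and the DETR-style bipartite-matching loss are omitted for brevity.

\begin{verbatim}
# A minimal sketch of the YOLOS idea, not the authors' code:
# append learnable [DET] tokens to the ViT patch sequence and
# read detections from their outputs. Positional embeddings and
# the set-prediction (bipartite matching) loss are omitted.
import torch
import torch.nn as nn


class ToyYOLOS(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 num_det_tokens=100, num_classes=91):
        super().__init__()
        # Patch embedding: 16x16 patches of a 3-channel image -> embed_dim tokens.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        # Learnable detection tokens replacing the single [CLS] token.
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Prediction heads applied only to the [DET] token outputs.
        self.class_head = nn.Linear(embed_dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(embed_dim, 4)                  # (cx, cy, w, h)

    def forward(self, images):
        patches = self.patch_embed(images).flatten(2).transpose(1, 2)  # [B, N, C]
        det = self.det_tokens.expand(images.size(0), -1, -1)           # [B, 100, C]
        tokens = self.encoder(torch.cat([patches, det], dim=1))
        det_out = tokens[:, -det.size(1):]                             # [B, 100, C]
        return self.class_head(det_out), self.box_head(det_out).sigmoid()


if __name__ == "__main__":
    model = ToyYOLOS(depth=2)  # shallow depth just to keep the demo fast
    logits, boxes = model(torch.randn(2, 3, 224, 224))
    print(logits.shape, boxes.shape)  # [2, 100, 92], [2, 100, 4]
\end{verbatim}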