Due to the success of Bidirectional Encoder Representations from Transformers (BERT) in natural language processing (NLP), the multi-head attention Transformer has become increasingly prevalent in computer vision (CV) research. However, applying it to complex tasks such as object detection and semantic segmentation remains a challenge. Although several Transformer-based architectures, such as DETR and ViT-FRCNN, have been proposed for object detection, they inevitably suffer from reduced detection accuracy and computational efficiency because of the enormous number of learnable parameters and the heavy computational complexity incurred by the conventional self-attention operation. To alleviate these issues, we present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD), which is built on top of the Convolutional vision Transformer (CvT) with an efficient Attentive Single Shot MultiBox Detector (ASSD). We provide comprehensive empirical evidence showing that CvT-ASSD achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO. Code has been released in a public GitHub repository at https://github.com/albert-jin/CvT-ASSD.
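To illustrate the overall composition described above (a CvT-style convolutional backbone feeding an attention-weighted SSD prediction head), the following is a minimal PyTorch sketch. The class names, stage widths, anchor counts, and the simple sigmoid attention unit are illustrative assumptions, not the authors' configuration; the actual implementation is in the linked repository.

```python
# Hedged sketch: a simplified CvT-style backbone combined with an attentive
# SSD-style multi-box head. All hyperparameters below are assumptions for
# illustration and do not reproduce the CvT-ASSD paper's exact architecture.
import torch
import torch.nn as nn


class ConvTokenStage(nn.Module):
    """One simplified CvT-like stage: convolutional token embedding plus a
    depthwise convolutional projection (standing in for conv attention)."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.mix = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, groups=out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        x = self.act(self.norm(self.embed(x)))
        return x + self.act(self.mix(x))  # residual mixing within the stage


class AttentiveHead(nn.Module):
    """SSD-style per-scale predictor that re-weights features with a
    lightweight attention map before classification and box regression."""
    def __init__(self, in_ch, num_classes, num_anchors=4):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(in_ch, 1, kernel_size=1), nn.Sigmoid())
        self.cls = nn.Conv2d(in_ch, num_anchors * num_classes, kernel_size=3, padding=1)
        self.loc = nn.Conv2d(in_ch, num_anchors * 4, kernel_size=3, padding=1)

    def forward(self, feat):
        feat = feat * self.attn(feat)           # attention-weighted features
        return self.cls(feat), self.loc(feat)   # class scores and box offsets


class CvTASSDSketch(nn.Module):
    """Multi-scale detector: each backbone stage feeds an attentive head."""
    def __init__(self, num_classes=21, widths=(64, 128, 256)):
        super().__init__()
        chans = [3] + list(widths)
        self.stages = nn.ModuleList(
            ConvTokenStage(chans[i], chans[i + 1]) for i in range(len(widths)))
        self.heads = nn.ModuleList(AttentiveHead(w, num_classes) for w in widths)

    def forward(self, images):
        outputs, x = [], images
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            outputs.append(head(x))  # SSD-style predictions at each scale
        return outputs


if __name__ == "__main__":
    model = CvTASSDSketch()
    preds = model(torch.randn(1, 3, 300, 300))
    for cls_scores, box_offsets in preds:
        print(cls_scores.shape, box_offsets.shape)
```

The intent of the sketch is only to show where the attention re-weighting sits relative to the SSD predictors; matching default boxes, losses, and the full CvT stages with convolutional projections for queries, keys, and values are omitted.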