In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Unlike common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario, i.e., inferring and locating the object implicitly specified by a given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt the multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively. This shows that ViT-Adapter is also an effective paradigm for adapting unified perception models to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023.
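For reference, submissions are ranked by the Intersection-over-Union (IoU) between the predicted and ground-truth boxes. The sketch below illustrates this metric; it assumes boxes are given in (x1, y1, x2, y2) pixel coordinates, which is an assumption about the exact data format rather than a detail stated in this report.

```python
# Minimal sketch of box IoU, the challenge metric.
# Assumes (x1, y1, x2, y2) box coordinates (format is an assumption here).
def box_iou(pred, gt):
    """Intersection-over-Union between a predicted and a ground-truth box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction overlapping half of the ground-truth box -> IoU ~ 0.333.
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```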