Comprehending the rich semantics in an image and arranging them in linguistic order are both essential for composing a visually grounded and linguistically coherent description in image captioning. Modern techniques commonly rely on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of those semantics under-exploited. In this paper, we propose a new Transformer-style architecture, namely Comprehending and Ordering Semantics Networks (COS-Net), that unifies an enriched semantic comprehending process and a learnable semantic ordering process in a single architecture. Technically, we first use a cross-modal retrieval model to retrieve the sentences relevant to each image, and all words in the retrieved sentences are taken as primary semantic cues. Next, a novel semantic comprehender is devised to filter out the irrelevant semantic words in the primary semantic cues while inferring the missing relevant semantic words that are visually grounded in the image. After that, we feed all the screened and enriched semantic words into a semantic ranker, which learns to arrange the semantic words in linguistic order as humans do. This sequence of ordered semantic words is then integrated with the visual tokens of the image to trigger sentence generation. Empirical evidence shows that COS-Net clearly surpasses the state-of-the-art approaches on COCO and achieves the best CIDEr score to date, 141.1%, on the Karpathy test split. Source code is available at \url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet}.
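The data flow described above (retrieve sentences, screen and enrich the semantic words, then order them) can be illustrated with a minimal toy sketch. This is not the COS-Net implementation: in the paper the comprehender and ranker are learned modules, whereas here they are replaced by hand-written lookup tables. All function names, scores, and positions below are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Set


def primary_cues(retrieved_sentences: Sequence[str]) -> Set[str]:
    """Take every word of the retrieved sentences as a primary semantic cue."""
    return {word for s in retrieved_sentences for word in s.lower().split()}


def comprehend(
    cues: Set[str],
    vocab: Iterable[str],
    relevance: Callable[[str], float],
    threshold: float = 0.5,
) -> Set[str]:
    """Toy stand-in for the semantic comprehender (assumption, not the paper's model).

    Drops cues scored below the threshold and adds vocabulary words that
    were not retrieved but are scored as relevant to the image.
    """
    kept = {w for w in cues if relevance(w) >= threshold}
    inferred = {w for w in vocab if w not in cues and relevance(w) >= threshold}
    return kept | inferred


def rank(words: Set[str], position: Callable[[str], int]) -> List[str]:
    """Toy stand-in for the semantic ranker: sort words by an assumed position."""
    return sorted(words, key=lambda w: (position(w), w))


# Hypothetical inputs: two retrieved sentences and made-up relevance scores.
retrieved = ["a dog runs on the grass", "a man throws a ball"]
scores = {"dog": 0.9, "frisbee": 0.8, "grass": 0.7,
          "runs": 0.3, "ball": 0.2, "man": 0.1}
order = {"dog": 0, "frisbee": 1, "grass": 2}

cues = primary_cues(retrieved)
semantic_words = comprehend(cues, scores.keys(), lambda w: scores.get(w, 0.0))
ordered = rank(semantic_words, lambda w: order.get(w, len(order)))
print(ordered)
```

In this toy run, "man" and "ball" are filtered out as irrelevant, "frisbee" is added even though no retrieved sentence mentions it, and the surviving words are sorted into the assumed linguistic order. In the architecture described above, this ordered sequence would then be combined with the visual tokens to drive sentence generation, a step the sketch does not cover.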