Studies of image captioning are shifting towards a fully end-to-end paradigm that leverages powerful visual pre-trained models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure in the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of prediction words to those with appropriate semantics by modeling the hierarchical textual semantics. Specifically, we design a novel embedding method called tree-structured prototype, which produces a set of hierarchical representative embeddings capturing the hierarchical semantic structure of the textual space. To incorporate such tree-structured prototypes into visual cognition, we further propose a progressive aggregation module that exploits semantic relationships between the image and the prototypes. Applying PTSN to an end-to-end captioning framework, extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance, with CIDEr scores of 144.2% (single model) and 146.5% (ensemble of 4 models) on the `Karpathy' split, and 141.4% (c5) and 143.9% (c40) on the official online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.
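To make the described pipeline more concrete, below is a minimal, illustrative PyTorch-style sketch (not the authors' released implementation) of how a progressive aggregation step between visual features and hierarchical prototypes could look. All class names, dimensions, and the three-level prototype sizes are assumptions made for illustration only.

```python
# Illustrative sketch only: one plausible reading of progressive aggregation
# between visual grid features and coarse-to-fine prototype embeddings.
# Names and hyperparameters are hypothetical, not taken from the PTSN code.
import torch
import torch.nn as nn


class ProgressiveAggregation(nn.Module):
    """Fuses visual features with prototype embeddings level by level."""

    def __init__(self, dim: int = 512, num_heads: int = 8, num_levels: int = 3):
        super().__init__()
        # One cross-attention block per hierarchy level (coarse -> fine).
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_levels))

    def forward(self, visual_feats, prototypes_per_level):
        # visual_feats: (B, N, dim) grid features from a visual backbone.
        # prototypes_per_level: list of (B, K_l, dim) tensors, coarse to fine.
        x = visual_feats
        for attn, norm, protos in zip(self.cross_attn, self.norms, prototypes_per_level):
            # Visual tokens query the prototypes of the current level; the
            # residual update progressively injects hierarchical semantics.
            attended, _ = attn(query=x, key=protos, value=protos)
            x = norm(x + attended)
        return x  # semantics-enriched visual features for the caption decoder


if __name__ == "__main__":
    B, N, dim = 2, 49, 512
    feats = torch.randn(B, N, dim)
    # Hypothetical prototype counts per level, e.g. 16 coarse -> 64 fine.
    protos = [torch.randn(B, k, dim) for k in (16, 32, 64)]
    fused = ProgressiveAggregation(dim)(feats, protos)
    print(fused.shape)  # torch.Size([2, 49, 512])
```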