Existing image captioning models are usually trained with cross-entropy (XE) loss and reinforcement learning (RL), which set ground-truth words as hard targets and force the captioning model to learn from them. However, these widely adopted training strategies suffer from misalignment in XE training and inappropriate reward assignment in RL training. To tackle these problems, we introduce a teacher model that serves as a bridge between the ground-truth caption and the caption model by generating easier-to-learn word proposals as soft targets. The teacher model is constructed by incorporating the ground-truth image attributes into the baseline caption model. To learn effectively from the teacher model, we propose Teacher-Critical Training Strategies (TCTS) for both XE and RL training to facilitate better learning processes for the caption model. Experimental evaluations of several widely adopted caption models on the benchmark MSCOCO dataset show that the proposed TCTS comprehensively improves most evaluation metrics, especially the Bleu and Rouge-L scores, in both training stages. TCTS achieves the best published single-model Bleu-4 and Rouge-L performances to date, 40.2% and 59.4% respectively, on the MSCOCO Karpathy test split. Our code and pre-trained models will be open-sourced.
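The soft-target idea above can be illustrated with a distillation-style XE loss: instead of training only against the one-hot ground-truth word, the student's target distribution is a mixture of the hard target and the teacher's word-proposal distribution. The sketch below is a minimal NumPy illustration of this general principle, not the paper's exact formulation; the mixing weight `alpha` and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_xe(logits, hard_idx, teacher_probs, alpha=0.5):
    """Cross-entropy of the student's predictions against a mixture of
    the one-hot ground-truth word and the teacher's proposal distribution.
    logits: (T, V) student logits over the vocabulary at each step.
    hard_idx: (T,) ground-truth word indices.
    teacher_probs: (T, V) teacher's soft targets (rows sum to 1).
    alpha: interpolation weight; alpha=0 recovers standard XE training.
    """
    probs = softmax(logits)
    vocab_size = logits.shape[-1]
    hard = np.eye(vocab_size)[hard_idx]          # one-hot hard targets
    target = (1.0 - alpha) * hard + alpha * teacher_probs
    return float(-(target * np.log(probs + 1e-12)).sum(axis=-1).mean())
```

With `alpha=0` this reduces to ordinary XE against the ground-truth words; increasing `alpha` shifts weight toward the teacher's easier-to-learn proposals.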