For an image with multiple scene texts, different people may be interested in different text information. Current text-aware image captioning models cannot generate distinct captions tailored to such varying information needs. To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this task requires models to understand questions, find related scene texts, and describe them together with objects fluently in human language. Based on two existing text-aware captioning datasets, we automatically construct two datasets, ControlTextCaps and ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level object features and region-level scene text features while considering their spatial relationships. Then, we design a Question-guided Encoder to select the most relevant visual features for each question. Finally, GQAM generates a personalized text-aware caption with a Multimodal Decoder. Our model achieves better captioning performance and question answering ability than carefully designed baselines on both datasets. With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model. Our code and datasets are publicly available at https://github.com/HAWLYQ/Qc-TextCap.
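To make the question-guided selection step concrete, below is a minimal PyTorch sketch of one plausible realization: question token features attend over the fused visual (object + scene-text) features via cross-attention, and the attention weights indicate which regions are relevant to the question. All class names, dimensions, and the choice of `nn.MultiheadAttention` are illustrative assumptions, not the paper's actual GQAM implementation.

```python
import torch
import torch.nn as nn

class QuestionGuidedEncoder(nn.Module):
    """Hypothetical sketch of question-guided visual feature selection.

    Question tokens act as queries over fused region features, so the
    output emphasizes the regions most relevant to the question. The
    real Question-guided Encoder in GQAM may differ in detail.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: question tokens attend over visual features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, question_feats: torch.Tensor, visual_feats: torch.Tensor):
        # question_feats: (B, Lq, dim) encoded question tokens
        # visual_feats:   (B, Lv, dim) fused object + scene-text features
        attended, weights = self.cross_attn(
            query=question_feats, key=visual_feats, value=visual_feats
        )
        # `weights` (B, Lq, Lv) scores each region's relevance per token.
        return attended, weights


# Usage with dummy inputs (shapes are assumptions for illustration).
enc = QuestionGuidedEncoder()
q = torch.randn(2, 10, 512)   # 2 questions, 10 tokens each
v = torch.randn(2, 36, 512)   # 36 fused region features per image
out, attn = enc(q, v)
print(out.shape, attn.shape)  # (2, 10, 512), (2, 10, 36)
```

The attended question-conditioned features would then feed the Multimodal Decoder, which generates the caption token by token.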