Images have become an integral part of online media. This has enhanced self-expression and the dissemination of knowledge, but it poses serious accessibility challenges. Adequate textual descriptions are rare. Captions are more abundant, but they do not consistently provide the needed descriptive details, and systems trained on such texts inherit these shortcomings. To address this, we introduce the publicly available Wikipedia-based corpus Concadia, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context. We use Concadia to further characterize the commonalities and differences between descriptions and captions, and this leads us to the hypothesis that captions, while not substitutes for descriptions, can provide a useful signal for creating effective descriptions. We substantiate this hypothesis by showing that image captioning systems trained on Concadia benefit from having caption embeddings as part of their inputs. These experiments also begin to show how Concadia can be a powerful tool in addressing the underlying accessibility issues posed by image data.