We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal of this approach is to learn useful image representations by taking advantage of the rich information contained in the free text that describes the findings in the image. Our method trains image and text encoders jointly by encouraging the resulting local representations to exhibit high mutual information, and it builds on recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information terms is typically a lower bound on the global mutual information. Our experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning.
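To make the objective concrete, below is a minimal NumPy sketch of one common way to estimate a sum of local mutual information lower bounds: an InfoNCE-style critic that scores each local image feature (e.g., a spatial patch embedding) against pooled text features across a batch, treating matched image-text pairs as positives. The function name, mean pooling of text tokens, cosine-similarity critic, and temperature value are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def _logsumexp(x, axis):
    # numerically stable log-sum-exp along the given axis
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def local_infonce_bound(img_feats, txt_feats, temperature=0.1):
    """Sum over image patches of an InfoNCE lower bound on the MI
    between each local image feature and the (pooled) text feature.

    img_feats: (B, P, D) array, P local features per image.
    txt_feats: (B, T, D) array, T token features per text report.
    Hypothetical sketch; the critic and pooling are assumptions.
    """
    B, P, D = img_feats.shape
    # pool text tokens to one vector per report (simple mean pooling)
    txt = txt_feats.mean(axis=1)                               # (B, D)
    # l2-normalize so the critic is a scaled cosine similarity
    img = img_feats / np.linalg.norm(img_feats, axis=-1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=-1, keepdims=True)
    total = 0.0
    for p in range(P):
        logits = img[:, p, :] @ txt.T / temperature            # (B, B)
        # InfoNCE: matched image-text pairs lie on the diagonal
        log_probs = logits - _logsumexp(logits, axis=1)
        # each local bound is at most log(B) nats
        total += float(np.mean(np.diag(log_probs))) + np.log(B)
    return total
```

In a training loop, the encoders would be updated to increase this quantity; here the bound for independent random features stays near zero, while perfectly aligned features push each local term toward its log(B) ceiling.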