Web Image Context Extraction (WICE) consists of obtaining the textual information that describes an image from the content of its surrounding webpage. A common preprocessing step before performing WICE is to render the content of the webpage. At large scale (e.g., for search engine indexing), this rendering becomes computationally costly (up to several seconds per page). To avoid this cost, we introduce a novel WICE approach that combines Graph Neural Networks (GNNs) and Natural Language Processing models. Our method relies on a graph model whose features include both node types and text. The model is fed through several GNN blocks to extract the textual context. Since no labeled WICE dataset with ground truth exists, we train and evaluate the GNNs on a proxy task that consists of finding the text semantically closest to the image caption. We then interpret importance weights to find the most relevant text nodes and define them as the image context. Thanks to GNNs, our model is able to encode both structural and semantic information from the webpage. We show that our approach yields promising results toward addressing the large-scale WICE problem using only HTML data.
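The proxy task described above can be illustrated with a minimal sketch: among a page's text nodes, select the one whose embedding is closest (by cosine similarity) to the image caption's embedding. The toy vectors, node ids, and helper names below are hypothetical; in practice the embeddings would come from a sentence encoder, which is an assumption and not necessarily the paper's exact setup.

```python
# Sketch of the proxy-task labeling: pick the text node semantically
# closest to the image caption. Embeddings are toy 3-d vectors here.
from math import sqrt

def cosine(a, b):
    # Standard cosine similarity; returns 0.0 for zero-norm vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def closest_text_node(caption_vec, text_nodes):
    """text_nodes: list of (node_id, embedding) pairs; returns the id
    of the node with the highest similarity to the caption embedding."""
    return max(text_nodes, key=lambda n: cosine(caption_vec, n[1]))[0]

# Hypothetical example: "p2" points in nearly the same direction
# as the caption vector, so it is chosen as the proxy label.
caption = [1.0, 0.0, 1.0]
nodes = [("p1", [0.0, 1.0, 0.0]),
         ("p2", [0.9, 0.1, 0.8]),
         ("p3", [0.2, 0.9, 0.1])]
print(closest_text_node(caption, nodes))  # → p2
```

This nearest-text label then serves as the training target for the GNN, which additionally sees the page's structure rather than the embeddings alone.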