WebSRC: A Dataset for Web-Based Structural Reading Comprehension

Web search is an essential way for humans to obtain information, but understanding the contents of web pages remains a great challenge for machines. In this paper, we introduce the task of structural reading comprehension (SRC) on the web. Given a web page and a question about it, the task is to find the answer in the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 400K question-answer pairs collected from 6.4K web pages. Along with the QA pairs, the corresponding HTML source code, screenshots, and metadata are also provided in our dataset. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and baselines are publicly available at https://x-lance.github.io/WebSRC/.
