We describe the development, characteristics, and availability of a test collection for the task of Web table retrieval, built on large-scale Web Table Corpora extracted from the Common Crawl. Since a Web table usually has rich context information, such as the page title and surrounding paragraphs, we provide relevance judgments not only for query-table pairs but also for pairs of a query and a table's context, which previous test collections ignore. To facilitate future research with this benchmark, we describe how the dataset is pre-processed and report baseline results from both traditional and recently proposed table retrieval methods. Our experimental results show that proper usage of context labels can benefit previous table retrieval methods.