tieval: An Evaluation Framework for Temporal Information Extraction Systems

Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Although this abundance of corpora is beneficial, it also makes benchmarking TIE systems difficult. On the one hand, different datasets follow different annotation schemes, which hinders the comparison of competing systems across corpora. On the other hand, each corpus is commonly distributed in a different format, so a researcher or practitioner must invest considerable engineering effort to develop parsers for all of them. This constraint forces researchers to evaluate their systems on a limited number of datasets, which in turn limits the comparability of those systems. Yet another obstacle to comparability is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and $F_1$, a few others prefer temporal awareness, a metric tailored to evaluate temporal systems more comprehensively. Although it is not clear why temporal awareness is absent from the evaluation of most systems, one factor that certainly weighs on this decision is that computing it requires the temporal closure algorithm, which is neither straightforward to implement nor readily available. All in all, these problems have limited the fair comparison between approaches and, consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
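To make the dependence of temporal awareness on temporal closure concrete, the following is a minimal sketch in plain Python. It is not tieval's API: the function names `temporal_closure` and `temporal_awareness` are hypothetical, the only relation modelled is BEFORE (so closure reduces to transitive closure), and the graph-reduction step of the full metric is omitted. A real implementation would have to handle the complete set of temporal relations.

```python
from itertools import product


def temporal_closure(before_pairs):
    """Transitive closure of a set of (a, b) pairs, each meaning 'a BEFORE b'.

    Simplifying assumption: BEFORE is the only relation type considered.
    """
    closure = set(before_pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(tuple(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))  # a BEFORE b and b BEFORE d imply a BEFORE d
                changed = True
    return closure


def temporal_awareness(system, reference):
    """Simplified temporal awareness (hypothetical helper, no graph reduction).

    Precision: share of system relations found in the closure of the reference.
    Recall: share of reference relations found in the closure of the system.
    """
    system, reference = set(system), set(reference)
    precision = (
        len(system & temporal_closure(reference)) / len(system) if system else 0.0
    )
    recall = (
        len(reference & temporal_closure(system)) / len(reference) if reference else 0.0
    )
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The reference annotates e1 < e2 and e2 < e3; the system predicts e1 < e2 and e1 < e3.
reference = {("e1", "e2"), ("e2", "e3")}
system = {("e1", "e2"), ("e1", "e3")}
precision, recall, f1 = temporal_awareness(system, reference)
print(precision, recall, round(f1, 3))
```

In this example, exact-match precision would count the predicted `("e1", "e3")` as an error because it is not annotated explicitly, whereas the closure-based precision credits it because it follows from the reference annotations. Recall stays at one half, since `("e2", "e3")` cannot be inferred from the system output.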
