tieval: An Evaluation Framework for Temporal Information Extraction Systems

Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades, leading to the development of a significant number of datasets. Despite its benefits, this abundance of corpora makes it difficult to benchmark TIE systems. On the one hand, different datasets follow different annotation schemes, which hinders the comparison of competing systems across corpora. On the other hand, each corpus is commonly disseminated in a different format, so a researcher or practitioner must invest considerable engineering effort to develop parsers for all of them. This constraint forces researchers to evaluate their systems on a limited number of datasets, which in turn limits the comparability of those systems. Yet another obstacle to the comparability of TIE systems is the evaluation metric employed. While most research works adopt traditional metrics such as precision, recall, and $F_1$, a few others prefer temporal awareness, a metric tailored to evaluate temporal systems more comprehensively. Although it is not clear why temporal awareness is absent from the evaluation of most systems, one factor that certainly weighs on this decision is the need to implement the temporal closure algorithm in order to compute it, an algorithm that is neither straightforward to implement nor readily available. All in all, these problems have limited the fair comparison between approaches and, consequently, the development of temporal extraction systems. To mitigate these problems, we have developed tieval, a Python library that provides a concise interface for importing different corpora and facilitates system evaluation. In this paper, we present the first public release of tieval and highlight its most relevant features.
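To illustrate why temporal awareness depends on temporal closure, the following minimal sketch computes the metric for a deliberately simplified setting. It assumes that every relation is a pair `(a, b)` meaning "a BEFORE b", so closure reduces to transitive closure; a full implementation would have to reason over the complete set of interval relations. The function names `closure`, `reduction`, and `temporal_awareness` are hypothetical and are not the tieval API. Precision is taken as the share of the reduced system graph found in the closed reference graph, and recall as the share of the reduced reference graph found in the closed system graph.

```python
from itertools import product


def closure(edges):
    """Transitive closure of a set of (a, b) pairs meaning 'a BEFORE b'."""
    closed = set(edges)
    changed = True
    while changed:
        changed = False
        # Chain every pair of known relations: a < b and b < d imply a < d.
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed


def reduction(edges):
    """Drop every relation that can be inferred from the remaining ones."""
    reduced = set(edges)
    for edge in sorted(edges):
        if edge in closure(reduced - {edge}):
            reduced.discard(edge)
    return reduced


def temporal_awareness(system, reference):
    """Return (precision, recall, f1) under the BEFORE-only simplification."""
    sys_min, ref_min = reduction(system), reduction(reference)
    sys_closed, ref_closed = closure(system), closure(reference)

    precision = len(sys_min & ref_closed) / len(sys_min) if sys_min else 0.0
    recall = len(ref_min & sys_closed) / len(ref_min) if ref_min else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The reference annotates e1 < e2 and e2 < e3; the system only predicts e1 < e3.
reference = {("e1", "e2"), ("e2", "e3")}
system = {("e1", "e3")}
print(temporal_awareness(system, reference))
```

In this example the system's single relation never appears verbatim in the reference, so exact-match precision would be 0, whereas the closure-based precision is 1.0 because `e1 < e3` follows from the reference annotations. Recall remains 0.0, since neither reference relation can be inferred from the system output.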
