MUSIED: A Benchmark for Event Detection from Multi-Source Heterogeneous Informal Texts

Event detection (ED) identifies and classifies event triggers in unstructured text and is a fundamental task in information extraction. Despite the remarkable progress achieved in the past several years, most research efforts focus on detecting events from formal texts (e.g., news articles, Wikipedia documents, financial announcements). Moreover, the texts in each dataset come either from a single source or from multiple yet relatively homogeneous sources. With massive amounts of user-generated text accumulating on the Web and inside enterprises, identifying meaningful events in these informal texts, which usually come from multiple heterogeneous sources, has become a problem of significant practical value. As a pioneering exploration that extends event detection to scenarios involving informal and heterogeneous texts, we propose a new large-scale Chinese event detection dataset based on user reviews, text conversations, and phone conversations on a leading e-commerce platform for food service. We carefully investigate the proposed dataset's textual informality and multi-source heterogeneity by inspecting data samples quantitatively and qualitatively. Extensive experiments with state-of-the-art event detection methods verify the unique challenges posed by these characteristics, indicating that multi-source informal event detection remains an open problem that requires further effort. Our benchmark and code are released at https://github.com/myeclipse/MUSIED.
