MUG: A General Meeting Understanding and Generation Benchmark

Listening to long video or audio recordings of video conferences and online courses is an extremely inefficient way to acquire information. Even after ASR systems transcribe the recordings into long-form spoken language documents, reading the ASR transcripts only partly speeds up information seeking. A range of NLP applications, such as keyphrase extraction, topic segmentation, and summarization, have been observed to significantly improve users' efficiency in grasping important information. The meeting scenario is among the most valuable scenarios for deploying these spoken language processing (SLP) capabilities. However, the lack of large-scale public meeting datasets annotated for these SLP tasks severely hinders their advancement. To promote SLP advancement, we establish a large-scale general Meeting Understanding and Generation Benchmark (MUG) that benchmarks performance on a wide range of SLP tasks: topic segmentation, topic-level and session-level extractive summarization, topic title generation, keyphrase extraction, and action item detection. To support the MUG benchmark, we construct and release a large-scale meeting dataset for comprehensive long-form SLP development, the AliMeeting4MUG Corpus. It consists of 654 recorded Mandarin meeting sessions with diverse topic coverage, with manual annotations for the SLP tasks made on manual transcripts of the meeting recordings. To the best of our knowledge, the AliMeeting4MUG Corpus is so far the largest meeting corpus in scale and supports the most SLP tasks. In this paper, we give a detailed introduction to the corpus, the SLP tasks and their evaluation methods, and the baseline systems and their performance.
