Given the increasing number of livestreaming videos, automatic speech recognition and post-processing of livestreaming video transcripts are crucial for efficient data management and knowledge mining. A key step in this process is punctuation restoration, which recovers fundamental text structures such as phrase and sentence boundaries from the video transcripts. This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts. Our experiments on BehancePR demonstrate the challenges of punctuation restoration in this domain. Furthermore, we show that popular natural language processing toolkits are incapable of detecting sentence boundaries in non-punctuated transcripts of livestreaming videos, calling for more research effort to develop robust models for this area.
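To make the task concrete, punctuation restoration is often framed as token-level labeling: each word in an unpunctuated transcript receives a label naming the punctuation mark that follows it. The sketch below illustrates that framing only. The label names (`COMMA`, `PERIOD`, `QUESTION`, `O`), the helper functions, and the sample sentence are assumptions made for illustration; they are not the BehancePR annotation scheme or any model from this work.

```python
# Hypothetical label inventory, chosen for illustration only.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
LABEL_TO_PUNCT = {label: mark for mark, label in PUNCT_LABELS.items()}


def to_tagging_example(text):
    """Turn punctuated text into lowercase tokens plus one label per token."""
    tokens, labels = [], []
    for word in text.split():
        # A trailing mark becomes the label; the rest is the token.
        mark = word[-1] if word[-1] in PUNCT_LABELS else ""
        core = word[:-1] if mark else word
        if not core:
            continue  # skip a stray punctuation mark with no word attached
        tokens.append(core.lower())
        labels.append(PUNCT_LABELS.get(mark, "O"))
    return tokens, labels


def restore(tokens, labels):
    """Rebuild punctuated text from tokens and (gold or predicted) labels."""
    words = []
    capitalize = True  # capitalize the first word and each sentence start
    for token, label in zip(tokens, labels):
        word = token[:1].upper() + token[1:] if capitalize else token
        words.append(word + LABEL_TO_PUNCT.get(label, ""))
        capitalize = label in ("PERIOD", "QUESTION")
    return " ".join(words)


if __name__ == "__main__":
    sample = "Okay, so today we are drawing. Any questions?"
    toks, labs = to_tagging_example(sample)
    print(list(zip(toks, labs)))
    print(restore(toks, labs))
```

Here `to_tagging_example` produces the kind of input a restoration model would see (lowercase tokens with no punctuation) together with the labels it would have to predict, and `restore` shows how predicted labels map back to readable text.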