Web 2.0 has brought with it a wealth of user-generated data revealing people's thoughts, experiences, and knowledge, which are a valuable source for many tasks, such as information extraction and knowledge base construction. However, the colloquial nature of these texts poses new challenges for current natural language processing techniques, which are better adapted to formal language. Ellipsis is a common linguistic phenomenon in which words are left out because they can be understood from the context, especially in spoken utterances. It hinders progress in dependency parsing, which is of great importance for tasks that rely on the meaning of the sentence. To promote research in this area, we are releasing a Chinese dependency treebank of 319 Weibo posts, containing 572 sentences with omissions restored and contexts preserved.