Research on Methods for Quantifying and Improving the Quality of Comparable Corpora

Project title: Research on Methods for Quantifying and Improving the Quality of Comparable Corpora

Project number: No. 61300144

Project type: Young Scientists Fund project

Year of approval: 2014

Discipline: Automation technology, computer technology

Project author: Li Bo (李波)

Author affiliation: Central China Normal University (华中师范大学)

Project funding: 230,000 yuan

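Chinese abstract (translated): Given the scarcity of parallel corpora in certain domains and language pairs, comparable corpora have attracted researchers' attention in recent years and have been applied successfully to a variety of tasks. Most existing work on mining knowledge from comparable corpora focuses on optimizing the mining algorithms, whose development has hit a bottleneck and is hard to improve further. Raising the quality of comparable corpora in order to improve mining performance indirectly is an intuitive idea, yet most existing work overlooks the quality differences among comparable corpora and their effect on application performance. This project therefore studies systematically how to quantify, evaluate, and improve the quality of comparable corpora, and how that quality affects practical applications. For quality quantification, the comparability measure combines external lexical features with internal topic-relatedness features. For evaluating comparability measures, we design benchmark test corpora that resemble real corpora and have quantifiable comparability levels, together with performance metrics. For improving corpus quality, the project adopts an efficient hierarchical clustering strategy and a sub-cluster selection method. Finally, the application part uses bilingual lexicon extraction and cross-language information retrieval tasks to test the effectiveness of the overall strategy. The project is of significant value for revealing the importance of comparable corpus quality, for the design and evaluation of comparability measures, and for improving the performance of related applications.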

Chinese keywords (translated): comparable corpora; corpus quality; information retrieval

English abstract: Recently, researchers have paid much attention to comparable corpora, which can be used in various NLP tasks in restricted domains or language pairs where parallel corpora do not exist in high volume. Previous work on mining knowledge from comparable corpora has mostly tried to improve the mining algorithms themselves and has met with a bottleneck in terms of performance. It is intuitive that the performance of NLP tasks should benefit from better quality of comparable corpora, a fact that has been largely ignored in existing work. We thus plan to investigate in this project approaches for measuring, evaluating and enhancing comparable corpus quality, as well as the impact of those approaches on NLP tasks. To be exact, we will make use of the external word feature and the internal topic feature to measure the corpus quality. In order to evaluate the comparability measure, several performance metrics will be developed on top of test corpora with gold-standard comparability levels that resemble corpora in real-world applications. Efficient strategies will then be designed for hierarchical clustering and sub-cluster selection so as to enhance the quality of existing corpora. Lastly, the application part will rely on such applications as bilingual lexicon extraction and cross-language information retrieval (CLIR) to validate the overall idea of the project.

English keywords: comparable corpora; corpus quality; information retrieval
