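Wikipedia (Wikipedia.org) is a global, multilingual encyclopedia collaboration project built on Wiki technology, and also an online encyclopedia website hosted on the Internet. Its goal and mission is to provide a free encyclopedia for all of humanity. It currently ranks sixth in the Alexa global website ranking.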

VIP content

Tutorial title: Wikipedia as a Resource for Text Analysis and Retrieval

Tutorial overview

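The articles that Wikipedia's many volunteers contribute not only reflect the increasingly broad interests of the general public, or at least of Web users, but also quite possibly form the largest public, decentralized repository of unstructured or semi-structured knowledge to date. This tutorial examines the role of Wikipedia in tasks related to text analysis and retrieval. Text analysis tasks that draw on Wikipedia include coreference resolution, word sense and entity disambiguation, and information extraction.

In information retrieval, a better understanding of the structure and meaning of queries helps in matching queries against documents, clustering search results, and answering queries about popular entities through knowledge retrieval. The tutorial compares the characteristics, strengths, and weaknesses of Wikipedia with those of other human-curated knowledge resources; introduces the derived resources obtained by converting Wikipedia's semi-structured data into structured data; and describes the role that Wikipedia and its derived resources can play in text analysis and in enhancing information retrieval.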

Organizer:

Marius Pasca. His research interests are information retrieval and the Web, machine intelligence, and natural language processing.

acl19-tutorial-fordistrib.pdf

Latest papers

Language identification is a critical component of language processing pipelines (Jauhiainen et al., 2019) and is not a solved problem in real-world settings. We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of copious training data. The key idea for classification is that the reciprocal of the rank in a frequency table makes an effective additive feature score, hence the term Reciprocal Rank Classifier (RRC). The key finding for language classification is that ranked lists of words and frequencies of characters form a sufficient and robust representation of the regularities of key languages and their orthographies. We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set. When trained on Wikipedia but applied to Twitter the macro-averaged F1-score of a conventionally trained SVM classifier drops from 90.9% to 77.7%. By contrast, the macro F1-score of RRC drops only from 93.1% to 90.6%. These classifiers are compared with those from fastText and langid. The RRC performs better than these established systems in most experiments, especially on short Wikipedia texts and Twitter. The RRC classifier can be improved for particular domains and conversational situations by adding words to the ranked lists. Using new terms learned from such conversations, we demonstrate a further 7.9% increase in accuracy of sample message classification, and 1.7% increase for conversation classification. Surprisingly, this made results on Twitter data slightly worse. The RRC classifier is available as an open source Python package (https://github.com/LivePersonInc/lplangid).
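
The reciprocal-rank idea in the abstract can be illustrated with a minimal sketch. This is not the lplangid implementation and does not use its API: the function names (`build_ranks`, `score`, `classify`), the `top_n` cutoff, and the toy training sentences are all assumptions made for illustration, and the sketch uses only word ranks, leaving out the character-frequency features the abstract also mentions.

```python
from collections import Counter


def build_ranks(texts, top_n=1000):
    """Map each of the top_n most frequent words to its 1-based frequency rank."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = counts.most_common(top_n)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def score(text, ranks):
    """Add 1/rank for every word of `text` found in the ranked list."""
    return sum(1.0 / ranks[w] for w in text.lower().split() if w in ranks)


def classify(text, ranks_by_lang):
    """Return the language with the highest score, or None if no word matches."""
    scores = {lang: score(text, ranks) for lang, ranks in ranks_by_lang.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy training data, invented for illustration (not taken from the paper).
TRAIN = {
    "en": ["the cat sat on the mat", "the dog and the cat", "it is on the table"],
    "es": ["el gato y el perro", "el libro está en la mesa", "la casa y el gato"],
}
RANKS = {lang: build_ranks(texts) for lang, texts in TRAIN.items()}

print(classify("the dog is on the mat", RANKS))
print(classify("el perro está en la casa", RANKS))
```

Because the score is additive, adapting such a classifier to a new domain amounts to adding words to a language's ranked list, which is in the spirit of the improvement the abstract describes for conversational data.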
