[Recommended] A Guide to Natural Language Processing (NLP)

November 17, 2017 · Machine Learning Research Society (机器学习研究会)


Abstract
 

Reposted from: 网路冷眼

Natural Language Processing (NLP) comprises a set of techniques that can be used to achieve many different objectives. Take a look at the following table to figure out which technique can solve your particular problem.

| WHAT YOU NEED | WHERE TO LOOK |
|---|---|
| Grouping similar words for search | Stemming, Splitting Words, Parsing Documents |
| Finding words with the same meaning for search | Latent Semantic Analysis |
| Generating realistic names | Splitting Words |
| Estimating how long it takes to read a text | Reading Time |
| Measuring how difficult a text is to read | Readability of a Text |
| Identifying the language of a text | Identifying a Language |
| Generating a summary of a text | SumBasic (word-based), Graph-based Methods: TextRank (relationship-based), Latent Semantic Analysis (semantic-based) |
| Finding similar documents | Latent Semantic Analysis |
| Identifying entities (e.g., cities, people) in a text | Parsing Documents |
| Understanding the attitude expressed in a text | Parsing Documents |
| Translating a text | Parsing Documents |

We are going to talk about parsing in the general sense of analyzing a document and extracting its meaning. So, we are going to talk about the actual parsing of natural languages, but we will spend most of the time on other techniques. When it comes to understanding programming languages, parsing is the way to go; for natural languages, however, you can pick specific alternatives. In other words, we are mostly going to talk about what you would use instead of parsing to accomplish your goals.

For instance, if you wanted to find all the for statements in a programming language file, you would parse it and then count the for statements. In a natural language document, instead, you would probably use something like stemming to find all mentions of cats.
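To make the contrast concrete, here is a minimal sketch in Python. It assumes the programming language file is itself Python, so the standard `ast` module can parse it. The natural language side uses a crude regular expression that matches a word and its plain plural; this is only a stand-in for real stemming, and the names `count_for_statements` and `count_mentions` are hypothetical helpers invented for this illustration.

```python
import ast
import re

# A small piece of Python source used as the "programming language file".
SOURCE = '''
for i in range(3):
    for j in range(2):
        print(i, j)
while True:
    break
'''


def count_for_statements(source):
    """Parse Python source and count the `for` statements in its syntax tree."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.For) for node in ast.walk(tree))


def count_mentions(text, word):
    """Count `word` and its plain plural, ignoring case.

    This is a crude approximation of stemming, not a real stemmer.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"s?\b", re.IGNORECASE)
    return len(pattern.findall(text))


TEXT = "Cats sleep a lot. My cat is no exception; a category is not a cat."

print(count_for_statements(SOURCE))
print(count_mentions(TEXT, "cat"))
```

The first function relies on a full parse of the file, while the second never builds any structure for the sentence: it only normalizes word forms, which is the kind of shortcut this guide is about.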

This is necessary because, while the theory behind the parsing of natural languages might be the same as the one behind the parsing of programming languages, the practice is very different. In fact, you are not going to build a parser for a natural language, unless you work in artificial intelligence or as a researcher. You are rarely even going to use one. Rather, you are going to find an algorithm that works on a simplified model of the document and that solves only your specific problem.

In short, you are going to find tricks to avoid actually having to parse a natural language. That is why this area of computer science is usually called natural language processing rather than natural language parsing.


Algorithms That Require Data

We are going to see specific solutions to each problem. Mind you, these specific solutions can be quite complex themselves. The more advanced they are, the less they rely on simple algorithms; usually they need a vast database of data about the language. A logical consequence is that it is rarely easy to adapt a tool built for one language to another. Or rather, the tool might work with few adaptations, but building the database would require a lot of investment. So, for example, you would probably find a ready-to-use tool to create a summary of an English text, but maybe not one for an Italian text.

For this reason, in this article we concentrate mostly on English-language tools, although we mention when these tools work for other languages. You do not need to know the theoretical differences between languages, such as the number of genders or cases they have. However, you should be aware that the more different a language is from English, the harder it would be to apply these techniques or tools to it.

For example, you should not expect to find tools that can work with Chinese (or rather, the Chinese writing system). It is not necessarily that such languages are harder to understand programmatically, but there might be less research on them, or the methods might be completely different from the ones adopted for English.


The Structure of This Guide

This article is organized according to the tasks we want to accomplish, which means that the tools and explanations are grouped according to the task they are used for. For instance, there is a section about measuring the properties of a text, such as its difficulty. The sections are also generally in ascending order of difficulty: it is easier to classify words than entire documents. We start with simple information retrieval techniques and we end in the proper field of natural language processing.

We think this is the most useful way to provide the information you need: if you need to do X, we directly show the methods and tools you can use.


Table Of Contents

The following table of contents shows the whole content of this guide.

  1. Classifying Words

    • Stemming

    • Splitting Words

    • Grouping Similar Words

  2. Classifying Documents

    • Reading Time

    • Calculating the Readability of a Text

    • Text Metrics

    • Identifying a Language

  3. Understanding Documents

    • You Need Data

    • The Things You Can Do

    • The Libraries You Can Use

    • SumBasic

    • Graph-based Methods: TextRank

    • Latent Semantic Analysis

    • Other Methods and Libraries

    • Other Uses

    • Generation of Summaries

    • Parsing Documents

  4. Summary

Classifying Words

With the expression classifying words, we mean techniques and libraries that group words together.


Grouping Similar Words

We are going to talk about two methods that can group together similar words for the purpose of information retrieval. Basically, these are methods used to find the documents that contain the words we care about in a pool of all documents. That is useful because a user who searches for documents containing the word friend is probably equally interested in documents containing friends, and possibly friended and friendship.
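The friend example can be sketched with a toy suffix stripper in Python. This is not a real stemming algorithm (production stemmers use far richer rule sets); the suffix list and the names `toy_stem` and `group_by_stem` are assumptions made up for this illustration.

```python
from collections import defaultdict

# Hypothetical, tiny suffix list chosen only to cover the example words.
SUFFIXES = ("ship", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group words that share the same toy stem, preserving input order."""
    groups = defaultdict(list)
    for word in words:
        groups[toy_stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["friend", "friends", "friended", "friendship", "cat", "cats"])
for stem, members in groups.items():
    print(stem, members)
```

A search index built on the stems rather than on the raw words would return documents containing any member of a group when the user searches for one of them.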

So, to be clear, in this section we are not going to talk about methods to group semantically connected words, such as identifying all pets or all English towns.

The two methods are stemming and the division of words into groups of characters. The algorithms for the first method are language dependent, while the ones for the second are not. We are going to examine each of them in a separate section.
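As a rough illustration of the second method, the following Python sketch splits words into overlapping groups of three characters and compares them. The group size of three and the function names `char_ngrams` and `shared_ngrams` are assumptions chosen for this example, not something prescribed by the guide.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping groups of n characters."""
    word = word.lower()
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(first, second, n=3):
    """Return the character groups that two words have in common."""
    return set(char_ngrams(first, n)) & set(char_ngrams(second, n))


print(char_ngrams("friend"))
print(sorted(shared_ngrams("friend", "friendship")))
```

Nothing in this code knows anything about English: it works on any sequence of characters, which is why this approach does not depend on the language, unlike stemming.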


Link:

https://tomassetti.me/guide-natural-language-processing/


Original link:

https://m.weibo.cn/1715118170/4174604782463416
