Reposted from: 网路冷眼
Natural Language Processing (NLP) comprises a set of techniques that can be used to achieve many different objectives. Take a look at the following table to figure out which technique can solve your particular problem.
WHAT YOU NEED | WHERE TO LOOK |
---|---|
Grouping similar words for search | Stemming, Splitting Words, Parsing Documents |
Finding words with the same meaning for search | Latent Semantic Analysis |
Generating realistic names | Splitting Words |
Understanding how long it takes to read a text | Reading Time |
Understanding how difficult a text is to read | Readability of a Text |
Identifying the language of a text | Identifying a Language |
Generating a summary of a text | SumBasic (word-based), Graph-based Methods: TextRank (relationship-based), Latent Semantic Analysis (semantic-based) |
Finding similar documents | Latent Semantic Analysis |
Identifying entities (e.g., cities, people) in a text | Parsing Documents |
Understanding the attitude expressed in a text | Parsing Documents |
Translating a text | Parsing Documents |
We are going to talk about parsing in the general sense of analyzing a document and extracting its meaning. So we will cover the actual parsing of natural languages, but we will spend most of the time on other techniques. Parsing is the way to go when it comes to understanding programming languages, but for natural languages you can pick more specific alternatives. In other words, we are mostly going to talk about what you would use instead of parsing to accomplish your goals.
For instance, if you wanted to find all the for statements in a programming language file, you would parse it and then count the for statements. To find all mentions of cats in a natural language document, instead, you would probably use something like stemming.
This is necessary because, although the theory behind the parsing of natural languages might be the same as the one behind the parsing of programming languages, the practice is very different. In fact, you are not going to build a parser for a natural language, unless you work in artificial intelligence or as a researcher. You are rarely even going to use one. Rather, you are going to find an algorithm that works on a simplified model of the document and that can only solve your specific problem.
In short, you are going to find tricks to avoid actually having to parse a natural language. That is why this area of computer science is usually called natural language processing rather than natural language parsing.
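The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the programming language being parsed is Python itself (via the standard `ast` module), and the natural language side uses a deliberately crude plural-stripping step rather than a real stemmer. The function names `count_for_statements` and `count_mentions` are hypothetical, chosen for this example.

```python
import ast
import re


def count_for_statements(source: str) -> int:
    """Parse Python source and count the `for` statements in its syntax tree."""
    tree = ast.parse(source)
    return sum(isinstance(node, ast.For) for node in ast.walk(tree))


def count_mentions(text: str, word: str) -> int:
    """Count tokens equal to `word` after a toy normalization step.

    Assumption: stripping a trailing "s" from words longer than three
    letters is good enough for this example. It is not a real stemmer.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    normalized = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return normalized.count(word)


code = "for i in range(3):\n    for j in range(2):\n        print(i, j)\n"
text = "My cat likes other cats. Cats sleep a lot."

print(count_for_statements(code))   # number of `for` statements found by parsing
print(count_mentions(text, "cat"))  # number of mentions found without parsing
```

The first function relies on a complete grammar of the language, while the second one never builds any structure at all: it only normalizes words and compares them.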
We are going to see specific solutions to each problem. Keep in mind that these specific solutions can be quite complex themselves. The more advanced they are, the less they rely on simple algorithms. Usually they need a vast database of data about the language. A logical consequence is that it is rarely easy to adapt a tool built for one language to another one. Or rather, the tool might work with few adaptations, but building the database would require a lot of investment. So, for example, you would probably find a ready-to-use tool to create a summary of an English text, but maybe not one for an Italian text.
For this reason, in this article we concentrate mostly on English language tools, although we mention when these tools work for other languages. You do not need to know the theoretical differences between languages, such as the number of genders or cases they have. However, you should be aware that the more different a language is from English, the harder it will be to apply these techniques or tools to it.
For example, you should not expect to find tools that can work with Chinese (or rather, the Chinese writing system). It is not necessarily that such languages are harder to understand programmatically, but there might be less research on them, or the methods might be completely different from the ones adopted for English.
This article is organized according to the tasks we want to accomplish, which means that the tools and explanations are grouped according to the task they are used for. For instance, there is a section about measuring the properties of a text, such as its difficulty. The sections are also generally in ascending order of difficulty: it is easier to classify words than entire documents. We start with simple information retrieval techniques and we end in the proper field of natural language processing.
We think this is the most useful way to provide the information you need: if you need to do X, we directly show the methods and tools you can use.
The following table of contents shows the whole content of this guide.
Classifying Words
Stemming
Splitting Words
Grouping Similar Words
Classifying Documents
Reading Time
Calculating the Readability of a Text
Text Metrics
Identifying a Language
Understanding Documents
You Need Data
The Things You Can Do
The Libraries You Can Use
SumBasic
Graph-based Methods: TextRank
Latent Semantic Analysis
Other Methods and Libraries
Other Uses
Generation of Summaries
Parsing Documents
Summary
With the expression classifying words, we mean techniques and libraries that group words together.
We are going to talk about two methods that can group together similar words for the purpose of information retrieval. Basically, these are methods used to find the documents that contain the words we care about in a pool of all documents. That is useful because if a user searches for documents containing the word friend, they are probably equally interested in documents containing friends, and possibly friended and friendship.
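To make the friend example concrete, here is a minimal sketch of how grouping similar words helps a search. The suffix list and the names `toy_stem`, `build_index` and `search` are assumptions made up for this illustration. A real stemmer uses a much more careful set of rules.

```python
from collections import defaultdict

# Hypothetical suffix list for this example only, checked in this order.
SUFFIXES = ("ship", "ed", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents: dict) -> dict:
    """Map each stemmed word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[toy_stem(token.strip(".,;:!?"))].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of the documents that contain a word with the same stem."""
    return index.get(toy_stem(query), set())


documents = {
    "a": "My friends arrived.",
    "b": "She friended him.",
    "c": "A lasting friendship",
    "d": "No match here",
}
index = build_index(documents)
print(sorted(search(index, "friend")))
```

Both the documents and the query go through the same stemming step, so a search for friend also reaches the documents that only contain friends, friended or friendship.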
So, to be clear, in this section we are not going to talk about methods to group semantically connected words, such as identifying all pets or all English towns.
The two methods are stemming and the division of words into groups of characters. The algorithms for the first method are language dependent, while the ones for the second are not. We are going to examine each of them in separate paragraphs.
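As a preview of the second method, the division of a word into groups of characters can be sketched as follows. This is a minimal example, assuming overlapping groups of three characters. The names `char_ngrams` and `shared_ngrams` are hypothetical. Notice that the code contains no knowledge of any specific language.

```python
def char_ngrams(word: str, n: int = 3) -> list:
    """Split a word into overlapping groups of n characters."""
    word = word.lower()
    if len(word) < n:
        # Words shorter than n are kept whole.
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def shared_ngrams(first: str, second: str, n: int = 3) -> set:
    """Return the groups of characters that two words have in common."""
    return set(char_ngrams(first, n)) & set(char_ngrams(second, n))


print(char_ngrams("friend"))
print(sorted(shared_ngrams("friend", "friends")))
```

Two words that share many groups of characters are likely to be related, which is what makes this approach usable for search without any language-specific rule.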
Link:
https://tomassetti.me/guide-natural-language-processing/
Original link:
https://m.weibo.cn/1715118170/4174604782463416