[Rasa Series] Semantic Understanding (Part 2)

March 25, 2020, AINLP

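Rasa is an open source machine learning framework for building contextual AI assistants and chatbots. Rasa has two main modules:
Rasa NLU: performs semantic understanding of user messages, including intent classification and entity extraction, and turns the user's input into structured data.
Rasa Core: handles dialogue management, deciding which action should be taken next.
The previous article covered Rasa's basic operations, the training data format, some of the existing pipelines, and the supported languages. This article covers how to extract entities, which components Rasa supports, and how to combine them to better support a chatbot.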

Entity Extraction

Introduction

The following table lists the available extractors and what each one is used for:

Component	Requires	Model	Notes
`CRFEntityExtractor`	sklearn-crfsuite	conditional random field	good for training custom entities
`SpacyEntityExtractor`	spaCy	averaged perceptron	provides pre-trained entities
`DucklingHTTPExtractor`	running duckling	context-free grammar	provides pre-trained entities
`MitieEntityExtractor`	MITIE	structured SVM	good for training custom entities
`EntitySynonymMapper`	existing entities	N/A	maps known synonyms

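If the pipeline includes one or more of the components above, the model's output will include the extracted entities along with some metadata about the extraction. The `processors` field contains the names of the components that altered each entity. Here is an example response: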

    {
      "text": "show me chinese restaurants",
      "intent": "restaurant_search",
      "entities": [
        {
          "start": 8,
          "end": 15,
          "value": "chinese",
          "entity": "cuisine",
          "extractor": "CRFEntityExtractor",
          "confidence": 0.854,
          "processors": []
        }
      ]
    }
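As a rough sketch of how a client might consume such a response, the snippet below parses the JSON and checks that each entity's `start`/`end` offsets slice the original text back to its surface form. The helper name `entity_spans` is made up for illustration; only the field names come from the response above.

```python
import json

# The example response from above, as a client would receive it.
raw = """
{
  "text": "show me chinese restaurants",
  "intent": "restaurant_search",
  "entities": [
    {
      "start": 8,
      "end": 15,
      "value": "chinese",
      "entity": "cuisine",
      "extractor": "CRFEntityExtractor",
      "confidence": 0.854,
      "processors": []
    }
  ]
}
"""


def entity_spans(response):
    """Return (entity, value, text slice) for every extracted entity."""
    text = response["text"]
    return [
        (e["entity"], e["value"], text[e["start"]:e["end"]])
        for e in response["entities"]
    ]


response = json.loads(raw)
spans = entity_spans(response)
for name, value, surface in spans:
    # `end` is exclusive, so slicing the text reproduces the surface form.
    print(f"{name}: value={value!r}, surface={surface!r}")
```

Comparing `value` with the sliced text is a cheap sanity check; the two can legitimately differ once a processor such as `EntitySynonymMapper` has rewritten the value.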

Some extractors, such as duckling, may include additional information. For example:

    {
      "additional_info": {
        "grain": "day",
        "type": "value",
        "value": "2018-06-21T00:00:00.000-07:00",
        "values": [
          {
            "grain": "day",
            "type": "value",
            "value": "2018-06-21T00:00:00.000-07:00"
          }
        ]
      },
      "confidence": 1.0,
      "end": 5,
      "entity": "time",
      "extractor": "DucklingHTTPExtractor",
      "start": 0,
      "text": "today",
      "value": "2018-06-21T00:00:00.000-07:00"
    }

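Custom Entities

Almost every chatbot and voice app has some custom entities. A restaurant assistant should understand chinese as a cuisine, but for a language-learning assistant it means something very different. Given some training data, the `CRFEntityExtractor` component can learn custom entities in any language.

Extracting places, dates, people's names, and organizations: spaCy has pre-trained named entity recognizers for several languages. Note that some spaCy models are highly case-sensitive.

Dates, amounts of money, durations, distances, ordinals: the duckling library converts expressions such as "next Thursday at 8pm" into actual datetime objects: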

"next Thursday at 8pm" => {"value":"2018-05-31T20:00:00.000+01:00"}
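The `value` duckling returns is an ISO 8601 string, so a client can turn it into a timezone-aware object with the standard library alone. A minimal sketch, assuming Python 3.7 or later and using the value from the example above:

```python
from datetime import datetime, timezone

# Value taken from the duckling example above.
duckling_value = "2018-05-31T20:00:00.000+01:00"

# ISO 8601 string -> timezone-aware datetime (standard library only).
parsed = datetime.fromisoformat(duckling_value)

# The same instant expressed in UTC, handy for storage or comparison.
as_utc = parsed.astimezone(timezone.utc)

print(parsed.isoformat())
print(as_utc.isoformat())
print(parsed.weekday())  # Monday is 0, so Thursday is 3
```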

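Regular Expressions (regex)

You can use regular expressions to help the CRF model learn to recognize entities. In the training data you can provide a list of regular expressions; each one gives `CRFEntityExtractor` an extra binary feature that states whether the regex was found (1) or not (0). If you only want to match regular expressions exactly, you can do that in your own code as a post-processing step after receiving the response from Rasa NLU.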
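The post-processing route could look roughly like the sketch below. The zip-code pattern, the entity name `zipcode`, the helper `add_regex_entities`, and the `regex_postprocessing` extractor label are all assumptions made for illustration; none of them is part of Rasa.

```python
import re

# Hypothetical pattern and entity name, chosen only for illustration.
ZIPCODE = re.compile(r"\b[0-9]{5}\b")


def add_regex_entities(response, pattern, entity_name):
    """Return a copy of a parsed NLU response with one entity per exact match."""
    entities = list(response.get("entities", []))
    for match in pattern.finditer(response["text"]):
        entities.append({
            "start": match.start(),
            "end": match.end(),
            "value": match.group(0),
            "entity": entity_name,
            "extractor": "regex_postprocessing",  # made-up label
        })
    return {**response, "entities": entities}


response = {"text": "deliver to 10115 please", "entities": []}
result = add_regex_entities(response, ZIPCODE, "zipcode")
print(result["entities"])
```

Returning a new dictionary rather than mutating the response keeps the model's original output available for logging or comparison.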

Components

This is a reference of the configuration options for every built-in component in Rasa NLU.

Components	type
Word Vector Sources	MitieNLP SpacyNLP
Featurizers	MitieFeaturizer SpacyFeaturizer NGramFeaturizer RegexFeaturizer CountVectorsFeaturizer
Intent Classifiers	KeywordIntentClassifier MitieIntentClassifier SklearnIntentClassifier EmbeddingIntentClassifier
Selectors	Response Selector
Tokenizers	WhitespaceTokenizer JiebaTokenizer MitieTokenizer SpacyTokenizer
Entity Extractors	MitieEntityExtractor SpacyEntityExtractor EntitySynonymMapper CRFEntityExtractor DucklingHTTPExtractor

Word Vector Sources

MitieNLP

Short:	MITIE initializer
Outputs:	nothing
Requires:	nothing
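Description:	Initializes MITIE structures. Every MITIE component relies on this, so it should be placed at the beginning of every pipeline that uses any MITIE component.
Configuration:	The MITIE library needs a language model file, which must be specified in the configuration:

    pipeline:
    - name: "MitieNLP"
      # language model to load
      model: "data/total_word_feature_extractor.dat"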

SpacyNLP

Short:	spacy language initializer
Outputs:	nothing
Requires:	nothing
Description:	Initializes spaCy structures. Every spaCy component relies on this, so it should be placed at the beginning of every pipeline that uses any spaCy component.
Configuration:	The language model. By default the configured language is used. If the spaCy model you want to use has a name that differs from the language tag (`"en"`, `"de"`, etc.), the model name can be specified with this configuration variable. The name is passed to `spacy.load(name)`.

    pipeline:
    - name: "SpacyNLP"
      # language model to load
      model: "en_core_web_md"

      # when retrieving word vectors, this will decide if the casing
      # of the word is relevant. E.g. `hello` and `Hello` will
      # retrieve the same vector, if set to `false`. For some
      # applications and models it makes sense to differentiate
      # between these two words, therefore setting this to `true`.
      case_sensitive: false
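Because both initializers have to come before the components that depend on them, it can help to lint a pipeline before training. The sketch below is a hypothetical helper, not part of Rasa; it assumes that a component's dependency can be guessed from its `Mitie`/`Spacy` name prefix, which misses components such as `NGramFeaturizer` that need `SpacyNLP` without carrying the prefix.

```python
def check_initializers(pipeline):
    """List components that appear before the initializer their name implies."""
    # Assumption: the name prefix tells us which initializer is needed.
    required = {"Mitie": "MitieNLP", "Spacy": "SpacyNLP"}
    seen = set()
    problems = []
    for name in pipeline:
        for prefix, initializer in required.items():
            depends = name.startswith(prefix) and name != initializer
            if depends and initializer not in seen:
                problems.append(f"{name} needs {initializer} earlier in the pipeline")
        seen.add(name)
    return problems


good = ["SpacyNLP", "SpacyTokenizer", "SpacyFeaturizer", "SklearnIntentClassifier"]
bad = ["MitieTokenizer", "MitieNLP", "MitieEntityExtractor"]

print(check_initializers(good))
print(check_initializers(bad))
```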

Featurizers

MitieFeaturizer

Short:	MITIE intent featurizer
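Outputs:	nothing; used as input to intent classifiers (for example `SklearnIntentClassifier`)
Requires:	MitieNLP
Description:	Creates features for intent classification using the MITIE featurizer. Note: this is not used by the `MitieIntentClassifier` component. Currently, only `SklearnIntentClassifier` is able to use the precomputed features.
Configuration:

    pipeline:
    - name: "MitieFeaturizer"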

SpacyFeaturizer

Short:	spacy intent featurizer
Outputs:	nothing; used as input to intent classifiers (for example `SklearnIntentClassifier`)
Requires:	SpacyNLP
Description:	Creates features for intent classification using the spaCy featurizer.

NGramFeaturizer

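Short:	Appends character n-gram features to the feature vector
Outputs:	nothing; appends its features to an existing feature vector generated by another intent featurizer
Requires:	SpacyNLP
Description:	This featurizer appends character n-gram features to the feature vector. During training, the component looks for the most common character sequences (for example `app` or `ing`). The added features are boolean flags that indicate whether a character sequence is present in the word sequence. Note: another intent featurizer must come before this one in the pipeline!
Configuration:

    pipeline:
    - name: "NGramFeaturizer"
      # Maximum number of ngrams to use when augmenting
      # feature vectors with character ngrams
      max_number_of_ngrams: 10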
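The idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not Rasa's implementation: the function names, the fixed n-gram length, and the toy sentences are assumptions, and only the parameter name `max_number_of_ngrams` comes from the configuration.

```python
from collections import Counter


def char_ngrams(word, n):
    """All character n-grams of length n in a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def most_common_ngrams(sentences, n=3, max_number_of_ngrams=10):
    """Pick the most frequent character n-grams in the training sentences."""
    counts = Counter(
        gram
        for sentence in sentences
        for word in sentence.lower().split()
        for gram in char_ngrams(word, n)
    )
    return [gram for gram, _ in counts.most_common(max_number_of_ngrams)]


def ngram_flags(sentence, vocabulary, n=3):
    """One boolean flag per selected n-gram: is it present in the sentence?"""
    present = {
        gram
        for word in sentence.lower().split()
        for gram in char_ngrams(word, n)
    }
    return [gram in present for gram in vocabulary]


training = ["booking a table", "cooking and baking", "looking for apps"]
vocabulary = most_common_ngrams(training, n=3, max_number_of_ngrams=2)
flags = ngram_flags("walking home", vocabulary)
print(vocabulary, flags)
```

In a real pipeline these flags would be appended to the feature vector produced by the preceding intent featurizer rather than used on their own.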

RegexFeaturizer

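Short:	Creates regex features to support intent and entity classification
Outputs:	`text_features` and `tokens.pattern`
Requires:	nothing
Description:	During training, the regex featurizer builds a list of the regular expressions defined in the training data format. For each regex, a feature is set that marks whether the expression was found in the input. This feature is then fed into the intent classifier or entity extractor to simplify classification (assuming the classifier has learned during training that this feature indicates a certain intent). Regex features for entity extraction are currently supported only by the `CRFEntityExtractor` component! Note: a tokenizer must come before this featurizer in the pipeline!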
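A minimal sketch of what such features might look like, at message level and per token. The patterns, their names, the whitespace tokenization, and both helper functions are illustrative assumptions rather than Rasa's own code.

```python
import re

# Hypothetical regex definitions; names and patterns are illustrative only.
REGEXES = [
    {"name": "zipcode", "pattern": r"[0-9]{5}"},
    {"name": "greet", "pattern": r"\bhey\b"},
]


def message_features(text, regexes=REGEXES):
    """One binary feature per regex: 1 if it matches anywhere in the text."""
    return [1 if re.search(r["pattern"], text) else 0 for r in regexes]


def token_patterns(text, regexes=REGEXES):
    """Per whitespace token, flag which regexes have a match overlapping it."""
    tokens = []
    for token in re.finditer(r"\S+", text):
        flags = {
            r["name"]: any(
                hit.start() < token.end() and hit.end() > token.start()
                for hit in re.finditer(r["pattern"], text)
            )
            for r in regexes
        }
        tokens.append((token.group(0), flags))
    return tokens


text = "hey send it to 10115"
features = message_features(text)
patterns = token_patterns(text)
print(features)
print(patterns)
```

The per-token flags are the kind of information an entity extractor can pick up as a `pattern` feature, while the message-level vector suits an intent classifier.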

Intent Classifiers

KeywordIntentClassifier

Short:	Simple keyword-matching intent classifier
Outputs:	`intent`
Requires:	nothing
Output-Example:	`{ "intent": {"name": "greet", "confidence": 0.98343} }`
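Description:	This classifier is mostly used as a placeholder. It is able to recognize hello and goodbye intents by searching for these keywords in the passed messages.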
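Keyword matching of this kind fits in a few lines. The sketch below is not the Rasa component: the keyword lists, the intent names, and the fixed confidence values are assumptions chosen for illustration.

```python
# Hypothetical keyword lists; the text above only says that hello and
# goodbye intents are recognized by searching for keywords.
KEYWORDS = {
    "greet": ["hello", "hi", "hey"],
    "goodbye": ["goodbye", "bye"],
}


def classify(text):
    """Return the first intent whose keywords appear in the message."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    for intent, keywords in KEYWORDS.items():
        if any(word in keywords for word in words):
            return {"intent": {"name": intent, "confidence": 1.0}}
    # No keyword found: report no intent.
    return {"intent": {"name": None, "confidence": 0.0}}


print(classify("Hello there!"))
print(classify("ok bye"))
print(classify("book a table"))
```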

SklearnIntentClassifier

Short:	sklearn intent classifier
Outputs:	`intent` and `intent_ranking`
Requires:	A featurizer
Output-Example:	`{ "intent": {"name": "greet", "confidence": 0.78343}, "intent_ranking": [ { "confidence": 0.1485910906220309, "name": "goodbye" }, { "confidence": 0.08161531595656784, "name": "restaurant_search" } ] }`
Description:	The sklearn intent classifier trains a linear SVM that is optimized using a grid search. It needs to be preceded by a featurizer in the pipeline; that featurizer creates the features used for classification.
Configuration:	During SVM training, a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters that will be tried:

    pipeline:
    - name: "SklearnIntentClassifier"
      # Specifies the list of regularization values to
      # cross-validate over for C-SVM.
      # This is used with the ``kernel`` hyperparameter in GridSearchCV.
      C: [1, 2, 5, 10, 20, 100]
      # Specifies the kernel to use with C-SVM.
      # This is used with the ``C`` hyperparameter in GridSearchCV.
      kernels: ["linear"]
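To make the size of that search concrete, the sketch below enumerates the candidate settings the configuration implies, using only the standard library. The dictionary layout is an assumption; with scikit-learn installed, a grid like this would typically be handed to `GridSearchCV` as the parameter grid for an `SVC`.

```python
from itertools import product

# Values copied from the configuration above.
config = {"C": [1, 2, 5, 10, 20, 100], "kernels": ["linear"]}

# Every (C, kernel) combination is one candidate the grid search evaluates.
candidates = [
    {"C": c, "kernel": kernel}
    for c, kernel in product(config["C"], config["kernels"])
]

for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates in total")
```

Each extra kernel multiplies the number of candidates, and every candidate is trained once per cross-validation fold, so the lists are best kept short.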

Selectors

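Response Selector

Short:	Response Selector
Outputs:	a dictionary containing `response` and `ranking`
Requires:	A featurizer
Output-Example:

    {
      "text": "What is the recommend python version to install?",
      "entities": [],
      "intent": {"confidence": 0.6485910906220309, "name": "faq"},
      "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "faq"},
        {"confidence": 0.1416153159565678, "name": "greet"}
      ],
      "response_selector": {
        "faq": {
          "response": {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
          "ranking": [
            {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
            {"confidence": 0.2134543431, "name": "You can ask me about how to get started"}
          ]
        }
      }
    }

Description:	The response selector component can be used to build a response retrieval model that predicts the bot's response directly from a set of candidate responses. The prediction of this model is used by retrieval actions. It embeds user inputs and response labels into the same space and follows exactly the same neural network architecture and optimization as `EmbeddingIntentClassifier`. The response selector needs to be preceded by a featurizer in the pipeline; that featurizer creates the features used for the embeddings. `CountVectorsFeaturizer` is recommended, optionally preceded by `SpacyNLP`. Note: if, at prediction time, a message contains only words that were unseen during training and no out-of-vocabulary preprocessor was used, an empty response `None` is predicted with confidence `0.0`.
Configuration:	The algorithm includes all the hyperparameters that `EmbeddingIntentClassifier` uses. In addition, the component can be configured to train a response selector for a particular retrieval intent. `retrieval_intent` sets the name of the intent for which this response selector model is trained; the default is `None`. You can specify these parameters in the configuration.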
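On the client side, reading the selected response out of the parsed result is a matter of two dictionary lookups. The helper `best_response` below is hypothetical, and the result is an abbreviated copy of the output example above.

```python
def best_response(result, retrieval_intent):
    """Return (response text, confidence) for a retrieval intent, or None."""
    selector = result.get("response_selector", {}).get(retrieval_intent)
    if selector is None:
        return None
    top = selector["response"]
    return top["name"], top["confidence"]


# Abbreviated copy of the output example above.
result = {
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "response_selector": {
        "faq": {
            "response": {
                "confidence": 0.7356462617,
                "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6",
            },
            "ranking": [],
        }
    },
}

faq_answer = best_response(result, result["intent"]["name"])
missing = best_response(result, "chitchat")
print(faq_answer)
print(missing)
```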

Tokenizers

JiebaTokenizer

Short:	Tokenizer using Jieba for Chinese language
Outputs:	nothing
Requires:	nothing
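Description:	Tokenizer that uses Jieba, which is built specifically for Chinese. For languages other than Chinese, it behaves like `WhitespaceTokenizer`. It can be used to define tokens for the MITIE entity extractor. Install it with `pip install jieba`.
Configuration:	A user's custom dictionary files can be loaded automatically by specifying the directory that contains them via `dictionary_path`:

    pipeline:
    - name: "JiebaTokenizer"
      dictionary_path: "path/to/custom/dictionary/dir"

If `dictionary_path` is `None` (the default), no custom dictionary is used.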

Entity Extractors

CRFEntityExtractor

Short:	Conditional random field (CRF) entity extraction
Outputs:	appends `entities`
Requires:	A tokenizer
Output-Example:	`{ "entities": [{"value":"New York City", "start": 20, "end": 33, "entity": "city", "confidence": 0.874, "extractor": "CRFEntityExtractor"}] }`
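Description:	This component implements conditional random fields for named entity recognition. A CRF can be thought of as an undirected Markov chain in which the time steps are words and the states are entity classes. Features of the words (capitalization, POS tags, and so on) give probabilities to certain entity classes, as do the transitions between neighboring entity tags. The most likely set of tags is then calculated and returned. If POS features are used (`pos` or `pos2`), spaCy must be installed.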
Configuration:

    pipeline:
    - name: "CRFEntityExtractor"
      # The features are a ``[before, word, after]`` array with
      # before, word, after holding keys about which
      # features to use for each word, for example, ``"title"``
      # in array before will have the feature
      # "is the preceding word in title case?".
      # Available features are:
      # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
      # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
      # ``bias``, ``upper``, ``digit`` and ``pattern``
      features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]

      # The flag determines whether to use BILOU tagging or not. BILOU
      # tagging is more rigorous however
      # requires more examples per entity. Rule of thumb: use only
      # if more than 100 examples per entity.
      BILOU_flag: true

      # This is the value given to sklearn_crfsuite.CRF tagger before training.
      max_iterations: 50

      # This is the value given to sklearn_crfsuite.CRF tagger before training.
      # Specifies the L1 regularization coefficient.
      L1_c: 0.1

      # This is the value given to sklearn_crfsuite.CRF tagger before training.
      # Specifies the L2 regularization coefficient.
      L2_c: 0.1
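The `[before, word, after]` layout is easier to see with a small example. The sketch below builds one feature dictionary per token for a subset of the feature names; it is an illustration under assumptions, not the component's code. The key format (`-1:low`, `0:bias`, `1:upper`) and both function names are invented here, and `pos`, `pos2`, and `pattern` are left out because they need spaCy or regex input.

```python
# A subset of the feature names listed in the configuration comments.
FEATURE_FUNCTIONS = {
    "low": lambda token: token.lower(),
    "title": lambda token: token.istitle(),
    "upper": lambda token: token.isupper(),
    "digit": lambda token: token.isdigit(),
    "suffix3": lambda token: token[-3:],
    "bias": lambda token: "bias",
}


def token_features(token, names):
    """Compute the requested features for a single token."""
    return {name: FEATURE_FUNCTIONS[name](token) for name in names}


def sentence_features(tokens, features):
    """Build one feature dict per token from a [before, word, after] spec."""
    before, word, after = features
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset, names in ((-1, before), (0, word), (1, after)):
            j = i + offset
            # Skip neighbors that fall outside the sentence.
            if 0 <= j < len(tokens):
                for name, value in token_features(tokens[j], names).items():
                    row[f"{offset}:{name}"] = value
        rows.append(row)
    return rows


spec = [["low", "title"], ["bias", "suffix3"], ["upper"]]
rows = sentence_features(["fly", "to", "New", "York"], spec)
for row in rows:
    print(row)
```

A list of dictionaries like `rows`, one list per sentence, is the general shape of input that CRF toolkits such as sklearn-crfsuite work with, alongside the per-token entity tags.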

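About the Translator

Wang Wenbin graduated from the University of Chinese Academy of Sciences in 2018. After graduating he joined the Language Intelligence and Search department at Beike Zhaofang, where he works mainly on NLP, reinforcement learning, and search and recommendation.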
