The structure of the neural network
A neuron can be a binary logistic regression unit
b: We can have an “always on” feature, which gives a class prior, or separate it out as a bias term. We usually refer to b as the bias.
A neural network = running several logistic regressions at the same time
Matrix notation for a layer
Named Entity Recognition (NER)
The task: find and classify names in text, for example:
Tracking mentions of particular entities in documents
For question answering, answers are usually named entities
A lot of wanted information is really associations between named entities
The same techniques can be extended to other slot-filling classifications
Why might NER be hard?
Binary word window classification (a classifier over a small context window)
Ambiguity arises in context: the same word can have several meanings
Example 1: auto-antonyms
"To sanction" can mean "to permit" or "to punish"
"To seed" can mean "to place seeds" or "to remove seeds"
Example 2: resolving the linking of ambiguous named entities
Paris -> Paris, France vs. Paris Hilton vs. Paris, Texas
Hathaway -> Berkshire Hathaway vs. Anne Hathaway
Window classification: Softmax
Idea: classify a word in its context window of neighboring words
A simple way to classify a word in context might be to average the word vectors in a window and to classify the average vector, but averaging loses position information
Another approach: train a softmax classifier to classify a center word by taking the concatenation of the word vectors surrounding it in a window
For example: classify "Paris" in the context of this sentence with window length 2
Resulting vector $x_{window} = x \in \mathbb{R}^{5d}$, a column vector
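A minimal NumPy sketch of the two options above, averaging versus concatenation, for a window of length 2 (five words in total). The dimension d = 4, the random embeddings, and the helper name `window_vector` are assumptions made for illustration, and the sketch does not handle padding at sentence boundaries.

```python
import numpy as np

d = 4  # word-vector dimension (assumed for illustration)
sentence = ["museums", "in", "Paris", "are", "amazing"]

# Hypothetical embedding table: one random d-dimensional vector per word.
rng = np.random.default_rng(0)
embeddings = {word: rng.normal(size=d) for word in sentence}


def window_vector(words, center, size=2):
    """Concatenate the vectors of the center word and `size` neighbors per side."""
    window = words[center - size:center + size + 1]
    return np.concatenate([embeddings[w] for w in window])


# Concatenation keeps word order; averaging discards it.
x_window = window_vector(sentence, center=2)
x_average = np.mean([embeddings[w] for w in sentence], axis=0)

print(x_window.shape)   # 5 * d entries
print(x_average.shape)  # d entries
```

With concatenation, the slice `x_window[2 * d:3 * d]` is always the center word's vector, which is exactly the position information that the average throws away.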
Binary classification with unnormalized scores: the classifier assigns each window an unnormalized score
The positions that have an actual NER Location in their center are “true” positions and get a high score
Neural Network Feed-forward Computation
s = score("museums in Paris are amazing")
Main intuition for extra layer
The intermediate layer learns non-linear interactions between the input word vectors. Example: only if “museums” is the first vector should it matter that “in” is in the second position
The max-margin loss
Idea for training objective: make the true window’s score larger and the corrupt window’s score lower (until they’re good enough)
s = score(museums in Paris are amazing)
Each window with an NER location at its center should have a score +1 higher than any window without a location at its center
For the full objective function: sample several corrupt windows per true one and sum over all training windows, a procedure similar to negative sampling
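The full objective can be sketched as a sum of hinge terms max(0, 1 − s + s_c), one for each pairing of a true window with a sampled corrupt window. The function name `max_margin_loss`, the array layout, and the scores below are assumptions for illustration only.

```python
import numpy as np


def max_margin_loss(true_scores, corrupt_scores, margin=1.0):
    """Sum of max(0, margin - s + s_c) over all (true, corrupt) pairs.

    true_scores:    shape (n,), one score s per true window
    corrupt_scores: shape (n, k), k sampled corrupt scores s_c per true window
    """
    true_scores = np.asarray(true_scores, dtype=float)
    corrupt_scores = np.asarray(corrupt_scores, dtype=float)
    # Broadcasting pairs each true score with its k corrupt scores.
    hinge = np.maximum(0.0, margin - true_scores[:, None] + corrupt_scores)
    return hinge.sum()


# Two true windows, each paired with three corrupt windows (made-up scores).
s = [2.0, 0.5]
s_c = [[0.0, 1.5, 3.0],
       [-1.0, 0.0, 0.5]]
print(max_margin_loss(s, s_c))
```

Pairs in which the true window already wins by at least the margin contribute nothing, so only the remaining pairs would produce gradients.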
Example Jacobian: Elementwise activation Function
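For an elementwise activation h = f(z), each output h_i depends only on its own input z_i, so the Jacobian ∂h/∂z is the diagonal matrix diag(f'(z)). The sketch below checks this numerically for the sigmoid. The function names and the test vector are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def analytic_jacobian(z):
    """Jacobian of the elementwise sigmoid: diag(f'(z)) with f' = f * (1 - f)."""
    s = sigmoid(z)
    return np.diag(s * (1.0 - s))


def numerical_jacobian(f, z, eps=1e-6):
    """Centered-difference estimate of dh_i/dz_j, one column per input z_j."""
    n = z.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (f(z + step) - f(z - step)) / (2 * eps)
    return jac


z = np.array([-1.0, 0.0, 2.0])
print(np.allclose(analytic_jacobian(z), numerical_jacobian(sigmoid, z), atol=1e-8))
```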
note: Neural Networks, Backpropagation
Neural Networks: Foundations
A neuron
A neuron is a generic computational unit that takes n inputs and produces a single output. What differentiates the outputs of different neurons is their parameters (also referred to as their weights).
A single layer of neurons
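If we refer to the different neurons’ weights as $\{w^{(1)}, \dots, w^{(m)}\}$ and the biases as $\{b_1, \dots, b_m\}$, we can say the respective activations are $\{a_1, \dots, a_m\}$, with $a_i = \sigma(w^{(i)T} x + b_i)$ for a sigmoid neuron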
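A short sketch of a single layer, assuming sigmoid neurons and made-up sizes (n = 4 inputs, m = 3 neurons). Stacking the per-neuron weight vectors as rows of a matrix W gives all activations in one matrix product, which is the same as running m logistic regressions at once. The names `W`, `b`, and `a` are illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
n, m = 4, 3                      # n inputs, m neurons (assumed sizes)
x = rng.normal(size=n)           # input vector
W = rng.normal(size=(m, n))      # row i holds the weights of neuron i
b = rng.normal(size=m)           # one bias per neuron

# All m activations at once: a = sigmoid(W x + b)
a = sigmoid(W @ x + b)

# The same result, one neuron (one logistic regression) at a time.
a_loop = np.array([sigmoid(W[i] @ x + b[i]) for i in range(m)])

print(np.allclose(a, a_loop))
```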
feed-forward computation
"Museums in Paris are amazing"
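We want to decide whether the center word "Paris" is a named entity. In this case, we most likely want to capture not only the word vectors in the window but also some other interactions between the words in order to classify it. For instance, maybe it should matter that "Museums" is the first word only if "in" is the second word. Such non-linear decisions often cannot be captured by feeding the inputs directly into a softmax function; they require an intermediate layer to compute the score. We therefore use another matrix $U \in \mathbb{R}^{m \times 1}$, where m is the number of hidden units, to generate an unnormalized score from the hidden activations: $s = U^T a = U^T f(Wx + b)$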
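Analysis of Dimensions: If we represent each word using a 4-dimensional word vector and we use a 5-word window as input (as in the above example), then the input $x \in \mathbb{R}^{20}$. If we use 8 sigmoid units in the hidden layer and generate 1 score output from the activations, then the hidden-layer weights $W \in \mathbb{R}^{8 \times 20}$, the biases $b \in \mathbb{R}^{8}$, the scoring vector $U \in \mathbb{R}^{8 \times 1}$, and the score $s \in \mathbb{R}$.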
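The dimension analysis can be checked with a small forward pass. This is only a sketch with random parameters: the names `W`, `b`, and `U` (hidden weights, hidden biases, scoring vector) and the choice of sigmoid for the hidden units follow the example above, and the values themselves are made up.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


d, window, hidden = 4, 5, 8      # sizes from the example above

rng = np.random.default_rng(0)
x = rng.normal(size=d * window)            # concatenated window, shape (20,)
W = rng.normal(size=(hidden, d * window))  # hidden-layer weights, shape (8, 20)
b = rng.normal(size=hidden)                # hidden-layer biases, shape (8,)
U = rng.normal(size=hidden)                # scoring vector, shape (8,)

a = sigmoid(W @ x + b)   # hidden activations, shape (8,)
s = U @ a                # unnormalized scalar score

print(x.shape, W.shape, b.shape, U.shape, np.ndim(s))
```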
Maximum Margin Objective Function
We adopt the maximum margin objective function, which ensures that the score computed for "true" data is higher than the score computed for "false" data.
Notation: using the previous example, let $s$ be the score computed for the "true" labeled window "Museums in Paris are amazing" and $s_c$ the score computed for the "false" labeled window "Not all museums in Paris" (subscripted as c to signify that the window is "corrupt"). The objective is then to minimize $J = \max(s_c - s, 0)$, so that error is only computed when $s_c > s$.
However, the above optimization objective is risky in the sense that it does not attempt to create a margin of safety. We would want the "true" labeled data point to score higher than the "false" labeled data point by some positive margin ∆. In other words, we would want error to be calculated if ($s - s_c < ∆$) and not just when ($s - s_c < 0$). Thus, we modify the optimization objective: minimize $J = \max(\Delta + s_c - s, 0)$
We can scale this margin such that it is ∆ = 1 and let the other parameters in the optimization problem adapt to this without any change in performance, which gives $J = \max(1 + s_c - s, 0)$. Readers who want more detail can look at the derivation of the SVM.
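A small worked example, with made-up scores, shows what the margin changes. Suppose s = 2.0 and s_c = 1.5, so the true window already outscores the corrupt one, but by less than the margin ∆ = 1. An objective without a margin reports no error, while the margin-based objective still does:

```latex
\begin{align*}
\text{without margin:}\quad J &= \max(s_c - s,\ 0) = \max(1.5 - 2.0,\ 0) = 0 \\
\text{with } \Delta = 1\text{:}\quad J &= \max(1 + s_c - s,\ 0) = \max(1 + 1.5 - 2.0,\ 0) = 0.5
\end{align*}
```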
Training with Backpropagation – Vectorized
Neural Networks: Tips and Tricks
Gradient Check
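Given a model with parameter vector θ and loss function J, the numerical gradient around θi is simply given by the centered difference formula: $\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2\epsilon}$, where ε is a small number and $\theta^{(i+)}$, $\theta^{(i-)}$ are copies of θ with the i-th element increased or decreased by ε.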
Now, a natural question you might ask is, if this method is so precise, why do we not use it to compute all of our network gradients instead of applying back-propagation? The simple answer, as hinted earlier, is inefficiency: recall that every time we want to compute the gradient with respect to an element, we need to make two forward passes through the network, which is computationally expensive. Furthermore, many large-scale neural networks can contain millions of parameters, and computing two passes per parameter is clearly not optimal. The numerical estimate is therefore used only to spot-check that our gradients are correct; backpropagation remains the most efficient and practical way to compute them.
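A minimal gradient-check sketch using the centered difference. The toy loss J(θ) = Σ θ_i², the step size `eps`, and the helper name `numerical_gradient` are assumptions for illustration. In practice, J would be the network's loss and the analytic gradient would come from backpropagation.

```python
import numpy as np


def numerical_gradient(J, theta, eps=1e-5):
    """Centered-difference estimate of dJ/dtheta, two evaluations of J per element."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shift = np.zeros_like(theta)
        shift[i] = eps
        grad[i] = (J(theta + shift) - J(theta - shift)) / (2 * eps)
    return grad


# Toy loss with a known gradient: J(theta) = sum(theta^2), so dJ/dtheta = 2 * theta.
def J(theta):
    return np.sum(theta ** 2)


theta = np.array([1.0, -2.0, 0.5])
analytic = 2 * theta
numeric = numerical_gradient(J, theta)

# Relative error between the analytic and numerical gradients.
rel_error = np.linalg.norm(analytic - numeric) / np.linalg.norm(analytic + numeric)
print(rel_error < 1e-7)
```

The loop makes the cost explicit: each parameter needs two evaluations of J, which is why this check is run on a few parameters rather than used for training.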
Regularization
As with many machine learning models, neural networks are highly prone to overfitting, where a model is able to obtain near perfect performance on the training dataset, but loses the ability to generalize to unseen data. A common technique for addressing overfitting is L2 regularization, which only requires appending a regularization term to the loss function J. The modified loss function is $J_R = J + \lambda \sum_{i=1}^{L} \|W^{(i)}\|_F^2$, where $W^{(i)}$ is the weight matrix of the i-th of L layers and λ controls the strength of the penalty.
Role of the regularization term: what regularization is essentially doing is penalizing weights for being too large while optimizing over the original cost function
Due to the quadratic nature of the squared Frobenius norm (the sum of the squared elements of a matrix), L2 regularization effectively reduces the flexibility of the model and thereby reduces the overfitting phenomenon. Imposing such a constraint can also be interpreted as the prior Bayesian belief that the optimal weights are close to zero; how close depends on the value of λ.
Too high a value of λ causes most of the weights to be set too close to 0, and the model does not learn anything meaningful from the training data, often obtaining poor accuracy on training, validation, and testing sets, so λ must be chosen carefully.
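A sketch of the L2-regularized loss, assuming the penalty is λ times the sum of squared entries of every weight matrix (the squared Frobenius norm). The helper names, the choice to penalize only weight matrices and not biases, and the numbers are assumptions for illustration.

```python
import numpy as np


def l2_regularized_loss(J, weights, lam):
    """Return J + lam * sum of squared Frobenius norms of the weight matrices."""
    penalty = sum(np.sum(W ** 2) for W in weights)
    return J + lam * penalty


def l2_penalty_gradient(W, lam):
    """Gradient of lam * ||W||_F^2 with respect to W, to be added to dJ/dW."""
    return 2.0 * lam * W


W1 = np.array([[1.0, -2.0], [0.0, 3.0]])   # made-up weights
W2 = np.array([[0.5, 0.5]])
print(l2_regularized_loss(J=1.0, weights=[W1, W2], lam=0.1))
```

The penalty gradient is proportional to W itself, so every update shrinks each weight toward zero at a rate set by λ.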
Dropout: the idea is simple yet effective. During training, we randomly “drop” a subset of neurons with some probability (1−p) during each forward/backward pass (or equivalently, we keep each neuron alive with probability p). Then, during testing, we use the full network to compute our predictions. The result is that the network typically learns more meaningful information from the data, is less likely to overfit, and usually obtains higher performance overall on the task at hand. One intuitive reason why this technique should be so effective is that what dropout is essentially doing is training exponentially many smaller networks at once and averaging over their predictions.
However, a key subtlety is that in order for dropout to work effectively, the expected output of a neuron during testing should be approximately the same as it was during training; otherwise the magnitude of the outputs could be radically different, and the behavior of the network is no longer well-defined. Thus, we must typically rescale the outputs of each neuron during testing. Since each neuron is kept with probability p during training, its expected training-time output is p times its unmasked output, so the test-time outputs are multiplied by p (equivalently, divided by 1/p), where p is the probability of keeping a neuron alive.
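A sketch of the scaling argument, assuming each neuron is kept with probability p. Masking at training time makes the expected activation p times the unmasked value, so the test-time output is multiplied by p to match. The third function shows the common "inverted dropout" variant, which rescales during training instead. The function names and the all-ones activations are assumptions for illustration.

```python
import numpy as np


def dropout_train(a, p, rng):
    """Training pass: keep each activation with probability p, zero the rest."""
    mask = rng.random(a.shape) < p
    return a * mask


def dropout_test(a, p):
    """Test pass: use every neuron, scaled by p to match the training expectation."""
    return a * p


def inverted_dropout_train(a, p, rng):
    """Alternative: rescale by 1/p during training, so testing needs no scaling."""
    mask = rng.random(a.shape) < p
    return a * mask / p


rng = np.random.default_rng(0)
p = 0.8
a = np.ones(100_000)  # made-up activations, all equal to 1 for easy comparison

print(dropout_train(a, p, rng).mean())           # close to p
print(dropout_test(a, p).mean())                 # equal to p
print(inverted_dropout_train(a, p, rng).mean())  # close to 1
```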
Parameter Initialization
A key step towards achieving superlative performance with a neural network is initializing the parameters in a reasonable way. A good starting strategy is to initialize the weights to small random numbers normally distributed around 0.
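A minimal sketch of this starting strategy. The standard deviation of 0.01, the zero-initialized biases, and the helper name `init_layer` are assumptions for illustration; the text only specifies small random weights normally distributed around 0.

```python
import numpy as np


def init_layer(n_in, n_out, scale=0.01, seed=0):
    """Small zero-mean Gaussian weights and zero biases for one layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=scale, size=(n_out, n_in))
    b = np.zeros(n_out)  # zero biases are an assumption of this sketch
    return W, b


W, b = init_layer(n_in=20, n_out=8)
print(W.shape, b.shape)
print(abs(W.mean()) < 0.01, W.std() < 0.05)
```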