[Paper Notes] Few-Shot Text Classification Explained Simply (1)

April 8, 2020 · 深度学习自然语言处理

 This note summarises the main points of Few-Shot Text Classification with Distributional Signatures (ICLR 2020). I hope it helps you understand the original paper. (The authors have also released their code.) Please feel free to correct me if there are any mistakes.


 Please TAKE CARE and STAY SAFE from COVID-19.


 Enjoy the Super Pink Moon tonight!


Prior Knowledge

No prior knowledge of few-shot learning is required. The only things you need to know are the general procedure for training neural networks and linear regression. You may notice that the authors use attention. Please DO NOT worry about which kind of attention they use; I will explain it with a clear example.


The note includes:

- What problem is this work solving <--

- How to train the model (with a step-by-step example) <--

- How to test the model (with a step-by-step example)

- What attention is it using (with a step-by-step example)



 Problem

The situation is that we have many instances for some classes, but only a few training instances (in the most extreme case, only one) for other classes.


For example, let us say that we have 2 datasets. In the first dataset, we have 3 classes, and each class has many instances for training a classifier. In the second dataset, however, we have 2 totally different classes and only a few training instances for each of them, as shown below.


The research problem is whether a model trained on the 1st dataset can adapt quickly to the 2nd dataset. In other words, we will retrain the trained model on the 2nd dataset and see whether it performs well when we only have a few instances for training the 2 new classes.


You may think about using transfer learning. However, according to the results reported in the paper (Table 1), their method achieved much better performance than transfer learning in this setting.


Another thing worth mentioning is that this kind of problem might be easier to solve in computer vision, because the lower hidden layers are able to capture low-level image features, as shown below. For NLP tasks, however, it may not be so easy to capture low-level features that generalise well to a new dataset.


All in all, the problem is how to train a model on a dataset that has many instances for each class, and how to adapt this model to a new dataset that has only a few instances for each class. The classes in the new dataset are unseen in the first one.


 Train

In this section, we will focus on how to train the model by using the 1st dataset. As shown in the Figure below, there are 3 classes in the dataset. 

Step 1: 

We sample N classes from all the classes in the dataset. (Please note: here we use N=2 purely for demonstration. It does not mean that N=2 is a good setting in practice. Please refer to the paper for the number of classes they actually sampled.)


Step 2:

For each class sampled previously, we sample K instances. These K instances form what the paper calls the support set. (Please note again: here we use K=1 purely for demonstration!)


Step 3:

We convert each word in each sentence to a vector (i.e., word embedding). In this example, the instance x1 has 3 words while the x2 has 2 words.
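To make Step 3 concrete, here is a minimal sketch of an embedding lookup. The vocabulary, the sentences and the random embedding table are all made up for illustration; they are not taken from the paper, which works with real word embeddings.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table (random values, illustration only).
rng = np.random.default_rng(0)
vocab = {"very": 0, "good": 1, "movie": 2, "bad": 3, "plot": 4}
embedding_dim = 4
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

def embed(sentence):
    """Map a whitespace-tokenised sentence to a (num_words, embedding_dim) matrix."""
    ids = [vocab[word] for word in sentence.split()]
    return embedding_table[ids]

x1 = embed("very good movie")  # instance with 3 words
x2 = embed("bad plot")         # instance with 2 words
print(x1.shape, x2.shape)      # (3, 4) (2, 4)
```

Each instance becomes a matrix with one row per word, so instances of different lengths have different numbers of rows.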


Step 4:

In this step, we feed the sampled instances into an Attention Generator. The output of this generator is an attention score for each word. The score indicates how important each word is within its instance.


We can safely SKIP the details of the Attention Generator since we will describe it later. For now, we can keep in mind that the output of this generator is attention scores.


Step 5:

Next, we calculate the weighted sum of the word embedding, by using the attention scores.


To summarise what we did so far: we have simply converted each instance (e.g., a sentence) to a vector.
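The weighted sum in Step 5 can be sketched as follows. The word vectors and attention scores are hand-picked toy numbers, and the sketch assumes that the scores of one instance are non-negative and sum to 1.

```python
import numpy as np

def sentence_vector(word_vectors, attention_scores):
    """Weighted sum of word vectors using one attention score per word."""
    word_vectors = np.asarray(word_vectors, dtype=float)          # (num_words, dim)
    attention_scores = np.asarray(attention_scores, dtype=float)  # (num_words,)
    return attention_scores @ word_vectors                        # (dim,)

# Toy instance with 3 words and 2-dimensional embeddings (made-up values).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.5, 0.25, 0.25]
vec = sentence_vector(words, scores)
print(vec)
```

Here the result is the vector (0.75, 0.5): the first word contributes half of the sum because it has the largest score.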


Step 6:

We feed the 2 sentence vectors (N x K = 2 x 1 = 2) to a classifier and compute the classifier parameters that best fit these 2 instances. The classifier is a ridge regressor, and a ridge regressor has an analytical (closed-form) solution for its optimal parameters, so the best parameter values can be obtained immediately.


If you don't know what a ridge regressor is, you can simply Google or Baidu it. You will figure it out quickly, as it is not a complicated model. HOWEVER, you can safely skip it and go ahead with the next steps.
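For readers who prefer code, here is a minimal sketch of the closed-form ridge solution, W = (X^T X + lam * I)^(-1) X^T Y, with one-hot labels Y. This is the textbook form of ridge regression; the function name `fit_ridge`, the regularisation value `lam` and the toy vectors are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def fit_ridge(support_vectors, support_labels, num_classes, lam=1.0):
    """Closed-form ridge regression onto one-hot labels."""
    X = np.asarray(support_vectors, dtype=float)          # (n, d)
    Y = np.eye(num_classes)[np.asarray(support_labels)]   # (n, num_classes), one-hot
    d = X.shape[1]
    # Solve (X^T X + lam * I) W = X^T Y instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)  # (d, num_classes)

# Two support instances (N=2 classes, K=1 instance each), made-up sentence vectors.
support_vectors = [[1.0, 0.0], [0.0, 1.0]]
support_labels = [0, 1]
W = fit_ridge(support_vectors, support_labels, num_classes=2, lam=1.0)
print(W)
```

With these toy inputs, X^T X is the identity matrix, so W works out to 0.5 times the identity. No gradient steps are needed to obtain W, which is the point made in Step 6.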


Step 7:

At this point, we sample another L=2 instances for each sampled class. (In fact, in the sampling step we draw L + K = 2 + 1 = 3 instances per class at once and split them into 2 subsets. The set of K instances is the support set mentioned in Step 2, and the set of L instances is called the query set.)


Please also note that L=2 is used here purely for demonstration. In practice, this number depends on your own case.
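The sampling described in Steps 1, 2 and 7 can be sketched like this. The function name `sample_episode`, the class names and the placeholder instances are hypothetical; only the procedure (pick N classes, draw K + L instances per class, split them into support and query sets) follows the text above.

```python
import random

def sample_episode(data_by_class, n_classes, k_support, l_query, rng):
    """Sample N classes, then K support and L query instances per class."""
    classes = rng.sample(sorted(data_by_class), n_classes)
    support, query = [], []
    for label in classes:
        # Draw K + L distinct instances at once, then split them.
        picked = rng.sample(data_by_class[label], k_support + l_query)
        support += [(text, label) for text in picked[:k_support]]
        query += [(text, label) for text in picked[k_support:]]
    return support, query

# Toy dataset with 3 classes (placeholder strings stand in for sentences).
data_by_class = {
    "sports": ["s1", "s2", "s3", "s4"],
    "politics": ["p1", "p2", "p3", "p4"],
    "science": ["c1", "c2", "c3", "c4"],
}
support, query = sample_episode(
    data_by_class, n_classes=2, k_support=1, l_query=2, rng=random.Random(0)
)
print(len(support), len(query))  # 2 4
```

Because the K + L instances of a class are drawn together without replacement, no instance appears in both the support set and the query set.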


Step 8:

We run the same procedure as we did before to convert each instance to a vector.


Step 9:

We then feed these vectors to the classifier. Based on the predicted results, we compute the loss function (i.e., cross-entropy) on the instances in the query set and subsequently update the model parameters.
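As a rough illustration of the loss in Step 9, the sketch below scores two query vectors with an already-fitted weight matrix and computes the softmax cross-entropy. The numbers are made up, and applying a plain softmax directly to the regressor outputs is a simplifying assumption of this sketch; please check the paper for how the authors turn the outputs into probabilities.

```python
import numpy as np

def cross_entropy(scores, labels):
    """Mean softmax cross-entropy of class scores against integer labels."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Hypothetical query sentence vectors and classifier weights fitted on the support set.
query_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[0.5, 0.0], [0.0, 0.5]])
query_labels = [0, 1]

loss = cross_entropy(query_vectors @ W, query_labels)
print(loss)
```

In a real training loop this loss would be differentiated with an automatic differentiation library, which NumPy alone does not provide.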

The model parameters here are actually the parameters in the Attention Generator. The reason is as follows:


Please note that the Attention Generator is used twice in each episode. First, we convert the K instances in the support set to vectors with the Attention Generator, and these vectors are used to compute the analytical best parameters of the classifier. Then we convert the L instances in the query set to vectors with the same Attention Generator, and compute the loss function from the classifier's predictions. Therefore: 1) optimising the loss function involves applying the Attention Generator twice, once on the support set and once on the query set; 2) optimising the loss function actually updates the Attention Generator parameters. The classifier parameters are solved analytically, so they do not need to be learned by gradient descent.


Based on the description above, you may observe that: one advantage of using the analytical solution of the classifier parameters is that we can learn the Attention Generator in an end-to-end manner by just optimising the cross-entropy loss function.


- Ending -

Running the whole process from Step 1 to Step 9 once is called one episode. In the training phase, many episodes are needed in practice.


Next: we will discuss how they adapt the model to a new dataset that includes new classes unseen in the old dataset, and how they evaluate it there.


 Please include or cite the original link address when you are printing, distributing or translating this article, or when you are using any figure, table or example from here.

