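The Label Propagation Algorithm is a graph-based semi-supervised learning method. The basic idea is to predict the labels of unlabeled nodes from the labels of the labeled ones: the relationships between samples are used to build a complete graph model.
Each node passes its label to its neighbors in proportion to their similarity. At every propagation step, each node updates its own label from the labels of its neighbors; the more similar a neighbor is, the more weight its label carries, so similar nodes tend toward the same label and labels travel most easily between them. Throughout the process the labels of the labeled data are kept fixed, so that they are passed on to the unlabeled data. When the iteration ends, similar nodes have similar probability distributions over the labels and can be assigned to the same class.
It can be used for classification and regression tasks.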
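The propagation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `propagate_labels`, the toy one-dimensional data, and the RBF similarity are all assumptions made for the example.

```python
import numpy as np

def propagate_labels(W, y, n_classes, n_iter=100):
    """Toy label propagation on a dense affinity matrix W.

    y holds class ids for labeled nodes and -1 for unlabeled ones.
    """
    labeled = y != -1
    # Row-normalise the affinities so each row is a set of neighbour weights.
    T = W / W.sum(axis=1, keepdims=True)
    # One-hot distributions for labeled nodes, uniform for the rest.
    F = np.full((len(y), n_classes), 1.0 / n_classes)
    F[labeled] = np.eye(n_classes)[y[labeled]]
    clamped = F[labeled].copy()
    for _ in range(n_iter):
        F = T @ F             # each node takes a weighted average of its neighbours
        F[labeled] = clamped  # labeled nodes are reset, so their labels never change
    return F.argmax(axis=1), F

# Six points on a line: two tight groups with one labeled point in each.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)  # RBF similarities on a complete graph
y = np.array([0, -1, -1, -1, -1, 1])

pred, F = propagate_labels(W, y, n_classes=2)
print(pred)
```

Because the two groups are far apart, the similarities between them are close to zero, and each unlabeled point ends up with the label of the labeled point in its own group.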
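LabelPropagation and LabelSpreading differ in how they modify the graph's similarity matrix and in the clamping effect on the label distributions. Clamping lets the algorithm change the weight of the originally labeled data to some degree. LabelPropagation performs hard clamping of the input labels, which corresponds to α = 0. This clamping factor can be relaxed, to say α = 0.2, which means that 80% of the original label distribution is always retained, while the algorithm may change its confidence in the distribution within the remaining 20%.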
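As a small arithmetic illustration of the clamping factor, consider a single labeled node whose neighbors all disagree with its input label. The helper `clamp_update` is a hypothetical name introduced for this sketch; it shows only the mixing step, not a full algorithm.

```python
import numpy as np

def clamp_update(original, from_neighbors, alpha):
    """One update of a labeled node's label distribution with clamping factor alpha."""
    return alpha * from_neighbors + (1 - alpha) * original

original = np.array([1.0, 0.0])        # the input label says class 0
from_neighbors = np.array([0.0, 1.0])  # the neighbours all say class 1

hard = clamp_update(original, from_neighbors, alpha=0.0)  # hard clamping
soft = clamp_update(original, from_neighbors, alpha=0.2)  # soft clamping
print(hard)
print(soft)
```

With α = 0 the node keeps its input label untouched; with α = 0.2 it keeps 80% of the original distribution and takes 20% from its neighbors, giving 0.8 for class 0 and 0.2 for class 1.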
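LabelPropagation uses the raw similarity matrix built from the data, with no modification. LabelSpreading, by contrast, minimizes a loss function with regularization properties, so it is usually more robust to noise (this is the key point). It iterates on a modified version of the original graph and normalizes the edge weights by computing the normalized graph Laplacian matrix. The same procedure is used in spectral clustering.
LabelSpreading resembles the basic Label Propagation algorithm, but it propagates labels through an affinity matrix based on the normalized graph Laplacian and uses soft clamping.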
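The combination of a normalized affinity matrix and soft clamping can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not scikit-learn's code: the function name `spread_labels`, the toy data, and the RBF similarity are made up for the example, and the scores are left unnormalized because only their argmax is used.

```python
import numpy as np

def spread_labels(W, y, n_classes, alpha=0.2, n_iter=50):
    """Toy label spreading; y uses -1 for unlabeled nodes."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)                 # drop self-loops
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))          # symmetric normalisation D^{-1/2} W D^{-1/2}
    labeled = np.flatnonzero(y != -1)
    Y0 = np.zeros((len(y), n_classes))
    Y0[labeled, y[labeled]] = 1.0            # one-hot rows for labeled nodes, zeros elsewhere
    F = Y0.copy()
    for _ in range(n_iter):
        # soft clamping: mix the propagated scores with the original labels
        F = alpha * (S @ F) + (1 - alpha) * Y0
    return F.argmax(axis=1), F

# Six points on a line: two tight groups with one labeled point in each.
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
W = np.exp(-(X[:, None] - X[None, :]) ** 2)
y = np.array([0, -1, -1, -1, -1, 1])

pred, F = spread_labels(W, y, n_classes=2)
print(pred)
```

Unlike hard clamping, the labeled rows of `F` are recomputed in every iteration, so a labeled point surrounded by neighbors of another class can gradually lose confidence in its input label.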
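For convenience, the data used for this demonstration is the 20newsgroups dataset available through sklearn. In my tests so far, this semi-supervised method works remarkably well on long-text classification; for short texts, feature extraction would have to rely on the current state-of-the-art transformer-based pretrained models.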
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import fetch_20newsgroups
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
# Six of the twenty newsgroups
categories = [
    'rec.autos',
    'talk.politics.guns',
    'talk.politics.mideast',
    'rec.sport.baseball',
    'comp.sys.mac.hardware',
    'soc.religion.christian']
newsgroup_train = fetch_20newsgroups(subset='all', categories=categories)
# Shuffle the document indices reproducibly
rng = np.random.RandomState(0)
indices = np.arange(len(newsgroup_train.target))
rng.shuffle(indices)
# TF-IDF features over unigrams and bigrams
vectorizer = TfidfVectorizer(stop_words='english',
                             max_df=0.65,
                             ngram_range=(1, 2),
                             max_features=15000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
y_train = newsgroup_train.target
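We first train a label propagation model using only 300 labeled points, then select the 10 points the model is most uncertain about and have them labeled. Next we train on the resulting 310 labeled points (the original 300 plus the 10 new ones). Repeating this process about 20 times yields a sizeable amount of labeled data.
You can of course label more points by changing max_iterations. Labeling more points helps us understand how quickly this active learning technique converges.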
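The "most uncertain" selection can be illustrated on its own, using the entropy of each point's predicted label distribution as the uncertainty score. This is a minimal sketch on hand-written probabilities: the helper name `most_uncertain` and the toy array `probs` are assumptions for the example.

```python
import numpy as np
from scipy import stats

def most_uncertain(label_distributions, candidate_indices, k=10):
    """Return up to k candidate indices, ordered from highest to lowest entropy."""
    # stats.entropy works column-wise, so transpose to get one value per sample
    entropies = stats.entropy(label_distributions.T)
    ranked = np.argsort(entropies)[::-1]
    # keep only points that are still unlabeled
    ranked = ranked[np.isin(ranked, candidate_indices)]
    return ranked[:k]

probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # nearly uniform, highest entropy
    [0.60, 0.30, 0.10],
    [0.50, 0.50, 0.00],
])

# Point 3 is treated as already labeled, so it is not a candidate.
picked = most_uncertain(probs, candidate_indices=np.array([0, 1, 2]), k=2)
print(picked)
```

The nearly uniform row is picked first because a flat distribution has the highest entropy; a point that is excluded from the candidates is never returned, however uncertain it is.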
test_num = 2000
X = fea_train[indices[:test_num ]]
y = y_train[indices[:test_num ]]
images = np.array(newsgroup_train.data)[indices[:test_num]]
n_total_samples = len(y)
n_labeled_points = 300
max_iterations = 20
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]
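Take a look at the indices of the unlabeled data. Note that they are positions in the shuffled subset, so the documents behind them are in random order.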
unlabeled_indices
array([ 300,  301,  302, ..., 1997, 1998, 1999])
for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled candidates left to label")
        break

    # Hide the labels of the unlabeled points: -1 marks "unlabeled"
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = LabelSpreading(
        gamma=0.25,  # only used by the rbf kernel; ignored with kernel='knn'
        kernel='knn',
        alpha=0.5,
        n_neighbors=15,
        max_iter=50,
        n_jobs=-1
    )
    lp_model.fit(X.toarray(), y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]

    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("[Iteration] %i %s" % (i, 70 * "_"))
    print("LabelSpreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points,
             n_total_samples))

    # fetch_20newsgroups keeps target_names in sorted order, which is the order
    # of the integer labels, so take the names from the dataset rather than
    # from the hand-written categories list (whose order differs)
    print(classification_report(
        true_labels,
        predicted_labels,
        target_names=newsgroup_train.target_names
    ))

    print("[Confusion matrix]")
    print(cm)

    # compute the entropies of transduced label distributions
    pred_entropies = stats.entropy(lp_model.label_distributions_.T)

    # select up to 10 documents that the classifier is most uncertain about
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.isin(uncertainty_index, unlabeled_indices)][:10]

    # keep track of indices that we get labels for
    delete_indices = np.array([], dtype=int)

    for index, image_index in enumerate(uncertainty_index):
        # `image` is the raw text of the document (name kept from the original code)
        image = images[image_index]
        print('[Most uncertain sample]\n', image)
        print('……………' * 5)
        print("Predicted label: {}\nTrue label: {}".format(
            newsgroup_train.target_names[lp_model.transduction_[image_index]],
            newsgroup_train.target_names[y[image_index]]))
        print('******************' * 5)

        # labeling 10 points, remote from labeled set
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)
    print('========= End of round {} ============'.format(i))
[Iteration] 0 ______________________________________________________________________
LabelSpreading model: 300 labeled & 1700 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.73      0.73      0.73       277
    talk.politics.guns       0.65      0.81      0.72       272
 talk.politics.mideast       0.75      0.76      0.75       273
    rec.sport.baseball       0.91      0.80      0.85       323
 comp.sys.mac.hardware       0.79      0.79      0.79       268
soc.religion.christian       0.93      0.83      0.87       287

              accuracy                           0.79      1700
             macro avg       0.79      0.79      0.79      1700
          weighted avg       0.80      0.79      0.79      1700

[Confusion matrix]
[[203  37  15   5  14   3]
 [ 12 221  20   8  11   0]
 [ 20  26 207   7  10   3]
 [ 16  22   9 257  11   8]
 [ 14  18  17   4 211   4]
 [ 14  15   9   1  11 237]]
[Most uncertain sample]
From: yoony@aix.rpi.edu (Young-Hoon Yoon)
Subject: Re: JFFO has gone a bit too far
Nntp-Posting-Host: aix.rpi.edu
Distribution: usa
Lines: 29
rats@cbnewsc.cb.att.com (Morris the Cat) writes:
>|>Would somebody please post evidence that the gun control act of
>|>1968 is "a verbatim transcription" of a nazi law?
>|The "evidence" is that the two laws are basically identical.
>|However, that's not evidence that one is a copy of the other.
>|There's no evidence that the 68 GCA's authors used the nazi law as a
>|guide. Yes, they ended up with roughly the same thing, but that comes
>|from their shared goal, disarming those menacing minorities.
>I thought the same thing too, until JPFO's RKBA article
>in the latest Guns & Ammo
>at the newstands. This article makes it certain that Sen. Thomas Dodd
>(D-MD?) back before 1968 definitely asked for a translation of the
>German weapons laws back then. Read the article, and see what you think
>of JPFO's argument. They note that Ted Kennedy and John Dingell are
>among the three of the originals left from the 1968 stuff, and they
>are asking that folks request of John Dingell that he introduce
>legislation to lift GCA '68, something which I would support whole-
>heartedly!
>|-andy
Can someone post a general idea of what GCA '68 does?
Thanks.
…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: talk.politics.guns
******************************************************************************************
[Most uncertain sample]
From: lau@aerospace.aero.org (David Lau)
Subject: Re: Accelerating the MacPlus...;)
Nntp-Posting-Host: michigan.aero.org
Organization: The Aerospace Corporation; El Segundo, CA
Lines: 17
Also, if someone would recommend another
> accelerator for the MacPlus, I'd like to hear about it.
>
> Thanks for any time and effort you expend on this!
>
> Karl
Try looking at the Brainstorm Accelerator for the Plus. I believe it is
the best solution because of the performance and price. Why spend $800
upgrading a computer that is only worth $300 ????
The brainstorm accelerator is around $225. It speeds up the internal
clock speed to 16MHz. That may not seem like much but it also speeds up
SCSI transfers. I think that feature is unique to brainstorm.
Check it out.
David Lau
lau@aerospace.aero.org
…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
******************************************************************************************
[Most uncertain sample]
From: C604223@mizzou1.missouri.edu (Cho Chuen Wong)
Subject: Performa Plus monitor
Nntp-Posting-Host: mizzou1.missouri.edu
Organization: University of Missouri
Lines: 3
I would like to know if a Performa Plus monitor is compatible with Apple 14in
Color Display, or it is just a VGA moniro. Any help will be appreciate.
…………………………………………………………………
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
******************************************************************************************
[Most uncertain sample]
From: murthy@ssdsun.asl.dl.nec.com (Vasudev Murthy)
Subject: Re: Saudi clergy condemns debut of human rights group!
Keywords: international, non-usa government, government, civil rights, social issues, politics
Nntp-Posting-Host: ssdsun
Organization: NEC America, Inc Irving TX
Lines: 21
In article <39898@optima.cs.arizona.edu> bakken@cs.arizona.edu (Dave Bakken) writes:
[deleted]
>
>Is this really what you (and Rached and others in the general
>west-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)
>want, Ilyess?
It's noteworthy that the posts about the west being
evil etc are made not in some Islamic hellhole but from
the west. If the west is so bad, why do they come here?
Notice how they comfortably exercise their rights to
free expression, something completely absent in their
own countries.
Vasudev
========= End of round 11 ============
[Iteration] 12 ______________________________________________________________________
LabelSpreading model: 420 labeled & 1580 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.79      0.80      0.80       256
    talk.politics.guns       0.73      0.86      0.79       251
 talk.politics.mideast       0.77      0.81      0.79       250
    rec.sport.baseball       0.96      0.79      0.87       306
 comp.sys.mac.hardware       0.80      0.83      0.81       247
soc.religion.christian       0.91      0.87      0.89       270

              accuracy                           0.83      1580
             macro avg       0.83      0.83      0.83      1580
          weighted avg       0.83      0.83      0.83      1580

[Confusion matrix]
[[204  24  11   1  13   3]
 [  9 217  11   2  11   1]
 [ 12  22 203   2   8   3]
 [ 15  14  18 241  11   7]
 [  9   9  15   1 205   8]
 [  8  10   5   3   9 235]]
[Iteration] 19 ______________________________________________________________________
LabelSpreading model: 890 labeled & 1110 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.93      0.96      0.95       159
    talk.politics.guns       0.92      0.98      0.95       155
 talk.politics.mideast       0.90      0.95      0.92       157
    rec.sport.baseball       0.99      0.93      0.96       241
 comp.sys.mac.hardware       0.97      0.96      0.96       177
soc.religion.christian       0.99      0.95      0.97       221

              accuracy                           0.95      1110
             macro avg       0.95      0.96      0.95      1110
          weighted avg       0.95      0.95      0.95      1110

[Confusion matrix]
[[153   3   2   0   1   0]
 [  1 152   1   0   1   0]
 [  1   5 149   0   2   0]
 [  6   1   6 225   2   1]
 [  0   2   4   0 170   1]
 [  3   3   4   2   0 209]]
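As you can see, after 20 rounds of training we have obtained a sizeable amount of labeled data, and the model's accuracy keeps improving along the way. The examples the machine is unsure about deserve careful study to find out what is going on: were they mislabeled? Are the classes genuinely too close to each other? Or is our classification scheme itself flawed?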
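One way to start that investigation is to look for the class pairs that are confused most often. The sketch below takes the confusion matrix printed for iteration 0 and lists its largest off-diagonal entries; the helper name `most_confused` is an assumption for the example, and classes are referred to by their integer index (the order of `lp_model.classes_`) rather than by name.

```python
import numpy as np

# Confusion matrix from iteration 0 (rows: true class, columns: predicted class)
cm = np.array([
    [203,  37,  15,   5,  14,   3],
    [ 12, 221,  20,   8,  11,   0],
    [ 20,  26, 207,   7,  10,   3],
    [ 16,  22,   9, 257,  11,   8],
    [ 14,  18,  17,   4, 211,   4],
    [ 14,  15,   9,   1,  11, 237],
])

def most_confused(cm, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    off = cm.copy()
    np.fill_diagonal(off, 0)  # ignore correct predictions
    flat_order = np.argsort(off, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, off.shape)
    return [(int(r), int(c), int(off[r, c])) for r, c in zip(rows, cols)]

pairs = most_confused(cm)
print(pairs)
```

For this matrix the three largest errors all land in predicted class 1, which suggests checking whether that class attracts documents from its neighbors or whether its boundary in the label scheme is too loose.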
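For short-text classification, the approach above needs some adapting because of semantic sparsity. For feature extraction, try the currently popular BERT, RoBERTa, XLNet and similar models; if you have tried them and got good results, remember to share them with 折耳喵.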