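The Label Propagation Algorithm is a graph-based semi-supervised learning method. The basic idea is to predict the labels of unlabeled nodes from the label information of the labeled nodes: a complete graph is built over all samples, and the relationships between samples determine how label information passes from one node to another.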
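The idea above can be sketched in a few lines of NumPy. This is only a toy illustration, not scikit-learn's implementation: the function name `propagate_labels`, the RBF edge weights, and the `gamma` and `n_iter` parameters are assumptions made for the example.

```python
import numpy as np

def propagate_labels(X, y, gamma=1.0, n_iter=100):
    """Toy label propagation; y uses -1 for unlabeled samples."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])
    labeled = y != -1

    # Complete graph: an RBF weight between every pair of samples (assumed kernel).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * sq_dists)
    # Row-normalize so that each row is a transition distribution.
    T = W / W.sum(axis=1, keepdims=True)

    # One-hot label matrix; rows of unlabeled samples start at zero.
    Y = (y[:, None] == classes[None, :]).astype(float)

    F = Y.copy()
    for _ in range(n_iter):
        F = T @ F                # push label mass along the edges
        F[labeled] = Y[labeled]  # reset the known labels after every step
    return classes[F.argmax(axis=1)]

# Two well-separated 1-D clusters with one labeled point each.
X = [[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]]
y = [0, -1, -1, 1, -1, -1]
predicted = propagate_labels(X, y)
print(predicted)
```

Each unlabeled point ends up with the label of the labeled point in its own cluster, because the edge weights to the other cluster are close to zero.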
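scikit-learn provides two variants, LabelPropagation and LabelSpreading. They differ in how they modify the graph's similarity matrix and in the clamping effect on the label distributions. Clamping lets the algorithm change the weight of the true labeled data to some degree. LabelPropagation performs hard clamping of the input labels, which corresponds to a = 0. The clamping factor can be relaxed, say to a = 0.2, which means that we always retain 80% of the original label distribution, while the algorithm may change its confidence in that distribution by up to 20%.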
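The arithmetic behind the clamping factor can be shown for a single labeled node. This is a minimal sketch: the helper `clamp` and the example distributions are hypothetical and only illustrate how a mixes the original labels with the propagated ones.

```python
import numpy as np

def clamp(original, propagated, alpha):
    """Blend the original label distribution with the propagated one."""
    original = np.asarray(original, dtype=float)
    propagated = np.asarray(propagated, dtype=float)
    return (1.0 - alpha) * original + alpha * propagated

original = [1.0, 0.0]    # the node is labeled as class 0
propagated = [0.3, 0.7]  # what its neighbours suggest (made-up values)

hard = clamp(original, propagated, alpha=0.0)  # hard clamping keeps the label
soft = clamp(original, propagated, alpha=0.2)  # 80% original, 20% propagated
print(hard, soft)
```

With a = 0 the original label survives unchanged; with a = 0.2 the node keeps 0.8 of its original mass on class 0 and takes 20% of the neighbours' opinion.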
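LabelPropagation uses the raw similarity matrix constructed from the data, without any modification. By contrast, LabelSpreading minimizes a loss function that has regularization properties, so it is often more robust to noise (this is the key point). The algorithm iterates on a modified version of the original graph and normalizes the edge weights by computing the normalized graph Laplacian matrix. The same procedure is also used in spectral clustering.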
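To make the normalization step concrete, the sketch below computes the symmetrically normalized affinity S = D^(-1/2) W D^(-1/2) and the normalized Laplacian L = I - S for a three-node path graph. The function name `normalized_affinity` and the toy graph are assumptions for illustration; the exact matrix scikit-learn builds internally may differ in detail.

```python
import numpy as np

def normalized_affinity(W):
    """Return D^(-1/2) W D^(-1/2) for a symmetric weight matrix W."""
    W = np.asarray(W, dtype=float)
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Path graph 0 - 1 - 2 with unit edge weights.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
S = normalized_affinity(W)
L = np.eye(3) - S  # normalized graph Laplacian
eigenvalues = np.linalg.eigvalsh(L)
print(S)
print(eigenvalues)
```

Each edge weight is divided by the square root of the product of its endpoint degrees, so high-degree nodes do not dominate the propagation.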
LabelSpreading is similar to the basic Label Propagation algorithm, but it propagates labels with an affinity matrix based on the normalized graph Laplacian and uses soft clamping.
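Putting the two ingredients together, a label-spreading style update can be sketched as F <- a * S F + (1 - a) * Y. This is a simplified illustration under stated assumptions (a hand-written weight matrix, the hypothetical function `spread_labels`, a fixed number of iterations), not the scikit-learn code.

```python
import numpy as np

def spread_labels(W, y, alpha=0.2, n_iter=200):
    """Toy label spreading on a weight matrix W; y uses -1 for unlabeled."""
    W = np.asarray(W, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y[y != -1])

    # Normalized affinity matrix S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # One-hot labels; unlabeled rows are all zeros because -1 is not a class.
    Y = (y[:, None] == classes[None, :]).astype(float)

    F = Y.copy()
    for _ in range(n_iter):
        # Soft clamping: mix the propagated scores with the original labels.
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return classes[F.argmax(axis=1)], F

# Chain graph 0 - 1 - 2 - 3 - 4 - 5 with one labeled node at each end.
n = 6
W = np.zeros((n, n))
for a in range(n - 1):
    W[a, a + 1] = W[a + 1, a] = 1.0
y = [0, -1, -1, -1, -1, 1]

labels, scores = spread_labels(W, y)
print(labels)
```

Every node takes the label of the nearer end of the chain. Unlike hard clamping, the labeled rows of `scores` are also updated on each pass, which is what allows noisy input labels to be partly overridden.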
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.datasets import fetch_20newsgroups
from sklearn import datasets
from sklearn.semi_supervised import LabelSpreading
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
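# The six newsgroups named in the classification reports further down.
categories = [
    'comp.sys.mac.hardware',
    'rec.autos',
    'rec.sport.baseball',
    'soc.religion.christian',
    'talk.politics.guns',
    'talk.politics.mideast',
]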
newsgroup_train = fetch_20newsgroups(subset = 'all',categories = categories)
rng = np.random.RandomState(0)
indices = np.arange(len(newsgroup_train.target))
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.65)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
y_train = newsgroup_train.target
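We first train a label propagation model using only 300 labeled points, then select the 10 points the model is most uncertain about and give them their true labels. Next, we train with these 310 labeled points (the original 300 plus the 10 new ones). Repeating this process 20 times yields a sizable set of labeled data.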
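The selection step can be sketched on its own: compute the entropy of each predicted label distribution and keep the k unlabeled points with the highest entropy. The helper `most_uncertain` and the small example matrix are hypothetical, and the rows are assumed to be normalized probability distributions.

```python
import numpy as np

def most_uncertain(label_distributions, unlabeled_indices, k=10):
    """Indices of the k unlabeled rows with the highest entropy."""
    p = np.asarray(label_distributions, dtype=float)
    # Shannon entropy per row, treating 0 * log(0) as 0.
    log_p = np.log(p, where=p > 0, out=np.zeros_like(p))
    entropies = -(p * log_p).sum(axis=1)
    # Sort from most to least uncertain, then keep only unlabeled points.
    order = np.argsort(entropies)[::-1]
    order = order[np.isin(order, unlabeled_indices)]
    return order[:k]

distributions = [
    [1.0, 0.0],  # index 0: already labeled, fully certain
    [0.5, 0.5],  # index 1: maximally uncertain
    [0.9, 0.1],  # index 2: fairly certain
    [0.6, 0.4],  # index 3: quite uncertain
]
picked = most_uncertain(distributions, unlabeled_indices=[1, 2, 3], k=2)
print(picked)
```

Points whose predicted distribution is close to uniform are the ones where an extra true label is likely to be most informative, which is why they are chosen for labeling in each round.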
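test_num = 2000  # work with the first 2000 documents
X = fea_train[indices[:test_num]]
y = y_train[indices[:test_num]]
# `images` holds the raw post texts; the variable name is kept from the original code
images = np.array(newsgroup_train.data)[indices[:test_num]]
n_total_samples = len(y)
n_labeled_points = 300
max_iterations = 20
# everything after the first 300 points starts out unlabeled
unlabeled_indices = np.arange(n_total_samples)[n_labeled_points:]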
array([ 300,  301,  302, ..., 1997, 1998, 1999])
for i in range(max_iterations):
    if len(unlabeled_indices) == 0:
        print("No unlabeled items left to label.")
        break

    # Hide the labels of the unlabeled points; -1 marks "unlabeled".
    y_train = np.copy(y)
    y_train[unlabeled_indices] = -1

    lp_model = LabelSpreading(
        kernel='knn',  # assumed: n_neighbors only takes effect with the knn kernel
        alpha=0.5,
        n_neighbors=15,
        n_jobs=-1,
    )
    lp_model.fit(X.toarray(), y_train)

    predicted_labels = lp_model.transduction_[unlabeled_indices]
    true_labels = y[unlabeled_indices]
    cm = confusion_matrix(true_labels, predicted_labels,
                          labels=lp_model.classes_)

    print("[Iteration] %i %s" % (i, 70 * "_"))
    print("LabelSpreading model: %d labeled & %d unlabeled (%d total)"
          % (n_labeled_points, n_total_samples - n_labeled_points,
             n_total_samples))
    # Class names in label order, taken from the dataset itself.
    target_names = newsgroup_train.target_names
    print(classification_report(true_labels, predicted_labels,
                                labels=lp_model.classes_,
                                target_names=target_names))
    print(cm)

    # Compute the entropies of the transduced label distributions.
    pred_entropies = stats.distributions.entropy(
        lp_model.label_distributions_.T)

    # Select up to 10 documents that the classifier is most uncertain about.
    uncertainty_index = np.argsort(pred_entropies)[::-1]
    uncertainty_index = uncertainty_index[
        np.isin(uncertainty_index, unlabeled_indices)][:10]

    # Keep track of the indices that we get labels for.
    delete_indices = np.array([], dtype=int)

    for index, image_index in enumerate(uncertainty_index):
        image = images[image_index]  # raw text of the selected post

        if i < max_iterations:
            print(image)
            print("Predicted label: {}\nTrue label: {}".format(
                newsgroup_train.target_names[lp_model.transduction_[image_index]],
                newsgroup_train.target_names[y[image_index]]))

        # Label these 10 points, which are remote from the labeled set.
        delete_index, = np.where(unlabeled_indices == image_index)
        delete_indices = np.concatenate((delete_indices, delete_index))

    unlabeled_indices = np.delete(unlabeled_indices, delete_indices)
    n_labeled_points += len(uncertainty_index)
    print('========= Round {} finished ============'.format(i))
[Iteration] 0 ______________________________________________________________________
LabelSpreading model: 300 labeled & 1700 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.73      0.73      0.73       277
    talk.politics.guns       0.65      0.81      0.72       272
 talk.politics.mideast       0.75      0.76      0.75       273
    rec.sport.baseball       0.91      0.80      0.85       323
 comp.sys.mac.hardware       0.79      0.79      0.79       268
soc.religion.christian       0.93      0.83      0.87       287

              accuracy                           0.79      1700
             macro avg       0.79      0.79      0.79      1700
          weighted avg       0.80      0.79      0.79      1700

[[203  37  15   5  14   3]
 [ 12 221  20   8  11   0]
 [ 20  26 207   7  10   3]
 [ 16  22   9 257  11   8]
 [ 14  18  17   4 211   4]
 [ 14  15   9   1  11 237]]
From: yoony@aix.rpi.edu (Young-Hoon Yoon)
Subject: Re: JFFO has gone a bit too far
Nntp-Posting-Host: aix.rpi.edu
Distribution: usa
Lines: 29
rats@cbnewsc.cb.att.com (Morris the Cat) writes:
>|>Would somebody please post evidence that the gun control act of
>|>1968 is "a verbatim transcription" of a nazi law?
>|The "evidence" is that the two laws are basically identical.
>|However, that's not evidence that one is a copy of the other.
>|There's no evidence that the 68 GCA's authors used the nazi law as a
>|guide. Yes, they ended up with roughly the same thing, but that comes
>|from their shared goal, disarming those menacing minorities.
>I thought the same thing too, until JPFO's RKBA article
>in the latest Guns & Ammo
>at the newstands. This article makes it certain that Sen. Thomas Dodd
>(D-MD?) back before 1968 definitely asked for a translation of the
>German weapons laws back then. Read the article, and see what you think
>of JPFO's argument. They note that Ted Kennedy and John Dingell are
>among the three of the originals left from the 1968 stuff, and they
>are asking that folks request of John Dingell that he introduce
>legislation to lift GCA '68, something which I would support whole-
Can someone post a general idea of what GCA '68 does?
Predicted label: comp.sys.mac.hardware
True label: talk.politics.guns
From: lau@aerospace.aero.org (David Lau)
Subject: Re: Accelerating the MacPlus...;)
Nntp-Posting-Host: michigan.aero.org
Organization: The Aerospace Corporation; El Segundo, CA
Lines: 17
Also, if someone would recommend another
> accelerator for the MacPlus, I'd like to hear about it.
> Thanks for any time and effort you expend on this!
> Karl
Try looking at the Brainstorm Accelerator for the Plus. I believe it is
the best solution because of the performance and price. Why spend $800
upgrading a computer that is only worth $300 ????
The brainstorm accelerator is around $225. It speeds up the internal
clock speed to 16MHz. That may not seem like much but it also speeds up
SCSI transfers. I think that feature is unique to brainstorm.
Check it out.
David Lau
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
From: C604223@mizzou1.missouri.edu (Cho Chuen Wong)
Subject: Performa Plus monitor
Nntp-Posting-Host: mizzou1.missouri.edu
Organization: University of Missouri
Lines: 3
I would like to know if a Performa Plus monitor is compatible with Apple 14in
Color Display, or it is just a VGA moniro. Any help will be appreciate.
Predicted label: comp.sys.mac.hardware
True label: comp.sys.mac.hardware
From: murthy@ssdsun.asl.dl.nec.com (Vasudev Murthy)
Subject: Re: Saudi clergy condemns debut of human rights group!
Keywords: international, non-usa government, government, civil rights, social issues, politics
Nntp-Posting-Host: ssdsun
Organization: NEC America, Inc Irving TX
Lines: 21
In article <39898@optima.cs.arizona.edu> bakken@cs.arizona.edu (Dave Bakken) writes:
>Is this really what you (and Rached and others in the general
>west-is-evil-zionists-rule-hate-west-or-you-are-a-puppet crowd)
>want, Ilyess?
It's noteworthy that the posts about the west being
evil etc are made not in some Islamic hellhole but from
the west. If the west is so bad, why do they come here?
Notice how they comfortably exercise their rights to
free expression, something completely absent in their
own countries.
========= Round 11 finished ============
[Iteration] 12 ______________________________________________________________________
LabelSpreading model: 420 labeled & 1580 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.79      0.80      0.80       256
    talk.politics.guns       0.73      0.86      0.79       251
 talk.politics.mideast       0.77      0.81      0.79       250
    rec.sport.baseball       0.96      0.79      0.87       306
 comp.sys.mac.hardware       0.80      0.83      0.81       247
soc.religion.christian       0.91      0.87      0.89       270

              accuracy                           0.83      1580
             macro avg       0.83      0.83      0.83      1580
          weighted avg       0.83      0.83      0.83      1580

[[204  24  11   1  13   3]
 [  9 217  11   2  11   1]
 [ 12  22 203   2   8   3]
 [ 15  14  18 241  11   7]
 [  9   9  15   1 205   8]
 [  8  10   5   3   9 235]]
[Iteration] 19 ______________________________________________________________________
LabelSpreading model: 890 labeled & 1110 unlabeled (2000 total)
                        precision    recall  f1-score   support

             rec.autos       0.93      0.96      0.95       159
    talk.politics.guns       0.92      0.98      0.95       155
 talk.politics.mideast       0.90      0.95      0.92       157
    rec.sport.baseball       0.99      0.93      0.96       241
 comp.sys.mac.hardware       0.97      0.96      0.96       177
soc.religion.christian       0.99      0.95      0.97       221

              accuracy                           0.95      1110
             macro avg       0.95      0.96      0.95      1110
          weighted avg       0.95      0.95      0.95      1110

[[153   3   2   0   1   0]
 [  1 152   1   0   1   0]
 [  1   5 149   0   2   0]
 [  6   1   6 225   2   1]
 [  0   2   4   0 170   1]
 [  3   3   4   2   0 209]]
