PinSAGE Retrieval Model and Source Code Analysis (2): The Data Pipeline

December 1, 2020, AINLP

Enough talk, show me the code!

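The DGL implementation of PinSAGE is available on GitHub. The code is cleanly layered, unlike toy examples that jumble every stage of the program together. The analysis that follows is split into three parts: supplying training data, the model's modules, and training. This chapter is the first part and explains how data is fed to the model during training.

As the analysis of PinSAGE in the previous installment showed, the model itself is implemented no differently from an ordinary GraphSAGE. Its main improvements all sit in the "supplying training data" stage: generating mini-batches, sampling negatives, building the computation subgraph that each convolution layer needs for a mini-batch, and removing the relevant edges from that subgraph. These steps are the key to understanding PinSAGE, and they all happen in sampler.py. To read sampler.py, the important thing is to tell apart the various concepts that appear in the code, so it is worth sorting them out first.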

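Sorting Out the Concepts

g: the original graph

The original bipartite graph made up of user and item nodes.
Neighbor sampling and negative sampling both run on the original graph, but only on a subset of its nodes. The DGL example implements item-to-item retrieval, so both kinds of sampling touch only the item-type nodes.
It is independent of the batch and of the number of convolution layers.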

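heads, tails, neg_tails

heads: batch_size item nodes sampled from all item nodes of the original graph for each batch.
tails: the item nodes reached from the heads by a two-hop random walk (item→user→item). Because each tail was consumed by a user who also consumed the corresponding head, the two are assumed to be intrinsically similar, and the tails serve as the positive samples for the heads.
neg_tails: another batch_size item nodes sampled from all item nodes of the original graph for each batch. These randomly sampled nodes serve as the negative samples for the heads.
They belong to a particular batch but are independent of any particular convolution layer.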
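
To make the three node lists concrete, here is a minimal pure-Python sketch of the same idea on a toy interaction table. It does not use DGL; the data, the names USER_ITEMS, ITEM_USERS and sample_batch, and the choice to drop heads that no user has consumed are all assumptions made for illustration.

```python
import random

# Hypothetical toy interactions: user -> consumed items.
USER_ITEMS = {"u0": [0, 1], "u1": [1, 2], "u2": [3]}
NUM_ITEMS = 5  # item 4 was never consumed by anyone

# Reverse index: item -> users who consumed it.
ITEM_USERS = {}
for user, items in USER_ITEMS.items():
    for item in items:
        ITEM_USERS.setdefault(item, []).append(user)


def sample_batch(batch_size, rng):
    """Return (heads, tails, neg_tails) for one batch of item triples."""
    heads, tails, neg_tails = [], [], []
    for _ in range(batch_size):
        head = rng.randrange(NUM_ITEMS)
        users = ITEM_USERS.get(head)
        if not users:
            continue  # the walk cannot leave this item, so the triple is dropped
        user = rng.choice(users)             # hop 1: item -> user
        tail = rng.choice(USER_ITEMS[user])  # hop 2: user -> item
        heads.append(head)
        tails.append(tail)
        neg_tails.append(rng.randrange(NUM_ITEMS))  # uniform random negative
    return heads, tails, neg_tails


heads, tails, neg_tails = sample_batch(8, random.Random(0))
```

The three lists stay aligned position by position, so the i-th head is compared against the i-th tail and the i-th negative.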

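pos_graph and neg_graph

A local graph built only from the heads, tails, and neg_tails nodes. Because it contains only a subset of nodes, it does not follow the item ID space of the original graph, and its nodes have to be renumbered.
The heads→tails edges of this local graph form pos_graph. The scores on these edges are the positive scores in the pairwise loss.
The heads→neg_tails edges of this local graph form neg_graph. The scores on these edges are the negative scores in the pairwise loss.
Note that pos_graph and neg_graph contain the same nodes, so seeds = pos_graph.ndata[dgl.NID] represents every node in this local graph, i.e. seeds = heads + tails + neg_tails.
Note that pos_graph and neg_graph also contain the same number of edges: each positive edge is paired with exactly one negative edge. This is a limitation, because in practice one positive edge often needs to be compared against several negative edges.
They belong to a particular batch but are independent of any particular convolution layer.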
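
As an illustration of how the paired edge scores might enter a pairwise loss, the sketch below uses a hinge (max-margin) form. The text above only says "pairwise loss", so the hinge form, the margin of 1.0, the averaging, and the function name margin_loss are assumptions.

```python
def margin_loss(pos_scores, neg_scores, margin=1.0):
    """Mean hinge loss over paired positive/negative edge scores.

    pos_scores[i] is the score of the i-th heads->tails edge and
    neg_scores[i] the score of the matching heads->neg_tails edge.
    """
    assert len(pos_scores) == len(neg_scores)
    losses = [max(0.0, neg - pos + margin)
              for pos, neg in zip(pos_scores, neg_scores)]
    return sum(losses) / len(losses)


# Three positive edges and their three paired negative edges (made-up scores).
loss = margin_loss([2.0, 0.5, 1.0], [0.0, 1.0, 0.5])
```

Because the two edge lists have equal length and are aligned, the loss is a plain element-wise comparison; comparing one positive against several negatives would need a different pairing scheme.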

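frontier

Belongs to one convolution layer within one batch.
On the original graph g (not on pos_graph or neg_graph), random walks start from the seed item nodes and perform importance sampling to find the most important neighbors of each seed. The resulting graph, called the frontier, contains all item nodes (so item IDs are the same as in g) but has edges only between the seed item nodes and their most important neighbors.
Because neighbor sampling proceeds backwards from the top layer to the bottom, the seeds of layer N are the input nodes of layer N+1, and the seeds of the topmost layer are all the item nodes in heads+tails+neg_tails.
For inference alone, the steps above would be enough. During training, however, to prevent information leakage, any "positive edges" (heads⇒tails) and "negative edges" (heads⇒neg_tails) that happen to exist in the frontier must also be removed.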
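
The importance sampling behind a frontier can be sketched without DGL as follows: run short random walks with restart from each seed, count how often every other item is visited, and keep the most visited ones as neighbors. The toy data, all function and parameter names, the default values, and the decision not to count visits to the seed itself are assumptions of this sketch, not a description of DGL's internals.

```python
import random
from collections import Counter

# Hypothetical toy interactions: user -> consumed items.
USER_ITEMS = {"u0": [0, 1, 2], "u1": [1, 2], "u2": [2, 3]}
ITEM_USERS = {}
for user, items in USER_ITEMS.items():
    for item in items:
        ITEM_USERS.setdefault(item, []).append(user)


def important_neighbors(seed, rng, num_walks=50, walk_length=2,
                        restart_prob=0.5, top_k=2):
    """Rank items by how often random walks from `seed` visit them."""
    visits = Counter()
    for _ in range(num_walks):
        current = seed
        for _ in range(walk_length):
            if rng.random() < restart_prob:
                current = seed  # restart at the seed item
                continue
            user = rng.choice(ITEM_USERS[current])  # item -> user
            current = rng.choice(USER_ITEMS[user])  # user -> item
            if current != seed:
                visits[current] += 1
    return visits.most_common(top_k)  # [(neighbor, visit_count), ...]


def build_frontier(seeds, rng, **walk_args):
    """Edges (neighbor, seed, visit_count), all in ORIGINAL item IDs."""
    return [(neighbor, seed, count)
            for seed in seeds
            for neighbor, count in important_neighbors(seed, rng, **walk_args)]


frontier = build_frontier([0, 3], random.Random(0))
```

Every edge points from a sampled neighbor to a seed and still uses the ID space of the original graph, which is why a separate compaction step is needed before message passing.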

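block

Belongs to one convolution layer within one batch.
A special bipartite graph structure used for message passing.
The frontier still contains every item node of the original graph, whereas the block generated from it contains only the seed nodes, the neighbors pointing to them, and the edges between them.
Because a block drops the irrelevant nodes, message passing over it is more efficient, but its nodes have to be renumbered.
Rather than the src/dst nodes of an ordinary graph, the more common notions for a block are input and output nodes, and all output nodes are placed at the head of the input nodes.
blocks[0].srcnodes: all the input nodes that must take part in the computation to produce the target nodes (here heads + tails + neg_tails).
blocks[-1].dstnodes: the target nodes we are interested in (here heads + tails + neg_tails).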
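
The renumbering described above, with output nodes occupying the head of the input nodes, can be illustrated in plain Python. The function name compact_to_block and the tuple it returns are hypothetical; this is only a sketch of the ordering convention, under the assumption that the seeds are unique.

```python
def compact_to_block(frontier_edges, seeds):
    """Relabel a frontier so output nodes come first among the input nodes.

    frontier_edges: list of (src, dst) pairs in original item IDs.
    seeds: unique output (dst) nodes in original item IDs.
    Returns (input_nodes, num_output_nodes, local_edges).
    """
    input_nodes = list(seeds)  # output nodes take local IDs 0..len(seeds)-1
    local_id = {node: i for i, node in enumerate(input_nodes)}
    seed_set = set(seeds)
    local_edges = []
    for src, dst in frontier_edges:
        if dst not in seed_set:
            continue  # edges that do not point at a seed are irrelevant
        if src not in local_id:
            local_id[src] = len(input_nodes)  # new neighbor gets the next ID
            input_nodes.append(src)
        local_edges.append((local_id[src], local_id[dst]))
    return input_nodes, len(seeds), local_edges


edges = [(7, 2), (9, 2), (2, 5), (7, 5)]
input_nodes, num_out, local_edges = compact_to_block(edges, seeds=[2, 5])
# input_nodes is [2, 5, 7, 9]: the two seeds first, then their neighbors.
```

With this layout, the first num_out rows of any input feature matrix already belong to the output nodes, so a layer can read a node's own features without an extra lookup.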

Entry Point for Supplying Data

See the train function in model.py.

# Samples three node lists of size batch_size: heads, tails, neg_tails
batch_sampler = sampler_module.ItemToItemBatchSampler(g, user_ntype, item_ntype, args.batch_size)

# Holds the actual neighbor sampling logic.
# Given the heads, tails, neg_tails of one batch from batch_sampler:
# heads-->tails form the positive graph
# heads-->neg_tails form the negative graph
# then search backwards from heads+tails+neg_tails to build the block each convolution layer needs
neighbor_sampler = sampler_module.NeighborSampler(
    g, user_ntype, item_ntype, args.random_walk_length,
    args.random_walk_restart_prob, args.num_random_walks, args.num_neighbors,
    args.num_layers)

# Lightweight logic. Given one batch:
# 1. pass the batch's heads, tails, neg_tails to neighbor_sampler,
# 2. which generates pos_graph, neg_graph, and blocks from them,
# 3. then copy the node features of the original graph into the nodes of the blocks
collator = sampler_module.PinSAGECollator(neighbor_sampler, g, item_ntype, textset)

dataloader = DataLoader(
    batch_sampler,  # each call yields one batch of heads, tails, and neg_tails
    collate_fn=collator.collate_train,  # builds pos_graph, neg_graph, and blocks from heads+tails+neg_tails
    num_workers=args.num_workers)

dataloader_test = DataLoader(
    torch.arange(g.number_of_nodes(item_ntype)),  # all item nodes of the original graph
    batch_size=args.batch_size,
    # Generates blocks only. Note that this function is only suitable for evaluation
    # during training, not for producing the embeddings served online,
    # because building the blocks still relies on neighbor sampling,
    # whereas embeddings for serving must be computed by convolving over all of a node's neighbors.
    collate_fn=collator.collate_test,
    num_workers=args.num_workers)

ItemToItemBatchSampler

See the ItemToItemBatchSampler class in sampler.py. It samples from all item nodes to produce the three kinds of nodes in a batch: heads, tails, and neg_tails.

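class ItemToItemBatchSampler(IterableDataset):

    def __iter__(self):
        while True:
            # Sample random items as heads
            heads = torch.randint(0, self.g.number_of_nodes(self.item_type), (self.batch_size,))

            # Two-hop walk to another item consumed by the same user as each head; these are the positives.
            # This still has several shortcomings:
            # 1. such walks inevitably concentrate the positives on a few popular items
            # 2. the walk can land back on the starting item (always, if the item's only user
            #    consumed nothing else); this corner case still needs handling
            tails = dgl.sampling.random_walk(
                self.g,
                heads,
                metapath=[self.item_to_user_etype, self.user_to_item_etype])[0][:, 2]

            # Sample random items as negatives.
            # Having no hard negatives is acceptable,
            # but what if a random negative really was consumed by the same user? How is that corner case handled?
            neg_tails = torch.randint(0, self.g.number_of_nodes(self.item_type), (self.batch_size,))

            # Drop triples whose walk could not complete (random_walk marks those with -1)
            mask = (tails != -1)
            yield heads[mask], tails[mask], neg_tails[mask]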
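
The corner cases raised in the comments, together with the one-negative-per-positive limitation noted earlier, could be handled by a small post-processing step. The sketch below is one hypothetical way to do it in plain Python; clean_and_expand and num_negatives are invented names, and nothing like this exists in the DGL example.

```python
import random


def clean_and_expand(heads, tails, num_items, num_negatives, rng):
    """Drop degenerate positives and draw several negatives per kept pair.

    Assumes num_items >= 3 so that a valid negative always exists.
    Returns a list of (head, tail, [negatives]) triples.
    """
    triples = []
    for head, tail in zip(heads, tails):
        if tail == -1 or tail == head:
            continue  # failed walk, or the walk returned to its start
        negatives = []
        while len(negatives) < num_negatives:
            candidate = rng.randrange(num_items)
            if candidate not in (head, tail):  # reject trivial collisions
                negatives.append(candidate)
        triples.append((head, tail, negatives))
    return triples


triples = clean_and_expand([0, 1, 2, 3], [1, 1, -1, 0],
                           num_items=10, num_negatives=3,
                           rng=random.Random(0))
# Pairs (1, 1) and (2, -1) are dropped; (0, 1) and (3, 0) are kept.
```

Rejecting a negative that was co-consumed with the head would additionally need a lookup into the interaction graph, which this sketch leaves out.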

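NeighborSampler.sample_blocks

Starting from the seeds (in practice the heads+tails+neg_tails of a batch), this function works backwards to generate the block that each convolution layer needs.

Two points deserve attention:

Random-walk-based sampling of important neighbors is already implemented by DGL in the dgl.sampling.PinSAGESampler class, and its documentation is very clear: "The edges of the returned homogeneous graph will connect to the given nodes from their most commonly visited nodes, with a feature indicating the number of visits".
Note that the code below first removes the heads→tails and heads→neg_tails edges from the frontier and only then builds the block, to avoid information leakage.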

class NeighborSampler(object):
    def __init__(self, g, user_type, item_type, random_walk_length, random_walk_restart_prob,
                 num_random_walks, num_neighbors, num_layers):
        self.g = g
        ......
        # One sampler per layer; random walks decide how important each neighbor of a node is.
        # The more often repeated walks land on a neighbor, the more important it is
        # and the higher its priority to be kept as a neighbor.
        self.samplers = [
            dgl.sampling.PinSAGESampler(g, item_type, user_type, random_walk_length,
                random_walk_restart_prob, num_random_walks, num_neighbors)
            for _ in range(num_layers)]

    def sample_blocks(self, seeds, heads=None, tails=None, neg_tails=None):
        blocks = []
        for sampler in self.samplers:
            # Importance sampling via random walks produces the intermediate frontier
            frontier = sampler(seeds)

            if heads is not None:
                # During training, remove the heads->tails and heads->neg_tails edges
                # that are to be predicted, to prevent information leakage
                eids = frontier.edge_ids(torch.cat([heads, heads]), torch.cat([tails, neg_tails]), return_uv=True)[2]
                if len(eids) > 0:
                    old_frontier = frontier
                    frontier = dgl.remove_edges(old_frontier, eids)

            # Keep only the seeds and the neighbors pointing to them, compacting the frontier into a block,
            # and set the block's input/output nodes
            block = compact_and_copy(frontier, seeds)

            # This layer's input nodes are the seeds for the next layer to be sampled
            seeds = block.srcdata[dgl.NID]
            blocks.insert(0, block)
        return blocks
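
The backward, layer-by-layer structure of this loop can be mimicked in plain Python. In the sketch below, neighbors_of stands in for the per-layer sampler and forbidden for the heads→tails and heads→neg_tails pairs; both names, the toy NEIGHBORS table, and the tuple layout of a "block" are assumptions, and node IDs are left in the original space for readability.

```python
def sample_blocks(seeds, neighbors_of, num_layers, forbidden=frozenset()):
    """Build per-layer (input_nodes, output_nodes, edges), bottom layer first."""
    blocks = []
    for _ in range(num_layers):
        # "Frontier" edges neighbor -> seed, minus the pairs to be predicted.
        edges = [(src, dst)
                 for dst in seeds
                 for src in neighbors_of(dst)
                 if (src, dst) not in forbidden]
        input_nodes = list(seeds)  # output nodes come first
        for src, _dst in edges:
            if src not in input_nodes:
                input_nodes.append(src)
        blocks.insert(0, (input_nodes, list(seeds), edges))
        seeds = input_nodes  # inputs of this layer seed the layer below
    return blocks


# Hypothetical sampled neighbors for each item.
NEIGHBORS = {0: [1, 2], 1: [0, 3], 2: [3], 3: [4], 4: []}
blocks = sample_blocks(
    seeds=[0, 1],
    neighbors_of=lambda node: NEIGHBORS.get(node, []),
    num_layers=2,
    forbidden={(0, 1)},  # head 0 -> tail 1 must not leak into message passing
)
```

As in the listing, only the head-to-tail direction is filtered, and inserting each new layer at index 0 leaves blocks[0] with the widest set of input nodes and blocks[-1] with the original seeds as output nodes.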

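NeighborSampler.sample_from_item_pairs

This function returns:

pos_graph, built from heads→tails and used to compute pos_score in the pairwise loss
neg_graph, built from heads→neg_tails and used to compute neg_score in the pairwise loss
the blocks needed by each convolution layer, obtained by calling sample_blocks on all nodes of pos_graph (which are also all nodes of neg_graph, in practice the heads+tails+neg_tails of the batch)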

class NeighborSampler(object):
    def __init__(self, g, user_type, item_type, random_walk_length, random_walk_restart_prob,
                 num_random_walks, num_neighbors, num_layers):
        ......

    def sample_blocks(self, seeds, heads=None, tails=None, neg_tails=None):
        ......

    def sample_from_item_pairs(self, heads, tails, neg_tails):
        # Build the positive graph from heads->tails.
        # num_nodes is set to the number of item nodes in the original graph.
        pos_graph = dgl.graph(
            (heads, tails),
            num_nodes=self.g.number_of_nodes(self.item_type))

        # Build the negative graph from heads->neg_tails.
        # num_nodes is set to the number of item nodes in the original graph.
        neg_graph = dgl.graph(
            (heads, neg_tails),
            num_nodes=self.g.number_of_nodes(self.item_type))

        # Remove every node other than heads, tails, neg_tails,
        # compacting the large graphs into small ones to avoid needless message passing
        pos_graph, neg_graph = dgl.compact_graphs([pos_graph, neg_graph])

        # IDs in the original graph of the nodes in the compacted graphs.
        # Note that pos_graph and neg_graph are not numbered separately:
        # both come from the same graph made of heads, tails, neg_tails.
        # They contain the same nodes, heads+tails+neg_tails, i.e. the seeds here,
        # and differ only in their edges.
        seeds = pos_graph.ndata[dgl.NID]

        blocks = self.sample_blocks(seeds, heads, tails, neg_tails)

        return pos_graph, neg_graph, blocks
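
What the joint compaction achieves can be shown with two plain edge lists that share one relabeling. The function name compact_pair_graphs is hypothetical, and the first-seen ordering of the new IDs is an assumption of this sketch; the ordering DGL actually produces may differ.

```python
def compact_pair_graphs(heads, tails, neg_tails):
    """Relabel pos/neg edge lists into one shared, dense node-ID space.

    Returns (original_ids, pos_edges, neg_edges), where original_ids[i] is
    the original item ID of local node i (the role played by ndata[dgl.NID]).
    """
    original_ids = []
    local_id = {}
    for node in list(heads) + list(tails) + list(neg_tails):
        if node not in local_id:
            local_id[node] = len(original_ids)
            original_ids.append(node)
    pos_edges = [(local_id[h], local_id[t]) for h, t in zip(heads, tails)]
    neg_edges = [(local_id[h], local_id[n]) for h, n in zip(heads, neg_tails)]
    return original_ids, pos_edges, neg_edges


seeds, pos_edges, neg_edges = compact_pair_graphs(
    heads=[10, 20], tails=[30, 10], neg_tails=[40, 30])
# seeds is [10, 20, 30, 40]; both edge lists index into this same list.
```

Because both edge lists use the same local IDs, one embedding matrix computed for the seeds can score positive and negative edges alike.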

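To Be Continued

This chapter analyzed how the example feeds data and computation graphs into the model. The next one will analyze the implementation of the PinSAGE model itself.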
