When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNN), it is common to combine several small graphs into one overall graph to accelerate processing and reduce the overhead of padding. This is for example supported in the PyG library. However, the sizes of small graphs can vary substantially with respect to the number of nodes and edges, and hence the size of the combined graph can still vary considerably, especially for small batch sizes. So the costs of excessive padding and wasted compute are still incurred. This paper proposes a new approach -- tuple packing -- for generating batches that cause minimal overhead. The algorithm extends recently introduced sequence packing approaches to work on the 2D tuples of (|nodes|, |edges|). A monotone heuristic is applied to the 2D histogram of tuple values to define a priority for packing histogram bins together with the objective to reach a limit on the number of nodes as well as the number of edges. Experiments verify the effectiveness of the algorithm on multiple datasets.
翻译:当处理像图形神经网络(GNN)这样的机器学习模型中的一批图表时,通常会将几个小图集合并成一个总图,以加速处理和减少铺设的间接费用。例如,PyG库支持这一点。但是,小图的大小在节点和边缘的数量方面可能有很大差异,因此,综合图的大小可能仍然有很大差异,特别是小批量大小。因此,仍然会出现过量铺设和浪费的计算费用。本文件提议了一种新办法 -- -- 拖斗包装 -- -- 用于产生产生造成最低间接费用的批次。算法最近对2D顶部( ⁇ nodes ⁇, ⁇ sedges ⁇ )的工作采用了序列包装方法。对 2D Tuple 值的外观值应用了单体外线外观,以界定包装直方图书箱的优先顺序,同时目标是达到节点数和边缘数的限度。实验验证了多个数据集的算法的有效性。