Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that work iteratively starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix that must be provided by either domain experts or heuristics. Can we instead directly estimate the correct compatibilities from a sparsely labeled graph in a principled and scalable way? We answer this question affirmatively and suggest a method called distant compatibility estimation that works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We define algebraic amplification as the more general idea of leveraging algebraic properties of an algorithm's update equations to amplify sparse signals. We show that our estimator is by orders of magnitude faster than an alternative approach and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap preprocessing step for any existing label propagation method and removes the current dependence on heuristics.
翻译:结点分类是图表数据管理中的一个重要问题。 它通常由各种标签传播方法解决,这些标签传播方法从几个标签种子节点开始迭接地发挥作用。 对于类别间任意相容的图表,这些方法关键取决于是否了解域专家或累进专家必须提供的兼容性矩阵。 我们能否直接用原则和可缩放的方式从一个少贴标签的图表直接估计正确的兼容性? 我们肯定地回答这个问题,并建议一种叫做遥远兼容性估计的方法,甚至从极稀疏的标记图表(例如,10,000个节点中的1分被贴上标签)开始,在以后给剩余节点贴标签所花的时间的一小部分时间里起作用。 对于这些图表,我们的方法首先创造多因数化的图形表达方式(与图形的大小无关),然后对这些小的图形草图进行估计。 我们定义代数放大是更一般的关于利用算法的增缩缩略式方程式的等式来放大稀少信号的想法。 我们显示,我们的估测算器是按数量级顺序排列得更快,而不是一种替代方法, 并且这种从最后至端点的标签的精确度分类方法可以比现在的升级前的基化。