Many datasets take the form of a bipartite graph in which two types of nodes are connected by relationships, such as the movies watched by a user or the tags associated with a file. Partitioning such a bipartite graph can speed up recommender systems, or reduce the index size of an information retrieval system, by identifying groups of items with similar properties. This type of graph is often processed by algorithms that use the Vector Space Model representation, in which each item is represented by a binary vector of 0s and 1s. The main problem with this representation is that it ignores relatedness between dimensions, such as synonymy between words. This article proposes a co-clustering algorithm based on item projection, which allows feature similarity to be measured. We evaluated our algorithm on a cluster retrieval task. Across various datasets, our algorithm produced well-balanced clusters of coherent items, leading to high retrieval scores on this task.
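The limitation of the binary Vector Space Model mentioned above can be illustrated with a small sketch. The toy file/tag graph, the helper names (`binary_vector`, `cosine`, `tag_jaccard`), and the use of Jaccard overlap as a feature similarity are all assumptions made for illustration; this is not the co-clustering algorithm proposed in the article.

```python
from math import sqrt

# Hypothetical toy bipartite graph: files on one side, tags on the other.
edges = {
    "file_a": {"car"},
    "file_b": {"automobile"},
    "file_c": {"car", "automobile"},
    "file_d": {"banana"},
}

# One vector dimension per tag, in a fixed (sorted) order.
tags = sorted({t for ts in edges.values() for t in ts})


def binary_vector(item):
    """Vector Space Model view: 1 if the item has the tag, else 0."""
    return [1 if t in edges[item] else 0 for t in tags]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tag_jaccard(t1, t2):
    """Illustrative feature similarity: Jaccard overlap of the item sets
    that carry each tag (an assumption, not the article's measure)."""
    items1 = {i for i, ts in edges.items() if t1 in ts}
    items2 = {i for i, ts in edges.items() if t2 in ts}
    union = items1 | items2
    return len(items1 & items2) / len(union) if union else 0.0


# Items tagged with synonyms share no dimension, so they look unrelated.
sim_items = cosine(binary_vector("file_a"), binary_vector("file_b"))

# Looking at which items the two tags share reveals that they are related.
sim_tags = tag_jaccard("car", "automobile")

print(sim_items, sim_tags)
```

In this toy example, `file_a` and `file_b` have orthogonal binary vectors even though their tags are synonyms, while comparing the tags through the items they share gives a non-zero similarity. That is the kind of dimension relatedness the plain binary representation does not capture.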