For many data mining and machine learning tasks, the quality of the similarity measure is key to performance. To find a good similarity measure automatically from data, metric learning and similarity learning have been proposed and studied extensively. Metric learning learns a Mahalanobis distance, parameterized by a positive semi-definite (PSD) matrix, to measure the distances between objects, while similarity learning aims to learn a similarity function directly, without the PSD constraint, which makes it more attractive. Most existing similarity learning algorithms are online methods, since online learning is more scalable than offline learning. However, most existing online similarity learning algorithms learn a full matrix with d^2 parameters, where d is the dimension of the instances. This is clearly inefficient for high-dimensional tasks because of the high memory and computational cost. To address this issue, we introduce several Sparse Online Relative Similarity (SORS) learning algorithms, which learn a sparse model during the learning process, so that the memory and computational cost can be significantly reduced. We analyze the proposed algorithms theoretically and evaluate them on several real-world high-dimensional datasets. Encouraging empirical results demonstrate the advantages of our approach in terms of efficiency and efficacy.
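To make the idea of learning a sparse similarity matrix online more concrete, the following is a minimal sketch and not the SORS algorithms themselves, whose update rules are not given in this abstract. It assumes a bilinear similarity S(x, y) = x^T M y, a hinge loss over triplets (x, x_pos, x_neg), a plain gradient step, and soft-thresholding to produce zeros. The function names and the parameters `lr` and `shrink` are hypothetical, chosen only for illustration.

```python
import numpy as np

def similarity(M, x, y):
    """Bilinear similarity x^T M y; M need not be symmetric or PSD."""
    return float(x @ M @ y)

def soft_threshold(M, shrink):
    """Shrink every entry toward zero by `shrink`; small entries become exactly 0."""
    return np.sign(M) * np.maximum(np.abs(M) - shrink, 0.0)

def online_step(M, x, x_pos, x_neg, lr=0.1, shrink=0.01):
    """One online update on a triplet: x should be more similar to x_pos than to x_neg.

    Assumed (illustrative) rule: hinge loss with margin 1, a gradient step,
    then soft-thresholding to keep the model sparse.
    """
    loss = max(0.0, 1.0 - similarity(M, x, x_pos) + similarity(M, x, x_neg))
    if loss > 0.0:
        # The gradient of the hinge loss with respect to M is -x (x_pos - x_neg)^T.
        M = M + lr * np.outer(x, x_pos - x_neg)
        M = soft_threshold(M, shrink)
    return M, loss

# Usage: start from an all-zero model and process one triplet.
d = 3
M = np.zeros((d, d))
x = np.array([1.0, 0.0, 0.0])
x_pos = np.array([1.0, 0.0, 0.0])
x_neg = np.array([0.0, 1.0, 0.0])
M, loss = online_step(M, x, x_pos, x_neg)
```

The sketch stores M as a dense array for readability. For high-dimensional data, a sparse container would be needed to realize the memory savings the abstract describes.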