This paper addresses set-to-set matching, the problem of matching two different sets of items according to some criterion, particularly when the items are high-dimensional, such as images. Although neural networks have been applied to this problem, most machine learning-based approaches assume that the training and test data follow the same distribution, which does not always hold in real-world scenarios. To address this limitation, we introduce SHIFT15M, a dataset for evaluating set-to-set matching models when the data distribution changes between training and testing. Our benchmark experiments demonstrate that the performance of naive methods drops under distribution shift. We also provide software for handling the SHIFT15M dataset in a simple manner; the URL for the software will be made available after publication of this manuscript. We believe the proposed SHIFT15M dataset provides a valuable resource for evaluating set-to-set matching models under distribution shift.