Data Optimisation for a Deep Learning Recommender System

This paper advocates privacy-preserving requirements on the collection of user data for recommender systems. The purpose of our study is twofold. First, we ask whether restrictions on data collection hurt the test quality of RNN-based recommendations. We study how validation performance depends on the amount of training data available. We use a combination of top-K accuracy, catalog coverage and novelty for this purpose, since good recommendations for the user are not necessarily captured by a traditional accuracy metric. Second, we ask whether we can improve quality under minimal data by using secondary data sources. We propose knowledge transfer for this purpose and construct a representation that measures similarities in purchase behaviour across data. This lets us make qualified judgements about which source domain will contribute the most. Our results show that (i) test performance saturates when the training size is increased above a critical point. We also discuss the interplay between different performance metrics and properties of the data. Moreover, we demonstrate that (ii) our representation is meaningful for measuring purchase behaviour. In particular, the results show that we can leverage secondary data to improve validation performance if we select a relevant source domain according to our similarity measure.
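As an illustration of the three evaluation metrics named in the abstract, the following is a minimal sketch and not the implementation used in the study. The abstract does not define the metrics, so the definitions here are assumptions: top-K accuracy is taken as the hit rate of the held-out item within the first K recommendations, catalog coverage as the share of the catalog that appears in at least one top-K list, and novelty as the mean self-information of recommended items under training-set popularity. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def top_k_accuracy(recommended, targets, k):
    """Fraction of cases whose held-out target is among the first k recommendations."""
    hits = sum(1 for recs, target in zip(recommended, targets) if target in recs[:k])
    return hits / len(targets)


def catalog_coverage(recommended, catalog_size, k):
    """Share of the catalog that appears in at least one top-k list."""
    seen = {item for recs in recommended for item in recs[:k]}
    return len(seen) / catalog_size


def novelty(recommended, train_counts, k):
    """Mean self-information, -log2 p(item), of the recommended items.

    p(item) is the item's share of training interactions (an assumed
    definition); items that never occur in training are skipped.
    """
    total = sum(train_counts.values())
    scores = [
        -math.log2(train_counts[item] / total)
        for recs in recommended
        for item in recs[:k]
        if train_counts.get(item, 0) > 0
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: three ranked lists, one held-out target each.
recommended = [["a", "b", "c"], ["b", "a", "d"], ["c", "d", "a"]]
targets = ["b", "d", "e"]
train_counts = Counter({"a": 4, "b": 2, "c": 1, "d": 1})

accuracy = top_k_accuracy(recommended, targets, k=2)
coverage = catalog_coverage(recommended, catalog_size=5, k=2)
mean_novelty = novelty(recommended, train_counts, k=2)
```

Reporting the three values side by side makes the trade-off visible: a model that recommends only popular items can score well on top-K accuracy while scoring poorly on coverage and novelty.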
