Creating Synthetic Datasets for Collaborative Filtering Recommender Systems using Generative Adversarial Networks

Research and education in machine learning need diverse, representative, and open datasets with enough samples to support training, validation, and testing. The Recommender Systems area currently spans a large number of subfields in which both accuracy and beyond-accuracy quality measures are continuously improved. To feed this variety of research, it is both necessary and convenient to complement the existing datasets with synthetic ones. This paper proposes a Generative Adversarial Network (GAN)-based method that generates collaborative filtering datasets in a parameterized way: the researcher selects the desired number of users, items, and samples, as well as the stochastic variability. Regular GANs do not allow this parameterization. Instead of the traditional sparse, large, and discrete input vectors, our GAN model is fed with dense, short, and continuous embedding representations of items and users, which makes learning accurate and fast. The proposed architecture includes a DeepMF model that extracts the dense user and item embeddings, and a clustering process that converts the dense GAN-generated samples into the discrete and sparse ones needed to build each requested synthetic dataset. Results on three different source datasets show that the generated datasets have adequate distributions, and that their quality values and trends match those expected from the source datasets. The synthetic datasets and source code are available to researchers.
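
The clustering step that maps dense generated samples back to discrete users and items can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: the sample layout (user embedding, item embedding, rating), the use of plain k-means as the clustering algorithm, the averaging of ratings that land on the same (user, item) pair, and the random array that stands in for real GAN output. The point it shows is only that the number of clusters acts as the parameter fixing how many users and items the synthetic dataset contains.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of `points`."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct input rows (fancy indexing copies).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return labels


def discretize(samples, emb_dim, num_users, num_items, seed=0):
    """Map dense rows [user_emb | item_emb | rating] to (user_id, item_id, rating).

    The row layout is a hypothetical choice made for this sketch.
    """
    user_emb = samples[:, :emb_dim]
    item_emb = samples[:, emb_dim:2 * emb_dim]
    ratings = samples[:, -1]
    # The cluster counts set the number of users and items in the output.
    user_ids = kmeans(user_emb, num_users, seed=seed)
    item_ids = kmeans(item_emb, num_items, seed=seed + 1)
    # Several dense samples can collapse onto one (user, item) pair;
    # averaging their ratings is an assumption of this sketch.
    merged = {}
    for u, i, r in zip(user_ids, item_ids, ratings):
        merged.setdefault((int(u), int(i)), []).append(float(r))
    return [(u, i, float(np.mean(rs))) for (u, i), rs in sorted(merged.items())]


# Stand-in for GAN output: random dense samples, NOT real generated data.
rng = np.random.default_rng(42)
emb_dim, num_samples = 5, 2000
dense = np.hstack([
    rng.normal(size=(num_samples, 2 * emb_dim)),
    rng.uniform(1.0, 5.0, size=(num_samples, 1)),
])
dataset = discretize(dense, emb_dim, num_users=50, num_items=30)
```

Raising `num_users` or `num_items` here yields a larger and sparser rating list from the same pool of dense samples, which is the kind of control over dataset size that the abstract describes.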

