At the core of many important machine learning problems faced by online streaming services is the need to model how users interact with the content they are served. Unfortunately, no public datasets are currently available that enable researchers to explore this topic. To spur that research, we release the Music Streaming Sessions Dataset (MSSD), which consists of 160 million listening sessions and associated user actions. Furthermore, we provide audio features and metadata for the approximately 3.7 million unique tracks referred to in the logs. This is the largest collection of such track metadata currently available to the public. The dataset enables research on important problems, including modeling user listening and interaction behavior in streaming, Music Information Retrieval (MIR), and session-based sequential recommendation. Additionally, a subset of sessions was collected using a uniformly random recommendation setting, enabling their use for counterfactual evaluation of such sequential recommendations. Finally, we provide an analysis of user behavior and suggest further research problems that can be addressed using the dataset.