Hit song prediction, one of the emerging fields in music information retrieval (MIR), remains a considerable challenge. Understanding what makes a given song a hit would clearly benefit the whole music industry. Previous approaches to hit song prediction have focused on the audio features of a recording. This study aims to improve the prediction of top-10 hits among Billboard Hot 100 songs by using a broader set of metadata: song audio features provided by Spotify, song lyrics, and novel metadata-based features (title topic, popularity continuity and genre class). Five machine learning approaches are applied: k-nearest neighbours, Naive Bayes, Random Forest, Logistic Regression and Multilayer Perceptron. Our results show that Random Forest (RF) and Logistic Regression (LR) trained on all features (novel features, song audio features and lyrics features) outperform the other models, achieving 89.1% and 87.2% accuracy, and 0.91 and 0.93 AUC, respectively. Our findings also demonstrate the utility of our novel music metadata features, which contributed most to the models' discriminative performance.
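The two evaluation measures reported above, accuracy and AUC, can be illustrated with a minimal sketch. It assumes a binary classifier that outputs a score per song (1 = top-10 hit, 0 = other Hot 100 song); the function names, the 0.5 threshold and the toy labels and scores are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of songs whose thresholded score matches the true label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random hit outscores a random non-hit.

    Ties between a hit and a non-hit count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one hit and one non-hit")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up toy data (NOT from the study): 1 = top-10 hit, 0 = other song.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_labels, toy_scores))
    print("AUC:", roc_auc(toy_labels, toy_scores))
```

On this toy data the accuracy is 0.75 and the AUC is 0.9375. Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank hits against non-hits, which is why a model can lead on one measure and trail on the other.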