This paper presents a cross-domain analysis of reviews from four popular review datasets: Amazon, Yelp, Steam, and IMDb. The analysis is performed with Hadoop and Spark, which allow efficient and scalable processing of large datasets. By examining close to 12 million reviews from these four online platforms, we aim to uncover trends in sales and customer sentiment over the years. Our analysis covers the number of reviews and their distribution over time, as well as the relationships among review attributes such as upvotes, creation time, rating, and sentiment. By comparing reviews across domains, we hope to gain insight into the factors that drive customer satisfaction and engagement in different product categories.