Fault in your stars: An Analysis of Android App Reviews

Mobile app distribution platforms such as the Google Play Store allow users to share feedback about downloaded apps in the form of a review comment and a corresponding star rating. Typically, the star rating ranges from one to five stars, with one star denoting strong dissatisfaction with the app and five stars denoting strong satisfaction. Unfortunately, for a variety of reasons, the star rating provided by a user is often inconsistent with the opinion expressed in the review. For example, consider the following review for the Facebook app on Android: "Awesome App". One would reasonably expect the rating for this review to be five stars, but the actual rating is one star. Such inconsistent ratings can deflate (or inflate) the overall average rating of an app, which can affect downloads, since users typically look at the average star rating when deciding whether to download an app. In addition, app developers receive biased feedback about the application that does not represent ground reality. This is especially significant for small apps with a few thousand downloads, as even a small number of mismatched reviews can bring down the average rating drastically. In this paper, we conducted a study of this review-rating mismatch problem. We manually examined 8600 reviews from 10 popular Android apps and found that 20% of the ratings in our dataset were inconsistent with the review. Further, we developed three systems, two based on traditional machine learning and one on deep learning, to automatically identify reviews whose rating did not match the opinion expressed in the review. Our deep learning system performed the best, with an accuracy of 92% in identifying the correct star rating to be associated with a given review.
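The effect of mismatched ratings on an app's average can be illustrated with a small worked example. The sketch below is not the system described in the paper: the data are invented, the function names are hypothetical, and the "expected" star value for each review is assumed to come from some external source such as a manual annotation or a classifier's prediction.

```python
# Illustrative sketch only: the data are hypothetical, not taken from the study.
# Each review is a pair (given_stars, expected_stars), where expected_stars is
# ASSUMED to come from a manual label or a classifier's prediction.

def average_rating(stars):
    """Return the arithmetic mean of a list of star ratings."""
    return sum(stars) / len(stars)

def flag_mismatches(reviews, tolerance=0):
    """Return indices of reviews whose given rating differs from the
    expected rating by more than `tolerance` stars."""
    return [
        i for i, (given, expected) in enumerate(reviews)
        if abs(given - expected) > tolerance
    ]

# Hypothetical small app: 100 reviews that all express a five-star opinion,
# but 20 of them were submitted with a one-star rating.
reviews = [(5, 5)] * 80 + [(1, 5)] * 20

observed = average_rating([given for given, _ in reviews])
corrected = average_rating([expected for _, expected in reviews])
mismatched = flag_mismatches(reviews)

print(f"observed average:  {observed:.2f}")
print(f"corrected average: {corrected:.2f}")
print(f"mismatched share:  {len(mismatched) / len(reviews):.0%}")
```

In this made-up scenario, the mismatched ratings alone pull the displayed average from 5.0 down to 4.2, even though every review text is positive.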
