Unsupervised Topic Discovery in User Comments

On social media platforms like Twitter, users regularly share their opinions and comments with software vendors and service providers. Popular software products might receive thousands of user comments per day. Research has shown that such comments contain valuable information for stakeholders, such as feature ideas, problem reports, or support inquiries. However, it is hard to manually manage and grasp a large number of user comments, which can be redundant and of varying quality. Consequently, researchers have suggested automated approaches to extract valuable comments, e.g., through problem report classifiers. However, these approaches do not aggregate semantically similar comments into specific aspects, so they cannot provide insights such as how often users reported a certain problem. We introduce an approach for automatically discovering topics composed of semantically similar user comments, based on deep bidirectional natural language processing algorithms. Stakeholders can use our approach without the need to configure critical parameters like the number of clusters. We present our approach and report on a rigorous multiple-step empirical evaluation to assess how cohesive and meaningful the resulting clusters are. Each evaluation step was peer-coded and resulted in inter-coder agreements of up to 98%, giving us high confidence in the approach. We also report a thematic analysis of the topics discovered from tweets in the telecommunication domain.
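To illustrate what grouping semantically similar comments without a preset number of clusters can look like, the following is a minimal sketch and not the approach evaluated here. It rests on two assumptions: token-count vectors stand in for the deep bidirectional language model, and a greedy similarity threshold stands in for whatever mechanism the approach uses to avoid fixing the cluster count. The names `embed`, `cosine`, `discover_topics`, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding (assumption): lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def discover_topics(comments, threshold=0.5):
    """Greedy threshold clustering; the number of topics is not given up front.

    Each comment joins the most similar existing topic if the similarity
    reaches `threshold` (a hypothetical value), otherwise it opens a new topic.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [str]}
    for comment in comments:
        vec = embed(comment)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [comment]})
        else:
            best["centroid"].update(vec)  # running sum acts as the centroid
            best["members"].append(comment)
    return [c["members"] for c in clusters]


# Made-up example comments, not taken from the evaluated dataset.
comments = [
    "my internet is down again",
    "internet down again since morning",
    "how do I change my billing address",
    "change billing address please",
]
topics = discover_topics(comments)
for members in topics:
    print(len(members), members)
```

The size of each resulting topic gives the kind of insight described above, namely how often users reported a certain problem. Note that this sketch still exposes a similarity threshold, so it only shifts the configuration burden rather than removing it.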
