The key to the success of Question & Answer (Q&A) platforms is their users providing high-quality answers to the challenging questions posted across various topics of interest. For more than a decade, the expert finding problem has attracted much attention in information retrieval research. Motivated by gaps in expert identification across several Q&A portals, we investigate the feasibility of identifying data science experts on Reddit. Our method builds on manual coding results in which two data science experts labelled not only expert and non-expert comments but also out-of-scope comments, a novel contribution to the literature that enables the identification of additional groups of comments across web portals. We present a semi-supervised approach that combines 1,113 labelled comments with 100,226 unlabelled comments during training. The proposed model uses the activity behaviour of every user, drawing on Natural Language Processing (NLP), crowdsourced, and user feature sets. We conclude that the NLP and user feature sets contribute the most to the identification of these three classes, which suggests that the method can generalise well within the domain. Finally, we make a novel contribution by characterising different types of users on Reddit, which opens many future research directions.
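To make the semi-supervised setup concrete, the sketch below illustrates one way such a pipeline could be assembled. It is a minimal sketch, assuming scikit-learn's SelfTrainingClassifier as the semi-supervised learner and TF-IDF text features as a stand-in for the NLP feature set described above; the toy comments, labels, and threshold are hypothetical and not taken from the paper.

```python
# Minimal sketch of a semi-supervised comment classifier (expert / non-expert /
# out-of-scope). All data below are hypothetical placeholders for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Hypothetical labelled comments (in the paper: 1,113 manually coded comments).
labelled_texts = [
    "You should cross-validate and report the variance of the scores.",  # expert
    "I think pandas is only for plotting, right?",                       # non-expert
    "What a great meme, thanks for sharing!",                            # out-of-scope
]
labelled_y = np.array([0, 1, 2])  # 0 = expert, 1 = non-expert, 2 = out-of-scope

# Hypothetical unlabelled comments (in the paper: 100,226 unlabelled comments).
# scikit-learn's convention marks unlabelled samples with -1.
unlabelled_texts = [
    "Try regularising the model before adding more features.",
    "Does anyone know a good tutorial for beginners?",
]

texts = labelled_texts + unlabelled_texts
y = np.concatenate([labelled_y, -np.ones(len(unlabelled_texts), dtype=int)])

# TF-IDF features over the comment text (one possible NLP feature set).
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)

# Self-training: the base classifier is iteratively refit, adding its own
# high-confidence predictions on the unlabelled comments as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000), threshold=0.9)
model.fit(X, y)
```

In practice, the TF-IDF block would be replaced or augmented with the crowdsourced and user activity features mentioned in the abstract, concatenated into a single feature matrix before training.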