In this paper, we release data on the demographics and outliers of communities of interest. Identified from Wiki-based sources, mainly Wikidata, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force, and 345k subjects, e.g., Deborah Birx. We describe the statistical inference methodology used to mine this data. We release subject-centric and group-centric datasets in JSON format, as well as a browsing interface. Finally, we foresee three areas in which this research can have an impact: in social science research, it provides a resource for demographic analyses; in web-scale collaborative encyclopedias, it serves as an edit recommender to fill knowledge gaps; and in web search, it offers lists of salient statements about queried subjects for higher user engagement.