With the growing use of popular social media services such as Facebook and Twitter, it is challenging to collect all content from these networks without access to the core infrastructure or paying for it. Thus, if not all content can be collected, one must consider which data are most important. In this work we present a novel User-guided Social Media Crawling method (USMC) that collects data from social media, utilizing the wisdom of the crowd to decide the order in which user-generated content should be collected so as to cover as many user interactions as possible. USMC is validated by crawling 160 public Facebook pages, containing content from 368 million users and 1.3 billion interactions, and it is compared with two other crawling methods. The results show that it is possible to cover approximately 75% of the interactions on a Facebook page by sampling just 20% of its posts, while reducing the crawling time by 53%. In addition, the social network constructed from the 20% sample contains more than 75% of the users and edges of the social network created from all posts, and it has a similar degree distribution.
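The abstract does not spell out how USMC ranks posts, so the following is only a minimal sketch of the general idea of crowd-guided ordering, not the authors' algorithm. It assumes that a cheap popularity signal per post (for example, a count visible in a page's post listing) is available before the full crawl; the names `crawl_order`, `coverage`, `signal`, and `interactions`, as well as all the numbers, are hypothetical toy values chosen for illustration.

```python
from typing import Dict, List


def crawl_order(signal: Dict[str, int]) -> List[str]:
    """Order post ids by a popularity signal, highest first.

    Ties are broken by post id so the order is deterministic.
    """
    return sorted(signal, key=lambda pid: (-signal[pid], pid))


def coverage(order: List[str], interactions: Dict[str, int], fraction: float) -> float:
    """Share of all interactions covered by crawling the first `fraction` of posts."""
    k = max(1, round(len(order) * fraction))  # crawl at least one post
    total = sum(interactions.values())
    covered = sum(interactions[pid] for pid in order[:k])
    return covered / total if total else 0.0


# Toy data (made up): full interaction counts for ten posts on one page.
interactions = {
    "p1": 400, "p2": 350, "p3": 50, "p4": 40, "p5": 40,
    "p6": 30, "p7": 30, "p8": 25, "p9": 20, "p10": 15,
}
# Assumption: the cheap signal happens to equal the true counts here;
# in practice it would only approximate them.
signal = dict(interactions)

order = crawl_order(signal)
print(order[:2])                            # the 20% of posts crawled first
print(coverage(order, interactions, 0.2))   # share of interactions covered
```

On this skewed toy distribution, crawling the two highest-ranked posts out of ten covers three quarters of the interactions. The figures are constructed to mirror the shape of the reported result and say nothing about the real data set.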