Social media studies often collect data retrospectively to analyze public opinion. Social media data may decay over time and such decay may prevent the collection of the complete dataset. As a result, the collected dataset may differ from the complete dataset and the study may suffer from data persistence bias. Past research suggests that the datasets collected retrospectively are largely representative of the original dataset in terms of textual content. However, no study analyzed the impact of data persistence bias on social media studies such as those focusing on controversial topics. In this study, we analyze the data persistence and the bias it introduces on the datasets of three types: controversial topics, trending topics, and framing of issues. We report which topics are more likely to suffer from data persistence among these datasets. We quantify the data persistence bias using the change in political orientation, the presence of potentially harmful content and topics as measures. We found that controversial datasets are more likely to suffer from data persistence and they lean towards the political left upon recollection. The turnout of the data that contain potentially harmful content is significantly lower on non-controversial datasets. Overall, we found that the topics promoted by right-aligned users are more likely to suffer from data persistence. Account suspensions are the primary factor contributing to data removals, if not the only one. Our results emphasize the importance of accounting for the data persistence bias by collecting the data in real time when the dataset employed is vulnerable to data persistence bias.
翻译:社会媒体研究经常收集回溯性的数据,以分析公众舆论。社会媒体数据可能随着时间而衰落,这种衰变可能妨碍收集完整的数据集。结果,所收集的数据集可能不同于完整的数据集,研究可能存在数据持久性偏差。过去的研究表明,从文字内容上,追溯性收集的数据集基本上代表原始数据集。然而,没有研究分析数据持久性偏差对社会媒体研究的影响,例如那些侧重于有争议的议题的研究。在本研究中,我们分析数据持久性和它在三类数据集上提出的偏差:有争议的议题、趋势论题和问题设置。我们报告哪些专题更有可能因这些数据集的数据持久性而受到影响。我们用政治取向的变化来量化数据持久性偏差,存在潜在有害的内容和计量主题。我们发现,有争议的数据集更有可能因数据的持久性而受到影响,在重新收集时倾向于政治左倾斜斜。在非重叠数据集上,我们发现,在持续性数据方面,我们所提倡的固定性数据的重要性更大。我们所提倡的固定性数据只是用于持续性的数据,在不固定性数据中,只有一种持续性数据才可能使一个持续性数据成为不固定性数据。我们所提倡的数据。我们所提倡的固定性数据。</s>