Databases covering all individuals of a population are increasingly used for research studies in domains ranging from public health to the social sciences. There is also growing interest among governments and businesses in using population data to support data-driven decision making. The massive size of such databases is often mistaken for a guarantee of valid inferences about the population of interest. However, population data have characteristics that make them challenging to use, including the various assumptions made about how such data were collected and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases, a process that adds fresh challenges. This article discusses a diverse range of misconceptions about population data that we believe anybody who works with such data needs to be aware of. Many of these misconceptions are not well documented in scientific publications and are only discussed anecdotally among researchers and practitioners. We conclude with a set of recommendations for inference when using population data.