Mobile Device Location Data (MDLD) is widely used in many fields. Yet its large-scale application is limited because data from any single vendor has either biased or insufficient spatial coverage. One way to improve coverage is to integrate data from multiple vendors into a more representative dataset. Integration requires further treatment of the multi-sourced dataset for several reasons. First, a person may carry more than one device, which can produce duplicated observations of the same data subject. Additionally, when multiple data sources are used, the same device might be captured by more than one data provider. This paper proposes a data integration methodology for multi-sourced data and investigates the feasibility of integrating data from several sources without introducing additional biases. Duplicate devices are identified by leveraging the uniqueness of each device's travel pattern. The proposed methodology is shown to be cost-effective while achieving the desired accuracy level. Our findings suggest that devices sharing the same imputed home location and the same top five most-visited locations during a month can be treated as representing the same user in the MDLD. More than 99.6% of the sample devices that share these attributes are observed at the same location simultaneously. Finally, the proposed algorithm has been successfully applied to the national-level MDLD of 2020 to produce the national passenger origin-destination data for the NextGeneration National Household Travel Survey (NextGen NHTS) program.
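The matching rule stated above (same imputed home location plus the same top five most-visited locations within a month) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the device and location identifiers, the treatment of the top five as an unordered set, the tie-breaking rule, and the choice to skip devices with fewer than five distinct locations are all assumptions made for illustration. The simultaneous-observation check used for validation is not reproduced here.

```python
from collections import Counter, defaultdict

def signature(home, visits, k=5):
    """Return (home, set of the k most-visited locations), or None.

    Assumption: the top-k locations are compared as an unordered set,
    and devices with fewer than k distinct locations get no signature.
    """
    counts = Counter(visits)
    if len(counts) < k:
        return None
    # Rank by visit count; break ties by location id for determinism.
    ranked = sorted(counts, key=lambda loc: (-counts[loc], loc))
    return home, frozenset(ranked[:k])

def find_duplicate_groups(devices, k=5):
    """Group device ids that share a signature.

    devices maps device_id -> (imputed_home, list of visited location ids)
    for one month of observations (hypothetical input layout).
    """
    groups = defaultdict(list)
    for device_id, (home, visits) in devices.items():
        sig = signature(home, visits, k)
        if sig is not None:
            groups[sig].append(device_id)
    # Only signatures shared by two or more devices indicate duplicates.
    return sorted(sorted(ids) for ids in groups.values() if len(ids) > 1)

# Hypothetical month of visits for four devices from two vendors.
devices = {
    "vendorA-001": ("H1", ["P1"] * 9 + ["P2"] * 7 + ["P3"] * 5 + ["P4"] * 4 + ["P5"] * 3 + ["P6"]),
    "vendorB-042": ("H1", ["P2"] * 8 + ["P1"] * 6 + ["P3"] * 5 + ["P5"] * 4 + ["P4"] * 2),
    "vendorA-007": ("H1", ["P1"] * 9 + ["P2"] * 7 + ["P3"] * 5 + ["P4"] * 4 + ["P9"] * 3),
    "vendorB-100": ("H2", ["P1"] * 9 + ["P2"] * 7 + ["P3"] * 5 + ["P4"] * 4 + ["P5"] * 3),
}
groups = find_duplicate_groups(devices)
```

In this toy input, only `vendorA-001` and `vendorB-042` share both the home location and the top-five set, so they form the single candidate duplicate group. Grouping by a hashable signature avoids pairwise comparison of all devices, which matters at national scale.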