Digital contact tracing apps for COVID-19, such as the one developed by Google and Apple, need to estimate the risk that a user was infected during a particular exposure in order to decide whether to notify the user to take precautions, such as entering quarantine or requesting a test. Such risk score models contain numerous parameters that must be set by the public health authority. Although expert guidance on how to set these parameters has been provided (e.g., https://github.com/lfph/gaen-risk-scoring/blob/main/risk-scoring.md), it is natural to ask whether we could do better using a data-driven approach. This can be particularly useful when the risk factors of the disease change, e.g., due to the evolution of new variants or the adoption of vaccines. In this paper, we show that machine learning methods can be used to automatically optimize the parameters of the risk score model, provided we have access to exposure and outcome data. Although such data is already being collected in an aggregated, privacy-preserving way by several health authorities, in this paper we limit ourselves to simulated data, so that we can systematically study the different factors that affect the feasibility of the approach. In particular, we show that the parameters become harder to estimate when there is more missing data (e.g., due to infections that were not recorded by the app). Nevertheless, the learning approach outperforms a strong manually designed baseline.
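To make the idea of learning risk score parameters from exposure and outcome data concrete, the following is a minimal sketch on simulated data. It is not the model or the optimizer used in the paper: the two distance buckets, the per-minute weights in `w_true`, the exponential dose-response form, and the helper names `infection_prob` and `fit_weights` are all assumptions chosen for illustration. The sketch simulates exposures, draws infection outcomes, and then recovers the weights by maximum likelihood with plain gradient descent.

```python
import numpy as np

# Hypothetical toy setup (not the paper's model): each exposure has a
# duration in minutes and a distance bucket (0 = near, 1 = far).
rng = np.random.default_rng(0)
n = 20_000
duration = rng.uniform(5.0, 30.0, size=n)
bucket = rng.integers(0, 2, size=n)

# Assumed "true" per-minute risk weights, used only to simulate outcomes.
w_true = np.array([0.06, 0.01])


def infection_prob(w, duration, bucket):
    """Risk score = duration * weight; P(infected) = 1 - exp(-score)."""
    score = duration * w[bucket]
    return 1.0 - np.exp(-score)


# Simulated binary outcomes: True if the exposure led to infection.
infected = rng.random(n) < infection_prob(w_true, duration, bucket)


def fit_weights(duration, bucket, infected, n_buckets=2, lr=0.5, steps=2000):
    """Maximum-likelihood fit by gradient descent on theta = log(w)."""
    theta = np.full(n_buckets, np.log(0.03))  # arbitrary starting guess
    y = infected.astype(float)
    for _ in range(steps):
        score = duration * np.exp(theta)[bucket]
        p = np.clip(1.0 - np.exp(-score), 1e-12, 1.0)
        # d(NLL)/d(score) for a Bernoulli outcome with p = 1 - exp(-score)
        dscore = (1.0 - y) - y * np.exp(-score) / p
        # Chain rule: d(score)/d(theta_k) = score for exposures in bucket k
        grad = np.bincount(bucket, weights=dscore * score,
                           minlength=n_buckets) / len(y)
        theta -= lr * grad
    return np.exp(theta)


w_hat = fit_weights(duration, bucket, infected)
print("true weights:     ", w_true)
print("estimated weights:", w_hat)
```

Optimizing in log space keeps the weights positive without explicit constraints. Dropping a fraction of the positive outcomes before fitting would be one simple way to explore, in this toy setting, how missing infection records affect the estimates.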