In light of the COVID-19 pandemic, finding an optimal way to dynamically prescribe policies that balance governmental resources against epidemic control in different countries and regions is both an open challenge and a critical practical problem. To address this multi-dimensional trade-off between exploitation and exploration, we formulate the task as a contextual combinatorial bandit problem that jointly optimizes a multi-criteria reward function. Given the historical daily cases in a region and the intervention plans previously in place, the agent should generate useful intervention plans that policy makers can implement in real time to minimize both the number of daily COVID-19 cases and the stringency of the recommended interventions. We demonstrate this concept with simulations of multiple realistic policy-making scenarios.
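To make the formulation concrete, here is a minimal sketch of a contextual combinatorial bandit with a scalarized two-criterion reward. It is not the authors' method: it swaps in a simple tabular UCB1 learner run separately per discrete context, and everything in it is a hypothetical assumption for illustration, including the names `multi_criteria_reward`, `simulate_cases`, and `TabularContextualUCB`, the toy per-intervention effects in `EFFECTS`, the weights `cases_weight` and `stringency_weight`, and the coarse `"high"`/`"low"` context that stands in for a region's historical daily cases. The combinatorial action set is every on/off combination of K interventions, and the reward penalizes both daily cases and the fraction of interventions switched on.

```python
import itertools
import math
import random

# Hypothetical per-intervention fractional reduction in daily cases (toy values, not data).
EFFECTS = (0.5, 0.3, 0.05)


def multi_criteria_reward(daily_cases, plan, cases_weight=1.0, stringency_weight=0.3):
    """Scalarized two-criterion reward: fewer cases and a less stringent plan both score higher."""
    stringency = sum(plan) / len(plan)  # fraction of interventions switched on
    return -(cases_weight * daily_cases + stringency_weight * stringency)


def simulate_cases(context, plan, rng):
    """Toy environment: baseline cases depend on the context; each active intervention scales them down."""
    cases = 1.0 if context == "high" else 0.1
    for active, effect in zip(plan, EFFECTS):
        if active:
            cases *= 1.0 - effect
    return max(0.0, cases + rng.gauss(0.0, 0.02))  # additive observation noise, clipped at zero


class TabularContextualUCB:
    """UCB1 run independently per discrete context over all 2^K on/off intervention plans."""

    def __init__(self, num_interventions, exploration=1.0):
        # Combinatorial action set: every on/off combination of the K interventions.
        self.plans = list(itertools.product((0, 1), repeat=num_interventions))
        self.exploration = exploration  # should be scaled to the reward range
        self.counts = {}  # (context, plan) -> times the plan was chosen in that context
        self.means = {}  # (context, plan) -> running mean reward
        self.totals = {}  # context -> total rounds seen in that context

    def select(self, context):
        total = self.totals.get(context, 0)
        best_plan, best_score = None, -math.inf
        for plan in self.plans:
            n = self.counts.get((context, plan), 0)
            if n == 0:
                return plan  # try every plan once per context before trusting the estimates
            bonus = self.exploration * math.sqrt(2.0 * math.log(total) / n)
            score = self.means[(context, plan)] + bonus
            if score > best_score:
                best_plan, best_score = plan, score
        return best_plan

    def update(self, context, plan, reward):
        key = (context, plan)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        old_mean = self.means.get(key, 0.0)
        self.means[key] = old_mean + (reward - old_mean) / n  # incremental mean update
        self.totals[context] = self.totals.get(context, 0) + 1

    def greedy_plan(self, context):
        """Plan with the highest estimated mean reward, i.e. the recommendation without exploration."""
        return max(self.plans, key=lambda p: self.means.get((context, p), -math.inf))


rng = random.Random(0)
agent = TabularContextualUCB(num_interventions=len(EFFECTS), exploration=0.1)
for _ in range(4000):
    context = rng.choice(["high", "low"])  # stand-in for a region's recent case level
    plan = agent.select(context)
    reward = multi_criteria_reward(simulate_cases(context, plan, rng), plan)
    agent.update(context, plan, reward)

print(agent.greedy_plan("high"), agent.greedy_plan("low"))
```

With these toy numbers, the noise-free reward favors the first two interventions, `(1, 1, 0)`, under a high case load and no intervention, `(0, 0, 0)`, under a low one, so the learned greedy plans illustrate how the stringency penalty shifts the recommendation with the context. A realistic agent would replace the tabular contexts with features built from the case history (for example, a linear or neural contextual bandit) and replace the toy simulator with an epidemiological model.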