Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of additional independent items can help determine time order, and hence resolve underspecification. This is in sharp contrast to the common practice of limiting the analysis to a small subset of relevant items, which is followed largely due to poor scaling of existing methods. To put our theoretical insight into practice, we develop an approximate likelihood maximization method for learning continuous-time Markov chains, which can scale to hundreds of items and is orders of magnitude faster than previous methods. We demonstrate the effectiveness of our approach on synthetic and real cancer data.