The Hidden Markov Model (HMM) is one of the most widely used statistical models for sequential data analysis and has been applied successfully in a wide variety of domains. One key reason for this versatility is the ability of HMMs to handle missing data. However, standard HMM learning algorithms rely crucially on the assumption that the positions of the missing observations within the observation sequence are known. For situations where this assumption does not hold, a number of special algorithms have been developed. Currently, these algorithms rely strongly on specific structural assumptions about the underlying chain, such as acyclicity, and are not applicable in the general case. In particular, there are numerous domains within medicine and computational biology where the missing observation locations are unknown and acyclicity assumptions do not hold, which presents a barrier to applying HMMs in those fields. In this paper we consider the general problem of learning HMMs from data with unknown missing observation locations (i.e., only the order of the non-missing observations is known). We introduce a generative model of the location omissions and propose two learning methods for this model: a (semi-)analytic approach and a Gibbs sampler. We evaluate and compare the algorithms in a variety of scenarios, measuring their reconstruction precision and their robustness under model misspecification.
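
To make the data setting concrete, the following is a minimal sketch of how a sequence with unknown missing observation locations might arise. It is an illustration under stated assumptions, not the generative model proposed in the paper: the abstract does not specify the omission mechanism, so the sketch assumes each observation is dropped independently with a fixed probability `p_miss`. The function names `sample_hmm` and `drop_silently` and all parameter values are hypothetical.

```python
import random


def sample_hmm(pi, A, B, T, rng):
    """Sample T (state, observation) pairs from a discrete HMM.

    pi: initial state distribution, A: transition matrix,
    B: emission matrix (rows indexed by state).
    """
    states, obs = [], []
    s = rng.choices(range(len(pi)), weights=pi)[0]
    for _ in range(T):
        states.append(s)
        obs.append(rng.choices(range(len(B[s])), weights=B[s])[0])
        s = rng.choices(range(len(A[s])), weights=A[s])[0]
    return states, obs


def drop_silently(obs, p_miss, rng):
    """Drop each observation independently with probability p_miss.

    ASSUMPTION: i.i.d. Bernoulli omissions, chosen only for illustration.
    The survivors are concatenated, so the learner sees their order but
    not their original time indices.
    """
    kept_idx = [t for t in range(len(obs)) if rng.random() >= p_miss]
    return [obs[t] for t in kept_idx], kept_idx


rng = random.Random(0)
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]            # cyclic chain: no acyclicity assumption
B = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # three possible observation symbols

states, full = sample_hmm(pi, A, B, 20, rng)
observed, kept_idx = drop_silently(full, 0.3, rng)

# A learner in this setting receives only `observed`; `kept_idx` is hidden.
print("full sequence:    ", full)
print("observed sequence:", observed)
```

In the standard missing-data setting the learner would also be given `kept_idx` and could marginalize over the gaps directly. In the setting considered here only `observed` is available, so the gap positions themselves have to be treated as latent.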