We consider independent component analysis of binary data. While fundamental in practice, this case has been much less developed than ICA for continuous data. We start by assuming a linear mixing model in a continuous-valued latent space, followed by a binary observation model. Importantly, we assume that the sources are non-stationary; this is necessary since any non-Gaussianity would essentially be destroyed by the binarization. Interestingly, the model admits a closed-form likelihood through the cumulative distribution function of the multivariate Gaussian distribution. In stark contrast to the continuous-valued case, we prove non-identifiability of the model when there are few observed variables; our empirical results suggest identifiability when the number of observed variables is higher. We present a practical method for binary ICA that uses only pairwise marginals, which are faster to compute than the full multivariate likelihood. Experiments give insight into the requirements on the number of observed variables, segments, and latent sources that allow the model to be estimated.
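To make the setup concrete, the following is a minimal sketch of one possible instantiation of such a model, not the exact formulation of this work. It assumes zero-mean Gaussian sources whose variances change from segment to segment, a probit-style observation (thresholding the mixed signal plus unit-variance Gaussian noise at zero), and a small hypothetical mixing matrix `A`. Under these assumptions the pairwise marginal P(x_i = 1, x_j = 1) is the bivariate Gaussian CDF at the origin, which for zero mean has the closed form 1/4 + arcsin(rho) / (2 pi). All function names and parameter values are illustrative.

```python
import numpy as np

def latent_covariance(A, source_vars):
    """Covariance of y = A s + e within one segment.

    Assumes s ~ N(0, diag(source_vars)) and e ~ N(0, I); both are
    illustrative assumptions, not taken from the text above.
    """
    A = np.asarray(A, dtype=float)
    return A @ np.diag(source_vars) @ A.T + np.eye(A.shape[0])

def pairwise_both_one(cov, i, j):
    """P(x_i = 1, x_j = 1) for x = 1[y > 0] with y ~ N(0, cov).

    For a zero-mean bivariate Gaussian this orthant probability equals
    the bivariate CDF at the origin: 1/4 + arcsin(rho) / (2 pi).
    """
    rho = cov[i, j] / np.sqrt(cov[i, i] * cov[j, j])
    return 0.25 + np.arcsin(rho) / (2.0 * np.pi)

def sample_segment(A, source_vars, n, rng):
    """Draw n binary observations from one segment of the assumed model."""
    A = np.asarray(A, dtype=float)
    s = rng.normal(size=(n, A.shape[1])) * np.sqrt(source_vars)  # sources
    y = s @ A.T + rng.normal(size=(n, A.shape[0]))               # latent mix + noise
    return (y > 0).astype(int)                                   # binarization

# Hypothetical example: 3 observed variables, 2 sources, 2 segments.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.5], [0.3, 1.0], [-0.8, 0.6]])
segment_vars = [np.array([0.5, 2.0]), np.array([3.0, 0.2])]

for u, source_vars in enumerate(segment_vars):
    x = sample_segment(A, source_vars, 200_000, rng)
    empirical = np.mean(x[:, 0] & x[:, 1])
    model = pairwise_both_one(latent_covariance(A, source_vars), 0, 1)
    print(f"segment {u}: empirical={empirical:.4f}, model={model:.4f}")
```

Because the source variances differ across segments, the pairwise probabilities differ as well; it is this variation across segments that a pairwise-marginal method can exploit, since within a single segment the binarized data carry only correlation-level information.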