Commonly used latent space embedding techniques, such as Principal Component Analysis, Factor Analysis, and manifold learning, are typically used to learn effective representations of homogeneous data. However, they do not readily extend to heterogeneous data that combine numerical and categorical variables, e.g., data arising from linked GPS and text data. In this paper, we learn probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion. The learned generative model provides unified latent representations that capture the factors common to the multiple dimensions of the data, and thus enable fusing multimodal data for various machine learning tasks. Following a Bayesian approach, we propose a general framework that combines disparate data types through the natural parameterization of the exponential family of distributions. To scale model inference to millions of instances with thousands of features, we use the Laplace-Bernstein approximation for posterior computations involving nonlinear link functions. The proposed algorithm is presented in detail for the commonly encountered case of heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features. Experiments on two high-dimensional heterogeneous datasets (NYC Taxi and MovieLens-10M) demonstrate the scalability and competitive performance of the proposed algorithm on machine learning tasks such as anomaly detection, data imputation, and recommender systems.
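To make the idea of combining data types through natural parameters concrete, the following is a minimal sketch of a generative process of the kind the abstract describes, not the authors' algorithm. It assumes a standard normal prior on the latent vector, linear maps from the latent vector to the natural parameters, unit-variance Gaussian features (so the natural parameter equals the mean), and a single categorical feature with a softmax link. The names `sample_heterogeneous`, `W_real`, and `W_cat` are hypothetical, and posterior inference (including the Laplace-Bernstein approximation) is not shown.

```python
import numpy as np


def softmax(eta):
    """Map multinomial natural parameters (logits) to class probabilities."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = eta - eta.max(axis=-1, keepdims=True)
    expd = np.exp(shifted)
    return expd / expd.sum(axis=-1, keepdims=True)


def sample_heterogeneous(n, latent_dim=3, num_real=4, num_classes=5, seed=0):
    """Draw n rows of mixed Gaussian/categorical data from shared latent factors.

    Illustrative assumptions (not taken from the paper): standard normal
    latent prior, linear natural parameters, unit-variance Gaussian noise.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical loading matrices, one per data type.
    W_real = rng.normal(size=(latent_dim, num_real))
    W_cat = rng.normal(size=(latent_dim, num_classes))

    # Shared latent representation for every instance.
    z = rng.normal(size=(n, latent_dim))

    # Natural parameters of each exponential-family likelihood.
    eta_real = z @ W_real  # Gaussian with unit variance: natural parameter = mean
    eta_cat = z @ W_cat    # multinomial: natural parameters = logits

    # Real-valued features: mean plus unit-variance Gaussian noise.
    x_real = eta_real + rng.normal(size=eta_real.shape)

    # Categorical feature: one class label per instance.
    probs = softmax(eta_cat)
    x_cat = np.array([rng.choice(num_classes, p=p) for p in probs])

    return z, x_real, x_cat
```

In this sketch both feature types depend on the same `z`, which is what lets a single latent vector act as a fused representation of the numerical and categorical parts of a record.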