An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins, with unmapped and only partially overlapping features, is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features, such as nearly Gaussian and binary features, where unit changes and shifts in the variables make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations with other features. Given even modest prior information, this allows most shared features to be identified accurately. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.
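To make the fingerprinting step concrete, the sketch below pairs features across two databases by comparing each feature's vector of rank correlations with a handful of anchor features already known to be shared. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names are hypothetical, Spearman correlation stands in for the paper's association measures (its invariance to monotone transformations makes it robust to unit changes and shifts), and the Hungarian algorithm from scipy resolves the final assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import spearmanr

def fingerprint(X, anchor_idx):
    """Fingerprint every feature (column) of X by its Spearman correlations
    with a set of anchor features known to be shared across databases.
    Rank correlations are unchanged by monotone unit conversions and shifts."""
    rho, _ = spearmanr(X)                 # (p, p) rank-correlation matrix
    return rho[:, anchor_idx]             # one anchor-correlation row per feature

def match_features(X_a, X_b, anchors_a, anchors_b):
    """Pair the features of two databases whose fingerprints are closest,
    solving the assignment with the Hungarian algorithm."""
    F_a = fingerprint(X_a, anchors_a)     # anchors_a[i] and anchors_b[i]
    F_b = fingerprint(X_b, anchors_b)     # must refer to the same variable
    cost = np.linalg.norm(F_a[:, None, :] - F_b[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)
    return list(zip(row, col))            # proposed (feature in A, feature in B) pairs

# Toy check: database B holds the same correlated variables as A,
# but permuted, rescaled, and shifted.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))                                    # shared latent structure
X_a = Z @ rng.normal(size=(3, 6)) + 0.3 * rng.normal(size=(500, 6))
perm = [3, 0, 5, 1, 2, 4]
X_b = 2.5 * X_a[:, perm] + 1.0
print(match_features(X_a, X_b, anchors_a=[0, 1], anchors_b=[1, 3]))
```

Because the affine map from A to B is strictly increasing, the two databases share the same rank-correlation matrix, so the fingerprints of true pairs coincide exactly and the assignment recovers the permutation; noisier real-world transformations would degrade this gracefully rather than break it.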