Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale, and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation vs expression, evolution of language sounds vs word use, and country-level economic metrics vs cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a 'structural' component analogous to a clustering, and an underlying 'relationship' between those structures. This allows a 'structural comparison' between two similarity matrices using their predictability from 'structure'. Significance is assessed by resampling appropriate for each dataset. The software, CLARITY, is available as an R package from https://github.com/danjlawson/CLARITY.
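The decomposition described above can be illustrated with a minimal sketch. This is not the CLARITY package's API or its fitting procedure. It assumes that "structure" can be stood in for by the top-k eigenvectors of the first similarity matrix (a spectral simplification), and that the "relationship" for a second matrix is the least-squares fit given that fixed structure. All function and variable names are hypothetical, and the toy data are constructed only for illustration.

```python
import numpy as np

def structure_from_similarity(Y, k):
    """Return an n x k orthonormal basis A: the k eigenvectors of the
    symmetrised matrix Y with the largest absolute eigenvalues.
    (Assumption: a spectral stand-in for a learned 'structure'.)"""
    Y = (Y + Y.T) / 2
    vals, vecs = np.linalg.eigh(Y)
    order = np.argsort(np.abs(vals))[::-1][:k]
    return vecs[:, order]

def predict_from_structure(A, Y):
    """For fixed orthonormal A, the least-squares X in Y ~ A X A^T
    is X = A^T Y A. Returns the prediction and the 'relationship' X."""
    X = A.T @ Y @ A
    return A @ X @ A.T, X

def entity_residuals(A, Y):
    """Per-entity norm of the part of Y not predictable from structure A."""
    pred, _ = predict_from_structure(A, Y)
    return np.sqrt(((Y - pred) ** 2).sum(axis=1))

# Toy data: 6 entities in two groups (one-hot membership matrix Z).
Z = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
B1 = np.array([[1.0, 0.5], [0.5, 1.0]])   # group-level relationship, dataset 1
B2 = np.array([[1.0, 0.2], [0.2, 1.0]])   # different relationship, dataset 2

Y1 = Z @ B1 @ Z.T                          # similarity matrix, dataset 1
Y2_same = Z @ B2 @ Z.T                     # same structure, new relationship

# Dataset 2 variant in which entity 0 has moved to the other group.
Z_moved = Z.copy()
Z_moved[0] = [0.0, 1.0]
Y2_moved = Z_moved @ B2 @ Z_moved.T

A = structure_from_similarity(Y1, k=2)     # structure learned from dataset 1 only
res_same = entity_residuals(A, Y2_same)    # structure conserved
res_moved = entity_residuals(A, Y2_moved)  # structure broken for entity 0

print("residuals, same structure: ", np.round(res_same, 3))
print("residuals, entity 0 moved: ", np.round(res_moved, 3))
```

In this toy case the second dataset is fully predictable from the first dataset's structure when only the relationship changes, while moving one entity to another group produces residuals that are largest for that entity. Assessing whether such residuals are significant would additionally require a resampling scheme suited to each dataset, which this sketch does not attempt.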