Real-world data typically contain a large number of features that are often heterogeneous in nature, relevance, and units of measure. When assessing the similarity between data points, one can build various distance measures using subsets of these features. Using as few features as possible while still retaining sufficient information about the system is crucial in many statistical learning approaches, particularly when data are sparse. We introduce a statistical test that assesses the relative information retained by two different distance measures and determines whether they are equivalent, independent, or whether one is more informative than the other. This in turn makes it possible to find the most informative distance measure in a pool of candidates. We apply the approach to find the most relevant policy variables for controlling the Covid-19 epidemic and to find compact yet informative representations of atomic structures, but its potential applications span many branches of science.
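To make the idea of comparing two distance measures concrete, the following is a minimal sketch, assuming a rank-based comparison: for each point, take its nearest neighbour under distance A and ask how highly that same neighbour ranks under distance B. The construction of the test is not spelled out in the text above, so this is an illustrative assumption rather than the authors' definition; the function names `pairwise_distances`, `neighbour_ranks`, and `rank_imbalance` are hypothetical, and the synthetic data are for demonstration only.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X (shape: n x n)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def neighbour_ranks(D):
    """ranks[i, j] = position of j among the neighbours of i (0 = i itself, 1 = nearest)."""
    D = D.copy()
    np.fill_diagonal(D, -1.0)          # force every point to rank 0 with respect to itself
    order = np.argsort(D, axis=1)      # order[i, k] = index of the k-th closest point to i
    n = D.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)
    return ranks


def rank_imbalance(D_a, D_b):
    """Assumed score: (2 / n) * mean rank under B of each point's nearest neighbour under A.

    Values close to 0 suggest that closeness under A implies closeness under B;
    values close to 1 suggest that A carries little information about B.
    """
    n = D_a.shape[0]
    D = D_a.copy()
    np.fill_diagonal(D, np.inf)        # exclude self when searching for the nearest neighbour
    nearest_in_a = np.argmin(D, axis=1)
    ranks_b = neighbour_ranks(D_b)
    return 2.0 / n * ranks_b[np.arange(n), nearest_in_a].mean()


# Synthetic illustration: two independent features, compared with the first feature alone.
rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
D_full = pairwise_distances(X)          # distance built from both features
D_first = pairwise_distances(X[:, :1])  # distance built from the first feature only

print("full -> first:", rank_imbalance(D_full, D_first))
print("first -> full:", rank_imbalance(D_first, D_full))
```

Under this assumed score, a pair of values that are both small would indicate equivalent distances, a pair that are both close to 1 would indicate independent ones, and a strongly asymmetric pair would indicate that one distance is more informative than the other. In the synthetic example, the two-feature distance is expected to predict the one-feature distance better than the reverse.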