Measuring the distance between data points is fundamental to many statistical techniques, such as dimension reduction and clustering algorithms. However, improvements in data collection technologies have led to a growing variety of structured data for which standard distance measures are inapplicable. In this paper, we consider the problem of measuring the distance between sequences and multisets of points lying within a metric space, motivated by the analysis of an in-play football data set. Drawing on the wider literature, including that of time series analysis and optimal transport, we discuss various distances that are available in this setting. For each distance, we state and prove theoretical properties, and propose possible extensions where these properties fail. Finally, via an example analysis of the in-play football data, we illustrate the usefulness of these distances in practice.