We present a case study of feature descriptors for the analysis of multivariate ensemble data from chemistry. The data of each ensemble member consist of three parts: its design parameters, field data resulting from the numerical simulations, and physical properties of the molecules. We focus on feature-based methods because they have the potential to reduce data complexity and to facilitate comparison and clustering. However, there are many ways to design the feature vector representation, and none is obviously preferable. To better understand the different representations, we analyze their similarities and differences, focusing on three characteristics derived from the representations: the distribution of pairwise distances, the clustering tendency, and the rank order of the pairwise distances. Our results partially confirm expected behavior but also yield some surprising observations that can inform the future development of feature representations in the chemical domain.
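To make two of the three characteristics concrete, the following minimal Python sketch computes the pairwise distances of a feature representation, summarizes their distribution, and compares the rank order of the distances between two representations. It is an illustration under stated assumptions, not the method used in the study: the Euclidean metric, the use of Spearman rank correlation as the rank-order measure, the toy vectors, and all function and variable names (`pairwise_distances`, `ranks`, `spearman`, `rep_a`, `rep_b`) are hypothetical choices. Clustering tendency is not covered here.

```python
import math
from itertools import combinations
from statistics import median


def pairwise_distances(vectors):
    """Euclidean distance for every unordered pair of feature vectors."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical toy data: four ensemble members under two representations.
rep_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
rep_b = [(0.0,), (1.0,), (2.0,), (6.0,)]

dist_a = pairwise_distances(rep_a)
dist_b = pairwise_distances(rep_b)

# Characteristic 1: a coarse summary of the distance distribution.
summary_a = (min(dist_a), median(dist_a), max(dist_a))

# Characteristic 3: agreement of the distance rank order between representations.
rank_agreement = spearman(dist_a, dist_b)
```

A `rank_agreement` close to 1 would indicate that the two representations order the pairs of ensemble members similarly, even if the absolute distances differ in scale.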