Understanding Concept Identification as Consistent Data Clustering Across Multiple Feature Spaces

Identifying meaningful concepts in large data sets can provide valuable insights into engineering design problems. Concept identification aims to find non-overlapping groups of design instances that are similar in the joint space of all features and that remain similar when only subsets of features are considered. These subsets usually comprise features that characterize a design with respect to one specific context, for example, constructive design parameters, performance values, or operation modes. It is desirable to evaluate the quality of design concepts by considering several of these feature subsets in isolation. In particular, meaningful concepts should not only identify dense, well-separated groups of data instances, but also provide non-overlapping groups of data that persist when pre-defined feature subsets are considered separately. In this work, we propose to view concept identification as a special form of clustering with a broad range of potential applications beyond engineering design. To illustrate the differences between concept identification and classical clustering algorithms, we apply a recently proposed concept identification algorithm to two synthetic data sets and show the differences in the identified solutions. In addition, we introduce mutual information as a metric to evaluate whether solutions return consistent clusters across relevant feature subsets. To support this novel understanding of concept identification, we consider a simulated data set from a decision-making problem in the energy management domain and show that the identified clusters are more interpretable with respect to relevant feature subsets than clusters found by common clustering algorithms and are thus more suitable to support a decision maker.
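
The abstract does not spell out how mutual information is applied, so the following is only a minimal sketch of one plausible reading: each feature subset yields its own cluster labels for the same design instances, and the mutual information between two such labelings indicates how consistently the groups persist across subsets. The function names, the geometric-mean normalization, and the toy labels are assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same instances."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have equal length")
    n = len(labels_a)
    if n == 0:
        raise ValueError("label sequences must not be empty")
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) )
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


def normalized_mi(labels_a, labels_b):
    """Mutual information scaled by the geometric mean of the two entropies.

    Assumed convention: return 0.0 if either labeling has a single cluster.
    """
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0.0 or h_b == 0.0:
        return 0.0
    return mutual_information(labels_a, labels_b) / sqrt(h_a * h_b)


# Hypothetical labels for four designs, clustered separately in two feature subsets.
design_labels = [0, 0, 1, 1]          # e.g. from constructive design parameters
consistent_perf = [1, 1, 0, 0]        # same grouping, different label names
inconsistent_perf = [0, 1, 0, 1]      # grouping unrelated to design_labels

score_consistent = normalized_mi(design_labels, consistent_perf)
score_inconsistent = normalized_mi(design_labels, inconsistent_perf)
```

In this toy case the consistent pair scores close to 1 and the unrelated pair scores 0, regardless of how the clusters are numbered. A high score for every pair of feature subsets would suggest that the identified groups persist across the subsets.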
