In large datasets, it is hard to discover and analyze structure. It is therefore common to annotate the items with tags or keywords, and applications then filter the dataset based on these tags. Still, even medium-sized datasets with only a few tags result in complex systems that are hard for humans to navigate. In this work, we adopt the method of ordinal factor analysis to address this problem. An ordinal factor arranges a subset of the tags in a linear order based on their underlying structure. A complete ordinal factorization, which consists of such ordinal factors, precisely represents the original dataset. Based on such an ordinal factorization, we provide a way to discover and explain relationships between different items and attributes in the dataset. However, computing even a single ordinal factor of high cardinality is computationally demanding. We therefore propose a greedy algorithm that extracts ordinal factors using fast, existing algorithms developed in formal concept analysis. We then leverage the resulting factors to propose a comprehensive way to discover relationships in the dataset. Furthermore, we introduce a distance measure, based on the representation emerging from the ordinal factorization, to discover similar items. To evaluate the method, we conduct a case study on different datasets.
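To make the idea of an ordinal factor and a greedy factorization concrete, the following Python sketch may help. It is a simplified illustration under stated assumptions, not the algorithm proposed in this work: it does not use the formal concept analysis algorithms mentioned above, and instead grows each chain of tags one attribute at a time by a naive gain heuristic. The toy dataset, all function names, and the Manhattan-style `distance` are hypothetical choices made for this example only.

```python
from typing import Dict, List, Set, Tuple

# A dataset as a formal context: each item maps to its set of tags.
Context = Dict[str, Set[str]]
Incidence = Tuple[str, str]  # (item, tag)


def greedy_ordinal_factor(context: Context, uncovered: Set[Incidence]) -> List[str]:
    """Grow one chain of tags whose extents are nested (one ordinal factor).

    Heuristic (an assumption of this sketch): repeatedly append the tag that
    covers the most still-uncovered incidences among the items that carry
    every tag chosen so far.
    """
    attributes = sorted(set().union(*context.values()))
    extent = set(context)  # items that have all tags in the chain so far
    chain: List[str] = []
    while True:
        best, best_gain = None, 0
        for m in attributes:
            if m in chain:
                continue
            gain = sum(1 for g in extent if m in context[g] and (g, m) in uncovered)
            if gain > best_gain:
                best, best_gain = m, gain
        if best is None:  # no tag adds any new incidence
            return chain
        chain.append(best)
        extent = {g for g in extent if best in context[g]}


def covered_by(context: Context, chain: List[str]) -> Set[Incidence]:
    """Incidences represented by a chain: (g, m) where g has the whole prefix up to m."""
    pairs: Set[Incidence] = set()
    for g, tags in context.items():
        for m in chain:
            if m not in tags:
                break
            pairs.add((g, m))
    return pairs


def greedy_factorization(context: Context) -> List[List[str]]:
    """Extract chains until every incidence of the dataset is covered."""
    uncovered = {(g, m) for g, tags in context.items() for m in tags}
    factors: List[List[str]] = []
    while uncovered:
        chain = greedy_ordinal_factor(context, uncovered)
        factors.append(chain)
        uncovered -= covered_by(context, chain)
    return factors


def position(context: Context, chain: List[str], g: str) -> int:
    """How far item g reaches along the chain (length of the satisfied prefix)."""
    depth = 0
    for m in chain:
        if m not in context[g]:
            break
        depth += 1
    return depth


def distance(context: Context, factors: List[List[str]], g: str, h: str) -> int:
    """Illustrative distance: sum of position differences over all factors."""
    return sum(abs(position(context, c, g) - position(context, c, h)) for c in factors)


# Toy dataset (hypothetical): four items tagged with w, x, y, z.
context: Context = {
    "a": {"x"},
    "b": {"x", "y"},
    "c": {"x", "y", "z"},
    "d": {"w"},
}
factors = greedy_factorization(context)
print(factors)
print(distance(context, factors, "b", "c"))
```

On this toy dataset the sketch yields two factors, the chain x, y, z and the single-tag chain w. Together they cover every item-tag pair, which is what makes the factorization complete. Each item can then be described by its position along each chain, and items with nearby positions (such as b and c) come out as similar.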