Data scientists across disciplines increasingly need exploratory analysis tools for data sets with many features. We build on graph mining approaches for exploratory analysis of high-dimensional data and introduce Sirius, a visualization package for researchers to explore feature relationships among mixed data types using mutual information and network backbone sparsification. Visualizations of feature relationships help data scientists find meaningful dependence among features, which can motivate further analysis for feature selection, feature extraction, projection, identification of proxy variables, or insight into temporal variation at the macro scale. Graph mining approaches for feature analysis exist, such as association networks of binary features or correlation networks of quantitative features, but mixed data types present a unique challenge for developing comprehensive feature networks for exploratory analysis. Using an information theoretic approach, Sirius supports heterogeneous data sets consisting of binary, continuous quantitative, and discrete categorical data types, and provides a user interface for exploring feature pairs with high mutual information scores. We leverage a backbone sparsification approach from network theory as a dimensionality reduction technique, which probabilistically trims edges according to the local network context. Sirius is an open source Python package and Django web application for exploratory visualization, which can be deployed in data analysis pipelines. The Sirius codebase and example data sets can be found at: https://github.com/compstorylab/sirius
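The pipeline described above (pairwise mutual information over mixed feature types, followed by backbone sparsification) can be illustrated with a minimal sketch. This is not the Sirius implementation or its API. All function names are hypothetical, continuous features are handled by simple equal-width binning as a stand-in for whatever estimator Sirius uses, and the disparity filter of Serrano et al. is assumed as one example of a backbone method that trims edges probabilistically from local context.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def discretize(values, bins=4):
    """Equal-width binning of a continuous feature (an illustrative assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """Plug-in mutual information, in bits, between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def feature_network(columns, continuous=()):
    """Return {(a, b): MI} for every feature pair in a dict of columns."""
    coded = {
        name: discretize(vals) if name in continuous else list(vals)
        for name, vals in columns.items()
    }
    return {
        (a, b): mutual_information(coded[a], coded[b])
        for a, b in combinations(coded, 2)
    }


def disparity_backbone(edges, alpha=0.05):
    """Keep edges that are significant for at least one endpoint.

    For a node with degree k and strength s, an edge of weight w is kept
    when (1 - w / s) ** (k - 1) < alpha (disparity filter, assumed here).
    """
    strength, degree = defaultdict(float), defaultdict(int)
    for (a, b), w in edges.items():
        if w > 0:
            for node in (a, b):
                strength[node] += w
                degree[node] += 1
    kept = {}
    for (a, b), w in edges.items():
        if w <= 0:
            continue
        for node in (a, b):
            k = degree[node]
            if k > 1 and (1 - w / strength[node]) ** (k - 1) < alpha:
                kept[(a, b)] = w
                break
    return kept


if __name__ == "__main__":
    # Toy mixed-type data: binary, categorical, and continuous columns.
    data = {
        "smoker": [0, 0, 1, 1],
        "group": ["a", "a", "b", "b"],
        "age": [20.0, 25.0, 60.0, 65.0],
    }
    network = feature_network(data, continuous=("age",))
    for pair, score in network.items():
        print(pair, round(score, 3))
```

In this sketch the mutual information scores act as edge weights of a complete feature graph, and the backbone step keeps only edges that carry a disproportionate share of a node's total weight. The bin count and `alpha` are tuning choices made for illustration, not values taken from Sirius.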