Unlike tabular data, features in network data are interconnected within a domain-specific graph. Examples of this setting include gene expression overlaid on a protein-protein interaction (PPI) network and user opinions in a social network. Network data is typically high-dimensional (a large number of nodes) and often contains outlier snapshot instances and noise. In addition, annotating instances with global labels (e.g., disease or normal) is often non-trivial and time-consuming. How can we jointly select discriminative subnetworks and representative instances for network data without supervision? We address these challenges with UISS, an unsupervised framework for joint subnetwork and instance selection in network data based on a convex self-representation objective. Given an unlabeled network dataset, UISS identifies representative instances while ignoring outliers. It outperforms state-of-the-art baselines on both discriminative subnetwork selection and representative instance selection, achieving up to 10% accuracy improvement on all real-world data sets we use for evaluation. When employed for exploratory analysis of RNA-seq network samples from multiple studies, it produces interpretable and informative summaries.
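To give a rough sense of what a convex self-representation objective can look like, the sketch below solves a generic row-sparse variant, min_C 0.5·||X − XC||_F² + λ·Σ_i ||C[i,:]||₂, by proximal gradient descent. This is not the UISS objective: the abstract does not state it, and this sketch has no graph term, no subnetwork selection, and no outlier handling. The function names, the parameter `lam`, and the choice of solver are all assumptions made for illustration.

```python
import numpy as np

def objective(X, C, lam):
    """Generic row-sparse self-representation objective (an assumption, not UISS)."""
    residual = X - X @ C
    return 0.5 * np.sum(residual ** 2) + lam * np.sum(np.linalg.norm(C, axis=1))

def row_sparse_self_representation(X, lam=1.0, n_iter=500):
    """Proximal gradient on 0.5*||X - X C||_F^2 + lam * sum_i ||C[i, :]||_2.

    X has shape (n_features, n_instances); each column is one instance.
    Returns C of shape (n_instances, n_instances). Row i of C says how much
    instance i is used to reconstruct all instances.
    """
    n = X.shape[1]
    G = X.T @ X                         # Gram matrix of the instances
    L = np.linalg.norm(G, 2)            # Lipschitz constant of the gradient
    step = 1.0 / L if L > 0 else 1.0
    C = np.zeros((n, n))
    for _ in range(n_iter):
        grad = G @ C - G                # gradient of the smooth term
        Z = C - step * grad
        # Row-wise soft thresholding: proximal operator of the L2,1 penalty.
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        C = shrink * Z
    return C

def rank_instances(C):
    """Order instances by the L2 norm of their row in C, largest first."""
    return np.argsort(-np.linalg.norm(C, axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 8))     # synthetic data: 5 features, 8 instances
    C = row_sparse_self_representation(X, lam=0.5)
    print("instances ranked by row norm:", rank_instances(C))
```

In this generic formulation, instances whose rows of `C` have large norm are the ones that contribute most to reconstructing the dataset, so they are natural candidates for representatives. Increasing `lam` drives more rows to zero.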