We study two-sample variable selection: identifying the variables that discriminate between the distributions of two sets of data vectors. Such variables help scientists understand the mechanisms behind discrepancies between datasets. Although domain-specific methods exist (e.g., in medical imaging, genetics, and computational social science), a general framework remains underdeveloped. We make two separate contributions. (i) We introduce a mathematical notion of the discriminating set of variables: the largest subset that contains no variable whose marginal distribution is identical across the two distributions and which is independent of the remaining variables. We prove that this set is uniquely defined and establish further properties, making it a suitable ground truth for theory and evaluation. (ii) We propose two methods for two-sample variable selection that assign a weight to each variable and optimise the weights to maximise the power of a kernel two-sample test, while enforcing sparsity to downweight redundant variables. The regularisation parameter controls the number of selected variables, so its appropriate value is unknown in practice; to select it, we develop two data-driven procedures that balance recall and precision. In synthetic experiments the methods outperform baselines, and we illustrate the approach in two applications, using datasets from water-pipe and traffic networks.
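The weighting idea in contribution (ii) can be illustrated with a minimal sketch. It is not the method of this work: it assumes a Gaussian kernel with a fixed bandwidth, uses the unbiased estimate of the squared maximum mean discrepancy (MMD) as a simplified stand-in for test power, approximates gradients by finite differences, and takes a fixed regularisation strength instead of a data-driven one. The names `gaussian_gram`, `weighted_mmd2`, and `select_variables`, and the parameters `lam`, `lr`, `steps`, `eps`, and `threshold`, are hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))


def weighted_mmd2(x, y, w, bandwidth=1.0):
    """Unbiased estimate of squared MMD after scaling variable j by w[j]."""
    xw, yw = x * w, y * w
    n, m = len(x), len(y)
    kxx = gaussian_gram(xw, xw, bandwidth)
    kyy = gaussian_gram(yw, yw, bandwidth)
    kxy = gaussian_gram(xw, yw, bandwidth)
    # Drop the diagonal terms of the within-sample matrices.
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()


def select_variables(x, y, lam=0.01, lr=0.5, steps=200, eps=1e-4, threshold=0.1):
    """Maximise weighted_mmd2(w) - lam * ||w||_1 over non-negative weights.

    Assumption: the squared MMD is a proxy objective, not the test power itself.
    """
    d = x.shape[1]
    w = np.ones(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = eps
            # Central finite difference in coordinate j.
            grad[j] = (weighted_mmd2(x, y, w + e) - weighted_mmd2(x, y, w - e)) / (2.0 * eps)
        w = w + lr * grad                    # gradient ascent on the MMD estimate
        w = np.maximum(w - lr * lam, 0.0)    # proximal step: L1 penalty, w >= 0
    # Keep variables whose weight is a non-negligible fraction of the largest one.
    selected = np.flatnonzero(w > threshold * w.max())
    return w, selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 4))
    y = rng.normal(size=(200, 4))
    y[:, 0] += 1.5  # only variable 0 differs between the two samples
    weights, chosen = select_variables(x, y)
    print("weights:", np.round(weights, 3))
    print("selected variables:", chosen)
```

In this toy setting the two samples differ only in the mean of variable 0, so that variable should receive the largest weight while the penalty drives the others towards zero. A larger `lam` selects fewer variables, which is why the choice of the regularisation parameter matters.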