Feature Cross Search via Submodular Optimization

In this paper, we study feature cross search as a fundamental primitive in feature engineering. The importance of feature cross search, especially for linear models, has long been known and is illustrated by well-known textbook examples. In this problem, the goal is to select a small subset of features and combine them into a new feature (called the crossed feature) by taking their Cartesian product, choosing the cross so that the model learned from it is accurate. In particular, we study the problem of maximizing a normalized Area Under the Curve (AUC) of the linear model trained on the crossed feature column. First, we show that it is not possible to provide an $n^{1/\log\log n}$-approximation algorithm for this problem unless the exponential time hypothesis fails. This result also rules out the possibility of solving this problem in polynomial time unless $\mathsf{P}=\mathsf{NP}$. On the positive side, under the naive Bayes assumption, we show that there exists a simple greedy $(1-1/e)$-approximation algorithm for this problem. This result is established by relating the AUC to the total variation of the commutator of two probability measures and showing that the total variation of the commutator is monotone and submodular. To show this, we relate the submodularity of this function to the positive semi-definiteness of a corresponding kernel matrix. Then, we use Bochner's theorem to prove the positive semi-definiteness by showing that its inverse Fourier transform is non-negative everywhere. Our techniques and structural results might be of independent interest.
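
As an illustration of the greedy procedure mentioned above, the following is a minimal sketch of the standard greedy algorithm for maximizing a monotone submodular set function under a cardinality constraint, which is the classical setting in which a $(1-1/e)$ guarantee holds. It is not the paper's implementation: the names `greedy_max`, `coverage`, and `covers` are hypothetical, and a toy set-coverage objective stands in for the AUC-based objective, which the abstract does not define in enough detail to reproduce.

```python
from typing import Callable, FrozenSet, Hashable, Iterable, Set


def greedy_max(
    ground_set: Iterable[Hashable],
    f: Callable[[FrozenSet[Hashable]], float],
    k: int,
) -> Set[Hashable]:
    """Pick up to k items, each time adding the item with the largest marginal gain.

    Assumes f is monotone and submodular; the (1 - 1/e) guarantee of the
    classical analysis relies on both properties.
    """
    items = list(ground_set)
    selected: Set[Hashable] = set()
    for _ in range(k):
        base = f(frozenset(selected))
        best_item, best_gain = None, 0.0
        for item in items:
            if item in selected:
                continue
            gain = f(frozenset(selected | {item})) - base
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:  # no remaining item improves the objective
            break
        selected.add(best_item)
    return selected


# Toy stand-in objective (an assumption, not the paper's AUC objective):
# each "feature" covers some elements, and f(S) counts the elements covered.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}


def coverage(subset: FrozenSet[Hashable]) -> float:
    covered = set()
    for item in subset:
        covered |= covers[item]
    return float(len(covered))


chosen = greedy_max(covers, coverage, k=2)
print(sorted(chosen), coverage(frozenset(chosen)))  # ['a', 'c'] 7.0
```

In the feature cross setting, the ground set would be the candidate features and `f` would score the cross formed by the selected subset; the sketch only shows the selection loop, not how such a score is computed.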
