带有保障的噪音数据综合方案 (Program Synthesis Over Noisy Data with Guarantees)

We explore and formalize the task of synthesizing programs over noisy data, i.e., data that may contain corrupted input-output examples. By formalizing the concept of a Noise Source, an Input Source, and a prior distribution over programs, we formalize the probabilistic process which constructs a noisy dataset. This formalism allows us to define the correctness of a synthesis algorithm, in terms of its ability to synthesize the hidden underlying program. The probability of a synthesis algorithm being correct depends upon the match between the Noise Source and the Loss Function used in the synthesis algorithm's optimization process. We formalize the concept of an optimal Loss Function given prior information about the Noise Source. We provide a technique to design optimal Loss Functions given perfect and imperfect information about the Noise Sources. We also formalize the concept and conditions required for convergence, i.e., conditions under which the probability that the synthesis algorithm produces a correct program increases as the size of the noisy data set increases. This paper presents the first formalization of the concept of optimal Loss Functions, the first closed form definition of optimal Loss Functions, and the first conditions that ensure that a noisy synthesis algorithm will have convergence guarantees.

翻译：我们探索并正式确定对噪音数据(即可能含有腐败输入输出实例的数据)进行程序合成的任务。我们通过正式确定噪音源的概念、输入源的概念和对程序先前的分布,正式确定构建噪音数据集的概率过程。这种形式主义使我们能够从合成隐藏的原始程序的能力的角度来界定合成算法的正确性。合成算法的正确性取决于合成算法在合成算法优化过程中所使用的噪音源和损失函数的匹配性。我们正式确定预先提供关于噪音源的信息的最佳损失函数概念。我们提供了一种设计最佳损失函数的技术,以提供关于噪音源的完整和不完善的信息。我们还正式确定了趋同所需的概念和条件,即综合算法产生正确程序的可能性随着噪音数据集的大小的增加而增加。本文介绍了最佳损失函数概念的首次正式化、最佳损失函数的第一个封闭形式定义,以及确保压缩综合算法具有趋同保证的第一个条件。

相关内容

损失函数（机器学习）

关注 10

损失函数，在AI中亦称呼距离函数，度量函数。此处的距离代表的是抽象性的，代表真实数据与预测数据之间的误差。损失函数（loss function）是用来估量你模型的预测值f(x)与真实值Y的不一致程度，它是一个非负实值函数,通常使用L(Y, f(x))来表示，损失函数越小，模型的鲁棒性就越好。损失函数是经验风险函数的核心部分，也是结构风险函数重要组成部分。

【SIGIR2020】高效查询自动补全，Efficient and Effective Query Auto-Completion

专知会员服务

10+阅读 · 2020年5月14日