We explore and formalize the task of synthesizing programs over noisy data, i.e., data that may contain corrupted input-output examples. By formalizing the concept of a Noise Source, an Input Source, and a prior distribution over programs, we formalize the probabilistic process which constructs a noisy dataset. This formalism allows us to define the correctness of a synthesis algorithm, in terms of its ability to synthesize the hidden underlying program. The probability of a synthesis algorithm being correct depends upon the match between the Noise Source and the Loss Function used in the synthesis algorithm's optimization process. We formalize the concept of an optimal Loss Function given prior information about the Noise Source. We provide a technique to design optimal Loss Functions given perfect and imperfect information about the Noise Sources. We also formalize the concept and conditions required for convergence, i.e., conditions under which the probability that the synthesis algorithm produces a correct program increases as the size of the noisy data set increases. This paper presents the first formalization of the concept of optimal Loss Functions, the first closed form definition of optimal Loss Functions, and the first conditions that ensure that a noisy synthesis algorithm will have convergence guarantees.
翻译:我们探索并正式确定对噪音数据(即可能含有腐败输入输出实例的数据)进行程序合成的任务。我们通过正式确定噪音源的概念、输入源的概念和对程序先前的分布,正式确定构建噪音数据集的概率过程。这种形式主义使我们能够从合成隐藏的原始程序的能力的角度来界定合成算法的正确性。合成算法的正确性取决于合成算法在合成算法优化过程中所使用的噪音源和损失函数的匹配性。我们正式确定预先提供关于噪音源的信息的最佳损失函数概念。我们提供了一种设计最佳损失函数的技术,以提供关于噪音源的完整和不完善的信息。我们还正式确定了趋同所需的概念和条件,即综合算法产生正确程序的可能性随着噪音数据集的大小的增加而增加。本文介绍了最佳损失函数概念的首次正式化、最佳损失函数的第一个封闭形式定义,以及确保压缩综合算法具有趋同保证的第一个条件。