Spreadsheet formula prediction is an important program synthesis problem with many real-world applications. Previous work typically uses input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First, spreadsheet data entries are organized as tables, so rows and columns are not necessarily independent of each other. Second, many spreadsheet tables include headers, which provide high-level descriptions of the cell data, yet previous synthesis approaches do not consider headers as part of the specification. In this work, we present the first approach for synthesizing spreadsheet formulas from tabular context, which includes both headers and semi-structured tabular data. In particular, we propose SpreadsheetCoder, a BERT-based model architecture that represents the tabular context in both row-based and column-based formats. We train our model on a large dataset of spreadsheets and demonstrate that SpreadsheetCoder achieves a top-1 prediction accuracy of 42.51%, a considerable improvement over baselines that do not employ rich tabular context. Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
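To make the idea of row-based and column-based tabular context concrete, the following is a minimal sketch of how the cells around a target cell might be gathered in each format before being passed to an encoder. It is not the SpreadsheetCoder implementation: the function names, the `window` parameter, and the toy table are all assumptions introduced here for illustration.

```python
from typing import List

def row_based_context(headers: List[str], rows: List[List[str]],
                      target_row: int, window: int = 2) -> List[List[str]]:
    """Hypothetical helper: the header row followed by the data rows
    within `window` rows of the target cell's row."""
    lo = max(0, target_row - window)
    hi = min(len(rows), target_row + window + 1)
    return [headers] + [rows[i] for i in range(lo, hi)]

def column_based_context(headers: List[str], rows: List[List[str]],
                         target_col: int, window: int = 2) -> List[List[str]]:
    """Hypothetical helper: each column within `window` columns of the
    target cell's column, as its header followed by its cell values."""
    lo = max(0, target_col - window)
    hi = min(len(headers), target_col + window + 1)
    return [[headers[c]] + [row[c] for row in rows] for c in range(lo, hi)]

# Toy table (assumed data): the "Total" column is where a formula is wanted.
headers = ["Item", "Price", "Qty", "Total"]
rows = [
    ["Pen", "2", "10", ""],
    ["Book", "12", "3", ""],
    ["Bag", "30", "1", ""],
]

row_ctx = row_based_context(headers, rows, target_row=1, window=1)
col_ctx = column_based_context(headers, rows, target_col=3, window=1)
```

In this sketch the header text travels with both views, so a model reading either one sees "Price", "Qty", and "Total" alongside the values, which is the kind of signal that input-output examples alone do not carry.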