Query Generation Based on Generative Adversarial Networks

Many problems in database systems, such as cardinality estimation, database testing, and optimizer tuning, require a large query load as input data. However, it is often difficult to obtain a large number of real queries from users, either because of user privacy restrictions or because the database is accessed infrequently. Query generation is one approach to solving this problem. Existing query generation methods, such as random generation and template-based generation, do not consider the relationship between the generated queries and existing queries, and may even generate semantically incorrect queries. In this paper, we propose a query generation framework based on generative adversarial networks (GAN) that generates a query load similar to a given query load. In our framework, we use a syntax parser to transform each query into a parse tree and traverse the tree to obtain the sequence of production rules corresponding to the query. The generator of the GAN takes a prior sampled from a fixed distribution as input and outputs a query sequence; the discriminator takes real queries and the fake queries produced by the generator as input, and its output provides the gradient that guides the generator's learning. In addition, we add a context-free grammar and semantic rules to the generation process, which ensures that the generated queries are syntactically and semantically correct. We evaluate our approach on a real-world dataset. The experiments show that our approach can generate new query loads with a distribution similar to that of a given query load, and that the generated queries are syntactically correct and free of semantic errors. The generated query loads are also used in a downstream task, and the results show a significant improvement in models trained with query loads expanded by our approach.
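The encoding step described above (query, then parse tree, then production-rule sequence) and the grammar-constrained generation step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy grammar `RULES`, the table and column names, and the helpers `parse`, `decode`, and `generate` are hypothetical and are not taken from the paper. A uniform `score_fn` stands in for the GAN generator, and semantic rules are not modeled.

```python
import random

# Hypothetical toy grammar for a tiny SQL subset (not the paper's grammar).
# Each rule is (left-hand side, right-hand side symbols).
RULES = [
    ("query", ["SELECT", "cols", "FROM", "table", "where"]),  # 0
    ("cols", ["*"]),                                           # 1
    ("cols", ["col"]),                                         # 2
    ("where", []),                                             # 3
    ("where", ["WHERE", "col", "op", "value"]),                # 4
    ("table", ["title"]),                                      # 5
    ("col", ["id"]),                                           # 6
    ("col", ["year"]),                                         # 7
    ("op", ["="]),                                             # 8
    ("op", [">"]),                                             # 9
    ("value", ["2000"]),                                       # 10
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def parse(symbols, tokens):
    """Backtracking parser: return the rule ids (leftmost derivation order,
    i.e. a pre-order walk of the parse tree) that derive `tokens` from
    `symbols`, or None if no derivation exists."""
    if not symbols:
        return [] if not tokens else None
    head, rest = symbols[0], symbols[1:]
    if head not in NONTERMINALS:
        # Terminal symbol: it must match the next input token.
        if tokens and tokens[0] == head:
            return parse(rest, tokens[1:])
        return None
    for rule_id, (lhs, rhs) in enumerate(RULES):
        if lhs == head:
            result = parse(rhs + rest, tokens)
            if result is not None:
                return [rule_id] + result
    return None


def decode(rule_ids):
    """Rebuild the query text from a production-rule sequence."""
    stack, tokens, pending = ["query"], [], iter(rule_ids)
    while stack:
        symbol = stack.pop()
        if symbol not in NONTERMINALS:
            tokens.append(symbol)
            continue
        lhs, rhs = RULES[next(pending)]
        if lhs != symbol:
            raise ValueError("rule does not expand the pending nonterminal")
        stack.extend(reversed(rhs))
    return " ".join(tokens)


def generate(score_fn, rng):
    """Sample a rule sequence, masking out rules the grammar forbids.
    `score_fn(prefix, rule_id)` is a stand-in for the generator network."""
    stack, rule_ids = ["query"], []
    while stack:
        symbol = stack.pop()
        if symbol not in NONTERMINALS:
            continue
        # Grammar mask: only rules for the pending nonterminal are allowed.
        allowed = [i for i, (lhs, _) in enumerate(RULES) if lhs == symbol]
        weights = [score_fn(rule_ids, i) for i in allowed]
        choice = rng.choices(allowed, weights=weights, k=1)[0]
        rule_ids.append(choice)
        stack.extend(reversed(RULES[choice][1]))
    return rule_ids


sequence = parse(["query"], "SELECT id FROM title WHERE year > 2000".split())
print(sequence, "->", decode(sequence))
sampled = generate(lambda prefix, rule_id: 1.0, random.Random(0))
print(sampled, "->", decode(sampled))
```

Because every sampling step is restricted to rules that expand the pending nonterminal, any sequence produced by `generate` decodes to a syntactically valid query under this toy grammar, whatever scores the generator assigns. Enforcing semantic rules, for example that a column belongs to the selected table, would need an additional mask that this sketch omits.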
