NoSQL databases have become an important component of many big data and real-time web applications. Their distributed nature and scalability make them an ideal data store for a variety of use cases. Although NoSQL databases ship with a default "off-the-shelf" configuration, they expose configuration settings that adapt a database's behavior and performance to a specific use case and environment. The sheer number of these settings, and the often imperceptible interdependencies among them, make it difficult to optimize and performance-tune a NoSQL system. There is no one-size-fits-all configuration, so the workload, the physical design, and the available resources must all be taken into account when optimizing the configuration of a NoSQL database. This work explores Machine Learning as a means to automatically tune a NoSQL database for optimal performance. Using the Random Forest and Gradient Boosting Decision Tree algorithms, multiple Machine Learning models were fitted to a training dataset that incorporates properties of the NoSQL physical configuration (replication and sharding). The best models were then employed as surrogate models to optimize the Database Management System's (DBMS) configuration settings for throughput and latency using a black-box optimization algorithm. Multiple experiments on an Apache Cassandra database demonstrate the feasibility of this approach, even across varying physical configurations. Compared to the default configuration settings, the tuned DBMS configurations yielded throughput improvements of up to 4%, read latency reductions of up to 43%, and write latency reductions of up to 39%.
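The surrogate-based tuning loop described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: a k-nearest-neighbor regressor stands in for the Random Forest and Gradient Boosting models, plain random search stands in for the black-box optimizer, and the knob names, ranges, and the `synthetic_throughput` function are hypothetical placeholders for real benchmark measurements.

```python
import random

# Hypothetical knob ranges; names and bounds are illustrative assumptions.
KNOBS = {
    "concurrent_reads": (16, 128),
    "concurrent_writes": (16, 128),
}


def synthetic_throughput(cfg, replication_factor):
    """Made-up stand-in for a real benchmark run against the database."""
    reads, writes = cfg
    return (1000.0 - 0.1 * (reads - 64) ** 2
            - 0.05 * (writes - 96) ** 2 - 50.0 * replication_factor)


def sample_config(rng):
    """Draw one random configuration within the knob ranges."""
    return tuple(rng.randint(lo, hi) for lo, hi in KNOBS.values())


def normalize(cfg, replication_factor):
    """Scale knobs to [0, 1] and append the physical-design feature."""
    scaled = [(v - lo) / (hi - lo) for v, (lo, hi) in zip(cfg, KNOBS.values())]
    return scaled + [replication_factor / 3.0]


def build_training_set(rng, n=300):
    """Collect (features, throughput) rows across physical configurations."""
    rows = []
    for _ in range(n):
        cfg = sample_config(rng)
        rf = rng.choice([1, 3])
        rows.append((normalize(cfg, rf), synthetic_throughput(cfg, rf)))
    return rows


def surrogate_predict(training, features, k=5):
    """k-NN surrogate: mean target of the k closest training rows."""
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    nearest = sorted(training, key=dist)[:k]
    return sum(y for _, y in nearest) / len(nearest)


def tune(training, replication_factor, rng, trials=200):
    """Random search over the surrogate with the physical design held fixed."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = sample_config(rng)
        score = surrogate_predict(training, normalize(cfg, replication_factor))
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


rng = random.Random(0)
training = build_training_set(rng)
default_cfg = (32, 32)  # hypothetical "off-the-shelf" settings
best_cfg, predicted = tune(training, replication_factor=3, rng=rng)
print("suggested config:", dict(zip(KNOBS, best_cfg)))
print("predicted throughput:", round(predicted, 1))
```

The key design point is that the optimizer only ever queries the surrogate, never the database itself, so each candidate configuration costs a model prediction rather than a full benchmark run. Including the replication factor as an input feature is what lets one model serve several physical configurations.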