Database Configuration, Workload Parametrization, and Cloud Configuration Tuning: A Technical Analysis for Optimizing Distributed Database Performance
Database Management Systems (DBMSs) are central to efficient data management for businesses, and demands on them keep rising with evolving business requirements, growing data complexity, and the proliferation of applications. For years, relational databases queried through Structured Query Language (SQL) have been the conventional choice for data modeling. However, with the rapid growth of both structured and unstructured data, commonly referred to as Big Data, SQL-based databases become less efficient as databases grow larger, owing to their normalized data model and strict support for ACID properties. NoSQL (Not Only SQL) databases address these limitations by offering horizontal scalability across distributed clusters of cost-effective servers. Achieving optimal performance in a modern DBMS nevertheless remains challenging because of the large number of configurable parameters, including hardware and software setup and physical and logical database design. These parameters significantly affect performance metrics such as throughput and latency.
In this project, we present a detailed analysis of automatic parameter tuning in DBMSs, emphasizing the challenges posed by a large and intricate parameter space, interdependent parameters, and varying workloads. The analysis covers different cloud configurations together with workload and database parameters. The study focuses on optimizing cloud resource configurations and distributed database parameters according to application requirements and attributes such as performance, availability, and durability. The evaluation is conducted on geo-distributed, fragmented hybrid clouds, and a modular framework is developed to automate the experiments. We consider different cluster sizes while keeping the total amount of CPU, RAM, and disk the same; four databases (MongoDB, Cassandra, Redis, and MySQL); and three workloads (write-intensive, read-intensive, and scan-heavy). The study analyzes how cloud configurations, workload characteristics, and database parameters affect throughput. By addressing the limitations of existing research, it aims to provide an in-depth analysis of how workload and database parametrization influence the performance of distributed databases in varied cloud environments.
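The experimental design described above (varying cluster size under a fixed total resource budget, crossed with four databases and three workloads) can be illustrated with a minimal sketch. The resource totals, the cluster sizes, and the names `Experiment` and `build_matrix` are hypothetical, chosen only for illustration; they are not the values or the framework used in the project.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical total resource budget, held constant across cluster sizes.
TOTAL_CPUS = 16
TOTAL_RAM_GB = 64
TOTAL_DISK_GB = 512

# Databases and workload types named in the project description.
DATABASES = ["MongoDB", "Cassandra", "Redis", "MySQL"]
WORKLOADS = ["write-intensive", "read-intensive", "scan-heavy"]

# Hypothetical cluster sizes; each must divide the totals evenly.
CLUSTER_SIZES = [1, 2, 4, 8]


@dataclass(frozen=True)
class Experiment:
    database: str
    workload: str
    nodes: int
    cpus_per_node: int
    ram_gb_per_node: int
    disk_gb_per_node: int


def build_matrix(cluster_sizes=CLUSTER_SIZES):
    """Enumerate every (database, workload, cluster size) combination,
    splitting the fixed resource budget evenly across the nodes."""
    experiments = []
    for db, workload, nodes in product(DATABASES, WORKLOADS, cluster_sizes):
        totals = (TOTAL_CPUS, TOTAL_RAM_GB, TOTAL_DISK_GB)
        if any(total % nodes for total in totals):
            raise ValueError(f"{nodes} nodes do not divide the budget evenly")
        experiments.append(
            Experiment(
                database=db,
                workload=workload,
                nodes=nodes,
                cpus_per_node=TOTAL_CPUS // nodes,
                ram_gb_per_node=TOTAL_RAM_GB // nodes,
                disk_gb_per_node=TOTAL_DISK_GB // nodes,
            )
        )
    return experiments


matrix = build_matrix()
```

Each entry in `matrix` describes one run; an automation framework would iterate over such a list, provision the cluster, apply the workload, and record throughput.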
Project Members
- Shagun Dhingra
- Victor Prokhorenko
- Trung Ky Moc
- Limeng Zhang