Group-by-aggregate (GBA) queries are integral to data analysis: they group data by specific attributes and apply aggregate functions such as sum, average, and count. Database management systems (DBMSs) typically execute GBA queries using either sort-based or hash-based methods, each with its own trade-offs. Sort-based approaches are efficient for large datasets, but record comparisons make them computationally expensive, especially when the number of groups is small. Hash-based approaches are generally faster, but they require significant memory and can suffer from hash collisions when handling many groups or skewed data distributions. This paper presents a focused empirical study comparing the two approaches across varying data sizes, datasets, and group counts using Apache AsterixDB. Our findings indicate that sort-based methods excel with large datasets or when subsequent operations benefit from sorted data, whereas hash-based methods are advantageous for smaller datasets or for queries with fewer groups. These results offer practical guidance for optimizing GBA query performance.
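To make the two strategies concrete, the following is a minimal in-memory sketch of a GBA sum computed both ways. It is illustrative only and is not AsterixDB's implementation; the function names `hash_group_sum` and `sort_group_sum` and the sample `records` are hypothetical, and a real DBMS would also handle spilling to disk, which this toy version ignores.

```python
from itertools import groupby
from operator import itemgetter


def hash_group_sum(records):
    """Hash-based aggregation: a single pass that keeps one table entry per group."""
    totals = {}
    for key, value in records:
        # Memory grows with the number of distinct groups.
        totals[key] = totals.get(key, 0) + value
    return totals


def sort_group_sum(records):
    """Sort-based aggregation: sort by key, then fold each run of equal keys."""
    ordered = sorted(records, key=itemgetter(0))  # cost dominated by key comparisons
    return [
        (key, sum(value for _, value in run))
        for key, run in groupby(ordered, key=itemgetter(0))
    ]


# Hypothetical sample input: (group key, value) pairs.
records = [("b", 2), ("a", 1), ("b", 3), ("c", 4), ("a", 5)]

print(hash_group_sum(records))
print(sort_group_sum(records))
```

Both functions produce the same per-group sums. The sort-based variant additionally returns the groups in key order, which is the property that can benefit subsequent operations, while the hash-based variant avoids the sort at the cost of holding every group in memory.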