Links
This article explains Hudi's advanced indexing features, focusing on record and secondary indexes for efficient query processing. It also covers expression indexes for transformed queries and the async indexing process that allows background index building without disrupting operations.
This article explains how Datadog's Database Monitoring now supports automatic collection of PostgreSQL's EXPLAIN ANALYZE plans. It helps users identify performance issues in queries by correlating execution details with application performance monitoring (APM) data. The tool also visualizes data to simplify the analysis of slow queries.
Apache Iceberg's statistics play a crucial role in optimizing query performance by enabling data skipping and efficient query planning. The article details the different types of statistics, including data-level and metadata-level stats, their functionalities, and how they can be configured to enhance performance in large-scale analytics environments. Understanding these statistics allows users to better tune their systems as workloads evolve.
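As a rough illustration of the data-skipping idea in the summary above, the sketch below prunes files using per-file minimum and maximum values for one column. It is not Iceberg's API: `FileStats`, `files_to_scan`, and the file names are hypothetical, and real Iceberg manifests carry more metadata than this.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    # Hypothetical stand-in for the per-file column bounds kept in table metadata.
    path: str
    lower: int  # minimum value of the column in this file
    upper: int  # maximum value of the column in this file


def files_to_scan(files, lo, hi):
    """Return paths of files whose [lower, upper] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.upper >= lo and f.lower <= hi]


files = [
    FileStats("a.parquet", 1, 100),
    FileStats("b.parquet", 101, 200),
    FileStats("c.parquet", 150, 300),
]

# A predicate such as "col BETWEEN 120 AND 160" cannot match any row in a.parquet,
# so that file is skipped without being opened.
selected = files_to_scan(files, 120, 160)
print(selected)
```

The planner never reads the skipped file's data; the decision is made from metadata alone, which is why accurate and well-configured statistics matter.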
CedarDB, a new Postgres-compatible database developed from research at the Technical University of Munich, showcases impressive capabilities in query decorrelation. The author shares insights from testing CedarDB's handling of complex SQL queries, noting both strengths in its query planner and some early-stage issues. Overall, there is optimism about CedarDB's future as it continues to evolve.
The article discusses techniques for enhancing query performance in PostgreSQL by manipulating its statistics tables. It explains how to use these statistics effectively to optimize query planning and execution, ultimately leading to faster data retrieval. Practical examples and insights into the PostgreSQL system are provided to illustrate these methods.
Open-source (vanilla) Apache Spark is a versatile distributed query engine that handles a wide range of workloads, but for OLAP tasks it is generally slower than pure vectorized engines such as Trino or Snowflake because of its hybrid processing model. That model gives Spark flexibility with semi-structured data and complex queries, but it lacks optimizations specific to columnar data formats. The article also discusses extensions and solutions that could turn Spark into a more vectorized engine.
External indexes, metadata stores, catalogs, and caches can significantly enhance query performance on Apache Parquet by allowing efficient data retrieval without the need for extensive reparsing. The blog discusses how to implement these components using Apache DataFusion to optimize custom data platforms for specific use cases. It also highlights the advantages of Parquet's hierarchical data organization and its compatibility with various indexing strategies.
The article discusses the structural differences between various query operators, specifically focusing on index nested loops joins and hash joins. It emphasizes the importance of understanding these operators' internal structures during query planning to optimize execution, highlighting how this knowledge can lead to more efficient query performance. The piece also touches on the implications of treating operators as black boxes versus recognizing their specific functionalities.
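To make the contrast in the summary above concrete, here is a minimal sketch of the two join operators over lists of dictionaries. It is an illustration under simplifying assumptions, not the article's own code: the function names and sample rows are hypothetical, and a plain `dict` stands in for a persistent index on the inner table.

```python
from collections import defaultdict


def hash_join(left, right, key):
    # Build phase: the operator constructs its own lookup table at query time.
    table = defaultdict(list)
    for r in right:
        table[r[key]].append(r)
    # Probe phase: look up each left row in the freshly built table.
    return [{**l, **r} for l in left for r in table.get(l[key], [])]


def index_nested_loops_join(left, index, key):
    # No build phase: 'index' already exists (here a dict mapping key -> rows),
    # so the operator only performs one lookup per outer row.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


orders = [{"cid": 1, "item": "x"}, {"cid": 2, "item": "y"}, {"cid": 3, "item": "z"}]
customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]

# Hypothetical pre-existing index on customers.cid.
customer_index = {1: [customers[0]], 2: [customers[1]]}

via_hash = hash_join(orders, customers, "cid")
via_index = index_nested_loops_join(orders, customer_index, "cid")
```

Both operators share the same probe loop; they differ in where the lookup structure comes from. Seeing that shared structure, rather than treating each operator as a black box, is the kind of insight the summary refers to.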
The article examines how StarRocks implements Iceberg's Merge-on-Read (MoR), focusing on how it applies deletes through positional and equality delete files. It covers query planning, the role of queue structures in processing, and the handling of schema evolution, and it describes the technical challenges the author encountered while exploring the system's codebase.
In SQL query optimization, the DBMS's query optimizer compares candidate execution plans by estimated cost and picks the cheapest one. The Plan Explorer tool, implemented for PostgreSQL, visualizes these plans and generates diagrams that expose the optimizer's decisions. It can run in both standalone and server modes, enabling deeper analysis of query execution and costs.