
Deep Dive Into Hudi's Indexing Subsystem (Part 2 of 2) | Apache Hudi

6 min read | Saved February 14, 2026

indexing, hudi, query-optimization, async-indexing, metadata

Do you care about this?

This article explains Hudi's advanced indexing features, focusing on record and secondary indexes for efficient query processing. It also covers expression indexes for transformed queries and the async indexing process that allows background index building without disrupting operations.

If you do, here's more

Hudi's indexing subsystem provides several index types, each aimed at a specific query pattern. The record and secondary indexes serve queries with equality-matching predicates, such as A = X or B IN (X, Y, Z). The record index maps each record key to the exact file location of that record, while the secondary index maps values of a non-record-key column to the record keys that carry them. Both let the query planner narrow a scan to the files that can contain matching records. For instance, a query filtering on a record key ran in 12 seconds instead of 977 seconds, a reduction of roughly 98%. The secondary index improves performance by about 45% on average and cuts the data scanned by up to 90%.
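
The two mappings described above can be pictured with a small in-memory model. This is a minimal sketch, not Hudi's storage format or API: the file names, the `id` and `city` columns, and the helper functions are all hypothetical, and plain Python dictionaries stand in for the real index structures.

```python
from collections import defaultdict

# Toy table: file name -> rows stored in that file (all names are hypothetical).
files = {
    "file-1.parquet": [{"id": "k1", "city": "sf"}, {"id": "k2", "city": "nyc"}],
    "file-2.parquet": [{"id": "k3", "city": "sf"}],
    "file-3.parquet": [{"id": "k4", "city": "la"}],
}

# Record index: record key -> file that holds the record.
record_index = {row["id"]: name for name, rows in files.items() for row in rows}

# Secondary index: value of a non-record-key column -> record keys with that value.
secondary_index = defaultdict(set)
for rows in files.values():
    for row in rows:
        secondary_index[row["city"]].add(row["id"])


def files_for_keys(keys):
    """Resolve `id IN (...)` to the set of files that must be read."""
    return {record_index[k] for k in keys if k in record_index}


def files_for_city(values):
    """Resolve `city IN (...)`: column values -> record keys -> files."""
    keys = set()
    for value in values:
        keys |= secondary_index.get(value, set())
    return files_for_keys(keys)


# Every other file can be skipped by the scan.
print(files_for_keys(["k3"]))
print(files_for_city(["sf"]))
```

In this model a lookup on the secondary column goes through the record index as a second hop, which is why the sketch keeps the two structures separate.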

Expression indexes address queries whose predicates apply a transformation to a column value, such as from_unixtime() or substring(). Hudi supports two types: column stats and bloom filter expression indexes. The column stats expression index maintains file-level statistics for the transformed values, while the bloom filter expression index uses a space-efficient probabilistic structure to test quickly whether a transformed value may be present in a file. Either way, the query planner can skip files that cannot contain the target values; the bloom filter variant is especially effective on high-cardinality columns. Both kinds are created with SQL commands, so they fit into existing SQL-based workflows.
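
To make the file-skipping idea concrete, here is a minimal sketch of both expression index flavors over a toy table. Everything in it is an assumption for illustration: `to_date` stands in for from_unixtime() with a `yyyy-MM-dd` format evaluated in UTC, `prefix` stands in for substring(), the file names and the `ts` and `rider` columns are hypothetical, and the bloom filter is a simple hand-rolled one rather than Hudi's implementation.

```python
import hashlib
from datetime import datetime, timezone


def to_date(ts):
    """Stand-in for from_unixtime(ts, 'yyyy-MM-dd'), evaluated in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def prefix(rider):
    """Stand-in for substring(rider, 1, 3)."""
    return rider[:3]


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Toy table: file name -> rows (hypothetical data).
files = {
    "file-1": [{"ts": 1699920000, "rider": "alice-001"},
               {"ts": 1700000000, "rider": "bob-002"}],
    "file-2": [{"ts": 1700092800, "rider": "carol-003"},
               {"ts": 1700179200, "rider": "dave-004"}],
}

# Column stats on the expression: per-file min and max of to_date(ts).
date_stats = {
    name: (min(to_date(r["ts"]) for r in rows), max(to_date(r["ts"]) for r in rows))
    for name, rows in files.items()
}

# Bloom filter on the expression: per-file filter over prefix(rider).
prefix_blooms = {}
for name, rows in files.items():
    bloom = BloomFilter()
    for r in rows:
        bloom.add(prefix(r["rider"]))
    prefix_blooms[name] = bloom


def prune_by_date(target):
    """Keep only files whose [min, max] range can contain the target date."""
    return [name for name, (lo, hi) in date_stats.items() if lo <= target <= hi]


def prune_by_prefix(target):
    """Keep only files whose bloom filter says the prefix may be present."""
    return [name for name, bloom in prefix_blooms.items() if bloom.might_contain(target)]
```

The min/max check works here because ISO-formatted dates sort the same way as strings and as dates. The bloom filter can only rule files out: a positive answer still requires reading the file, so a pruned list may contain extra files but never misses the right one.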

Hudi also provides an async indexing mechanism that builds indexes in the background, so read and write operations continue uninterrupted. Indexes can be managed either through SQL DDL commands or through programmatic configurations. As Hudi evolves, the indexing subsystem's capabilities will likely expand to handle complex queries more efficiently.
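
The general pattern behind background index building can be sketched with a thread and a lock. This is an illustration of the idea under stated assumptions, not Hudi's indexer: the `Table` class, its methods, and the catch-up step are hypothetical, and a dictionary copy stands in for the expensive build.

```python
import threading


class Table:
    """Toy table whose index is built in the background while writes continue."""

    def __init__(self):
        self.rows = {}        # record key -> file name (the data itself)
        self.index = None     # stays None until the background build finishes
        self._pending = []    # writes that arrived while the build was running
        self._building = False
        self._lock = threading.Lock()

    def write(self, key, file_name):
        with self._lock:
            self.rows[key] = file_name
            if self._building:
                self._pending.append((key, file_name))  # applied during catch-up
            elif self.index is not None:
                self.index[key] = file_name             # index is live: keep it current

    def build_index_async(self):
        with self._lock:
            snapshot = dict(self.rows)  # the build covers data up to this point
            self._building = True
            self._pending = []

        def worker():
            built = dict(snapshot)      # the slow part runs without holding the lock
            with self._lock:
                for key, file_name in self._pending:    # catch up on concurrent writes
                    built[key] = file_name
                self._pending = []
                self.index = built
                self._building = False

        thread = threading.Thread(target=worker)
        thread.start()
        return thread

    def lookup(self, key):
        with self._lock:
            if self.index is not None:
                return self.index.get(key)
            return self.rows.get(key)   # index not ready: fall back to the data


table = Table()
table.write("k1", "file-1")
builder = table.build_index_async()
table.write("k2", "file-2")             # not blocked by the running build
builder.join()
```

Whether the second write lands during or after the build, it ends up in the index: either through the pending list or through the direct update once the index is live.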
