6 min read
|
Saved February 14, 2026
|
Copied!
Do you care about this?
Yelp revamped its data infrastructure by transitioning to a streaming lakehouse architecture on AWS. This change addressed latency issues, reduced operational complexity, and improved data governance, resulting in analytics data processing times dropping from 18 hours to minutes.
If you do, here's more
Yelp faced significant challenges in managing its data infrastructure, processing over 300 million business reviews and countless other user interactions daily. Their original data pipeline, built in 2015, struggled with latency issues, particularly with Kafka for both data streaming and storage. As their data needs evolved, the system became complex, affecting observability and governance, and making it harder to comply with regulations like GDPR. The reliance on Kafka for ingestion and storage created bottlenecks, leading to analytics data latencies of up to 18 hours.
To address these issues, Yelp transitioned to a streaming lakehouse architecture on AWS, which allowed for real-time processing at a fraction of the previous costs. They migrated from a self-managed Apache Kafka setup to Amazon Managed Streaming for Kafka (Amazon MSK), which reduced operational overhead and improved security. The new architecture, dubbed a "streamhouse," treats streaming as a first-class citizen, enabling minute-level processing latency and more efficient change data capture (CDC) from their MySQL databases. This modernization eliminated the need for Kafka as a permanent storage solution and simplified the overall data ecosystem, allowing for better performance and reduced complexity in their data operations.
Questions about this article
No questions yet.