6 min read | Saved February 14, 2026
This article details the improvements made to the Venice ingestion pipeline at LinkedIn, which now handles over 230 million records per second. It covers key optimizations, challenges with diverse workloads, and strategies for enhancing performance, particularly in bulk loading and active-active replication scenarios.
Venice is LinkedIn's open-source data storage platform, designed for online AI applications. Since its inception in 2016, it has expanded from a few data stores to over 2,600. The platform supports vital features like People You May Know and LinkedIn Learning. Recently, the Venice ingestion pipeline was enhanced to handle over 230 million records per second. This significant performance boost stems from architectural changes, optimizations, and the use of advanced features from related technologies.
The ingestion pipeline allows data to be written to Venice stores via bulk loads or near-real-time updates. A key component is the Venice Push Job (VPJ), which produces data through a map-reduce framework. Performance challenges arise at each stage: producing, consuming, and persisting data. Strategies to improve throughput include increasing the number of partitions, sharing consumer pools across stores for scalability, and issuing concurrent writes across multiple RocksDB instances to improve I/O performance. Memory overhead is minimized by leveraging RocksDB's SSTFileWriter, which generates SST files directly during ingestion instead of routing every record through the write path.
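The shared-consumer-pool idea can be sketched roughly as follows: rather than dedicating one consumer thread per partition, many partitions submit work to a small fixed pool of workers. This is a minimal illustrative sketch, not Venice's actual implementation; all names (`SharedConsumerPool`, `submit`, etc.) are hypothetical.

```python
import queue
import threading

class SharedConsumerPool:
    """Hypothetical sketch: a fixed pool of consumer workers shared
    across many store partitions, instead of one consumer each."""

    def __init__(self, num_workers):
        self.tasks = queue.Queue()
        self.results = []
        self.lock = threading.Lock()
        self.workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for w in self.workers:
            w.start()

    def _run(self):
        while True:
            item = self.tasks.get()
            if item is None:  # poison pill: shut this worker down
                self.tasks.task_done()
                return
            partition, record = item
            with self.lock:   # stand-in for per-partition persistence
                self.results.append((partition, record))
            self.tasks.task_done()

    def submit(self, partition, record):
        self.tasks.put((partition, record))

    def shutdown(self):
        for _ in self.workers:
            self.tasks.put(None)
        for w in self.workers:
            w.join()

# Sixteen partitions share four consumer workers instead of sixteen.
pool = SharedConsumerPool(num_workers=4)
for p in range(16):
    pool.submit(p, f"record-{p}")
pool.shutdown()
print(len(pool.results))  # 16
```

The payoff is that per-consumer overhead (threads, buffers, broker connections) stays constant as the number of hosted partitions grows, which is what makes the approach scale to thousands of stores.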
Venice also supports hybrid data stores, which combine bulk loads with near-real-time updates. Each bulk load creates a new store version and Kafka topic, and real-time updates are then appended to keep the data current. However, this setup introduces challenges around duplicate records and log compaction in RocksDB. Optimizations focus on balancing write, read, and space amplification by tuning compaction triggers and integrating RocksDB's BlobDB for large objects. These adjustments ensure that Venice can efficiently manage the dual demands of high-volume data ingestion and real-time updates.
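The duplicate-handling problem a hybrid store faces can be illustrated with a small sketch: a bulk load supplies the base snapshot of the new version, and real-time updates are replayed on top of it with last-write-wins semantics, so stale duplicates are dropped (conceptually what compaction achieves during persistence). The function and field names here are hypothetical, not Venice APIs.

```python
def materialize_hybrid_view(bulk_load, realtime_updates):
    """bulk_load: dict of key -> value from the latest push.
    realtime_updates: iterable of (key, value, offset) in log order.
    Returns the merged view with last-write-wins per key."""
    view = dict(bulk_load)   # the new store version starts from the push
    last_offset = {}
    for key, value, offset in realtime_updates:
        # Keep only the newest write per key; older duplicates are
        # discarded, mirroring what log compaction does on disk.
        if offset >= last_offset.get(key, -1):
            view[key] = value
            last_offset[key] = offset
    return view

snapshot = {"alice": 1, "bob": 2}
updates = [("bob", 3, 10), ("carol", 4, 11), ("bob", 5, 12)]
print(materialize_hybrid_view(snapshot, updates))
# {'alice': 1, 'bob': 5, 'carol': 4}
```

Tracking offsets per key is what makes the merge order-insensitive for replayed or duplicated records, which matters when the real-time feed can deliver the same update more than once.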