6 min read | Saved February 14, 2026
Do you care about this?
This article details how Uber Eats developed its semantic search system to improve order discovery and conversion rates. It covers the architecture, model training, and challenges faced while scaling the platform to handle diverse queries effectively.
If you do, here's more
Search is central to Uber Eats: most orders begin with a user searching for a specific store, dish, or grocery item, so effective search directly drives conversion rates, basket quality, and ordering speed, particularly for complex queries and in multilingual markets. Traditional lexical (keyword-matching) search struggles with variations in wording and context, producing poor results. Semantic search instead models the meaning behind a query, significantly improving the user experience.
The architecture of Uber's semantic search employs a two-tower model that separates query and document embedding computation: query embeddings are generated in real time, while document embeddings are produced in batches. The encoder is a Qwen® model fine-tuned on proprietary data to perform well across different markets. For training, they use PyTorch and Ray for distributed computation, handling large datasets efficiently with techniques like mixed-precision training and gradient accumulation. The system incorporates a sophisticated indexing strategy built on Apache Lucene® Plus, with separate indices for restaurants and grocery retail to handle the scale of data they manage.
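The split described above can be sketched with a toy two-tower model. This is a minimal illustration, not Uber's implementation: the tiny `Tower` encoder, vocabulary size, and embedding dimension are all hypothetical stand-ins for the fine-tuned Qwen encoder.

```python
# Hypothetical sketch of a two-tower retrieval setup: the document tower runs
# offline in batch to build an index; the query tower runs online per request.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Toy encoder mapping token-id sequences to a unit-norm embedding."""
    def __init__(self, vocab_size=10_000, dim=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools token embeddings
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):
        x = self.proj(self.embed(token_ids))
        return F.normalize(x, dim=-1)  # unit vectors: dot product == cosine similarity

query_tower, doc_tower = Tower(), Tower()

# Offline/batch: embed the document corpus once and store it in an index.
docs = torch.randint(0, 10_000, (5, 12))   # 5 documents, 12 token ids each
doc_vecs = doc_tower(docs)                 # (5, 128), precomputed

# Online/real time: embed one query and score it against the precomputed index.
query = torch.randint(0, 10_000, (1, 12))
scores = query_tower(query) @ doc_vecs.T   # (1, 5) cosine similarities
best = scores.argmax(dim=-1)               # index of the top-scoring document
```

Because only the cheap query-side encoding happens per request, serving cost is dominated by nearest-neighbor lookup over the precomputed document vectors rather than by model inference.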
Uber faced several challenges in optimizing their semantic search, chiefly balancing retrieval accuracy against operational cost by tuning parameters such as the number of nearest neighbors (K) and the embedding dimension. For instance, reducing K from 1,200 to 200 cut latency by 34% and CPU usage by 17% while preserving recall quality. They also explored quantization strategies, finding that scalar quantization cut compute costs significantly while keeping recall above 0.95. Matryoshka Representation Learning let them flexibly shrink embedding sizes, trading speed against quality without retraining models, and locale-aware pre-filters narrowed candidate sets so that retrieval stayed fast even over massive document collections.
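Two of the cost levers above can be illustrated concretely: Matryoshka-style truncation (keeping only a prefix of each embedding) and scalar quantization (storing int8 instead of float32). This is a hedged sketch with illustrative dimensions, not Uber's actual pipeline.

```python
# Illustrative sketch: shrinking an embedding index via Matryoshka truncation
# and scalar quantization. Sizes (1000 docs, 768 -> 256 dims) are made up.
import numpy as np

rng = np.random.default_rng(0)
full = rng.normal(size=(1000, 768)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)    # unit-normalize

# Matryoshka Representation Learning trains embeddings so that a prefix is
# itself a usable embedding: slice off a prefix, then re-normalize.
small = full[:, :256]                                   # 3x smaller index
small /= np.linalg.norm(small, axis=1, keepdims=True)

# Scalar quantization: map each float32 component to one int8 bucket.
scale = np.abs(small).max() / 127.0
q = np.clip(np.round(small / scale), -127, 127).astype(np.int8)

# Dequantized vectors approximate the originals within half a bucket width,
# which is why recall stays high while storage drops 4x.
approx = q.astype(np.float32) * scale
err = np.abs(approx - small).max()
```

Combined, truncation plus int8 storage shrinks this toy index roughly 12x (768 float32 values down to 256 int8 values per document), which is the kind of memory/compute saving the article attributes to these techniques.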