5 min read | Saved February 14, 2026
Do you care about this?
This article explains how LinkedIn improved the response time of its Hiring Assistant AI by implementing speculative decoding. The technique drafts several tokens ahead cheaply and verifies them in a single parallel pass, significantly reducing latency while preserving output quality.
If you do, here's more
Large language models (LLMs) can struggle with latency, particularly in real-time applications like LinkedIn’s Hiring Assistant, which requires quick, conversational responses. To tackle this challenge, LinkedIn implemented speculative decoding, a technique that improves text generation speed without sacrificing quality. Speculative decoding drafts multiple tokens ahead using a cheap predictor, then verifies them against the base model in a single parallel pass, amortizing the cost of each base-model forward pass over several tokens. If a drafted token fails verification, the system discards it and everything after it, keeping only the verified prefix, so the output remains identical to what the base model would have produced on its own.
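The draft-then-verify loop described above can be sketched as follows. This is a minimal greedy-decoding illustration with toy stand-in "models" (`draft_model` and `base_model` are hypothetical placeholders, not LinkedIn's actual stack):

```python
def base_model(context):
    """Toy base model: next token is the context length mod 5."""
    return len(context) % 5

def draft_model(context):
    """Toy cheap draft model: here it happens to agree with the base model."""
    return len(context) % 5

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the base model.

    Returns the tokens emitted this step: the accepted prefix of the draft
    plus one base-model token (a correction on mismatch, or a bonus token
    when every draft is accepted).
    """
    # 1. Draft k tokens autoregressively with the cheap predictor.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify: the base model checks all k positions (in one parallel
    #    pass in a real system; sequential here for clarity) and we keep
    #    the longest matching prefix.
    accepted = []
    ctx = list(context)
    for t in drafted:
        expected = base_model(ctx)
        if t != expected:
            accepted.append(expected)  # fall back to the base model's token
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. All drafts accepted: the verification pass also yields one free token.
    accepted.append(base_model(ctx))
    return accepted

out = speculative_step([0, 1, 2], k=4)
print(out)  # 5 tokens from one verification step when every draft is accepted
```

Because rejected drafts are truncated at the first mismatch and replaced with the base model's own token, the final text is exactly what plain autoregressive decoding would have produced, just generated in fewer base-model passes.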
In practice, LinkedIn used two approaches for speculative decoding: n-gram speculation and draft-model speculation. For Hiring Assistant, n-gram speculation proved to be the best fit because the output often follows structured patterns and reuses phrases from job descriptions and candidate profiles. This method leverages the predictable nature of the inputs to achieve high acceptance rates for speculative tokens. LinkedIn configured several parameters to optimize performance, including the number of speculative tokens drafted and the maximum and minimum lengths of n-grams used.
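N-gram speculation needs no draft model at all: it proposes tokens by matching the most recent n-gram of the context against earlier text and copying what followed it there. The sketch below is illustrative; the function name and parameter defaults are assumptions, though the three knobs mirror the parameters the article mentions (number of speculative tokens, maximum and minimum n-gram length):

```python
def ngram_draft(tokens, num_speculative=4, max_ngram=3, min_ngram=1):
    """Propose draft tokens by finding the latest context n-gram earlier
    in the sequence and reusing the tokens that followed it."""
    for n in range(max_ngram, min_ngram - 1, -1):  # prefer longer matches
        if len(tokens) < n:
            continue
        tail = tokens[-n:]
        # Scan earlier occurrences of the tail, most recent first.
        for i in range(len(tokens) - n - 1, -1, -1):
            if tokens[i:i + n] == tail:
                continuation = tokens[i + n:i + n + num_speculative]
                if continuation:
                    return continuation
    return []  # no match: fall back to ordinary decoding

# Job-description-style text reuses phrases, so the lookup often succeeds.
text = "the candidate must have python experience and the candidate must".split()
print(ngram_draft(text))  # → ['have', 'python', 'experience', 'and']
```

This is why acceptance rates are high for Hiring Assistant: when the model is about to repeat a phrase from a job description or profile, the copied continuation is usually exactly what it would have generated anyway.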
The results of these optimizations were significant. LinkedIn reported nearly four times the throughput at the same output quality and under the same strict latency requirements, along with a 66% reduction in P90 end-to-end latency, allowing the Hiring Assistant to handle more conversations simultaneously without delays. Because verification is a single inexpensive pass, the approach is cost-effective, and the final output stays true to the base model’s distribution. N-gram speculation is especially useful for tasks that involve repetitive phrases or structured outputs, such as summarization and multi-turn conversations, offering a straightforward performance boost without the complexity of running a second draft model.
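A back-of-envelope calculation shows how acceptance rate drives these gains. Under the standard speculative-decoding analysis (this is the textbook formula, not LinkedIn's reported numbers), if k tokens are drafted per step and each is accepted independently with probability p, the expected number of tokens emitted per base-model verification pass is a geometric sum:

```python
def expected_tokens_per_step(p, k):
    """E[tokens] = 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p),
    counting the accepted draft prefix plus the base model's own token."""
    if p == 1.0:
        return k + 1.0
    return (1.0 - p ** (k + 1)) / (1.0 - p)

# Plain autoregressive decoding yields 1 token per base-model pass; high
# acceptance rates, as with structured hiring text, approach k + 1.
for p in (0.5, 0.8, 0.95):
    print(f"p={p}: {expected_tokens_per_step(p, k=4):.2f} tokens/step")
```

At p = 0.95 and k = 4, each verification pass yields about 4.5 tokens, which is the mechanism behind a near-4x throughput improvement.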