7 min read
|
Saved February 14, 2026
Do you care about this?
The article explains how prompt caching works in large language models (LLMs) like those from OpenAI and Anthropic. It details the process of tokenization and embedding, illustrating how caching reduces costs and latency. The author shares insights from personal testing and dives into the mechanics behind LLM operations.
If you do, here's more
Cached input tokens on OpenAI's and Anthropic's APIs currently cost one tenth the price of regular input tokens. Anthropic claims that prompt caching can cut latency by up to 85% for lengthy prompts, which matches personal tests showing large reductions in time-to-first-token latency when every input token is cached. Understanding what a cached token actually is, and why it benefits both providers and users, requires some technical insight into large language models (LLMs).
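The pricing effect is easy to quantify. The sketch below uses a placeholder price of $3 per million input tokens (an assumption, not either provider's actual rate); only the 10x cache discount ratio comes from the article.

```python
# Hypothetical per-million-token price; the 10x cache discount matches
# the ratio described above. Actual prices vary by provider and model.
REGULAR_PRICE_PER_MTOK = 3.00                         # USD, assumed
CACHED_PRICE_PER_MTOK = REGULAR_PRICE_PER_MTOK / 10   # 10x cheaper when cached

def input_cost(total_tokens: int, cached_tokens: int) -> float:
    """Cost of one request, split into cached and uncached input tokens."""
    uncached = total_tokens - cached_tokens
    return (uncached * REGULAR_PRICE_PER_MTOK
            + cached_tokens * CACHED_PRICE_PER_MTOK) / 1_000_000

# A 50,000-token prompt where 45,000 tokens hit the cache:
cold = input_cost(50_000, 0)
warm = input_cost(50_000, 45_000)
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}  saved: {1 - warm / cold:.0%}")
# cold: $0.1500  warm: $0.0285  saved: 81%
```

Even with only 90% of the prompt cached, the input bill drops by roughly four fifths, which is why long shared prefixes (system prompts, tool definitions) are worth structuring for cache hits.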
At its core, an LLM is a complex mathematical function processing sequences of numbers as input to produce output. The process involves several stages, starting with tokenization, where text is broken down into smaller chunks, each assigned a unique integer ID. For example, the phrase "Check out ngrok.ai" gets tokenized into specific integers, ensuring consistent representation across requests. Tokens are critical since they serve as the basic units for both input and output. When generating responses, LLMs stream output one token at a time, allowing for a more interactive experience, even though full responses can take a while to generate.
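The key property for caching is that tokenization is deterministic: the same text always maps to the same integer IDs. The toy tokenizer below illustrates that mapping with an invented vocabulary and IDs; real LLM tokenizers use byte-pair encoding over vocabularies of roughly 100k entries, not a hand-written dictionary.

```python
# Toy illustration of tokenization: a fixed vocabulary maps text chunks
# to integer IDs. The chunks and IDs here are invented for illustration;
# production tokenizers (e.g. BPE) learn their vocabulary from data.
VOCAB = {"Check": 4061, " out": 704, " ngrok": 93022, ".ai": 13}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization against the toy vocabulary."""
    ids = []
    while text:
        for chunk in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(chunk):
                ids.append(VOCAB[chunk])
                text = text[len(chunk):]
                break
        else:
            raise ValueError(f"no token covers: {text!r}")
    return ids

print(tokenize("Check out ngrok.ai"))  # same IDs every time: [4061, 704, 93022, 13]
```

Because identical text yields identical token IDs, a provider can recognize that the start of a new request matches a prefix it has already processed, which is the precondition for a cache hit.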
The article emphasizes that prompt caching happens in the attention mechanism of the transformers inside LLMs. For each prior token, the model computes key and value tensors that let later tokens attend to it; this per-token state is what allows the model to retain context and produce coherent responses, and reusing it for an already-seen prompt prefix is what prompt caching means in practice. Each iteration appends the output token to the input, so the model always has the full context, and a response ends when the model emits a special stop token, like the end token in GPT-5. Understanding these mechanics clarifies how prompt caching both speeds up responses and reduces the cost of LLM usage.
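The generation loop described above can be sketched schematically. The "model" here is a stand-in lambda that replays a scripted reply, and the cache stores token IDs rather than real attention keys/values, but the loop structure is the point: feed the prompt once, process only tokens not already cached, append each output token to the context, and stop on the end token.

```python
# Schematic autoregressive decoding loop with a (toy) key/value cache.
# `model_step` is a hypothetical stand-in for a real transformer forward
# pass; the cache here holds token IDs, where a real KV cache would hold
# per-token attention key/value tensors.
END_TOKEN = -1  # placeholder for the model's special end-of-response ID

def generate(prompt_ids, model_step, kv_cache=None):
    """Feed the prompt, then append each output token and iterate."""
    context = list(prompt_ids)
    kv_cache = kv_cache if kv_cache is not None else {}
    output = []
    while True:
        # With a warm cache, only tokens past the cached prefix are processed.
        for pos in range(len(kv_cache), len(context)):
            kv_cache[pos] = context[pos]  # real caches store keys/values here
        next_id = model_step(context)
        if next_id == END_TOKEN:
            return output
        output.append(next_id)
        context.append(next_id)  # the output token becomes part of the input

# A scripted "model" that emits three tokens and then stops:
script = iter([101, 102, 103, END_TOKEN])
print(generate([4061, 704], lambda ctx: next(script)))  # [101, 102, 103]
```

On the first request the cache is cold and every prompt token must be processed; a second request sharing the same prefix skips straight past the cached positions, which is where the latency and cost savings come from.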