1 link tagged with all of: tokenization + kv-cache + transformer + quantization
Links
This article breaks down how an LLM turns your prompt into streamed tokens, covering tokenization, embeddings, transformer attention, and the two-phase pipeline of compute-bound prefill and memory-bound decode. It explains KV caching, quantization, and metrics like Time to First Token and Inter-Token Latency to show why inference speed depends on both compute and memory.
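As a rough illustration of the metrics and the memory pressure the summary mentions, here is a minimal Python sketch. It is not taken from the linked article: the function names, the `LatencyMetrics` type, and the model shape in the usage note are hypothetical. It assumes token arrival timestamps in seconds and a standard attention layout that stores one key and one value vector per token, per layer, per KV head.

```python
from dataclasses import dataclass


@dataclass
class LatencyMetrics:
    ttft: float      # Time to First Token: request start to first token, in seconds
    mean_itl: float  # mean Inter-Token Latency: average gap between consecutive tokens


def latency_metrics(request_start: float, token_times: list[float]) -> LatencyMetrics:
    """Compute TTFT and mean ITL from token arrival timestamps (hypothetical helper)."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT is dominated by prefill: the whole prompt is processed before token one.
    ttft = token_times[0] - request_start
    # ITL reflects decode: one new token per step, reading weights and the KV cache.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return LatencyMetrics(ttft=ttft, mean_itl=mean_itl)


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int, batch: int = 1) -> int:
    """Estimate KV cache size; the factor 2 covers one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value
```

For a hypothetical model with 32 layers, 32 KV heads, and a head dimension of 128, `kv_cache_bytes(32, 32, 128, 4096, 2)` gives 2 GiB for a 4096-token sequence at 16-bit precision, and halving `bytes_per_value` to 1 (an 8-bit quantized cache) halves that figure. This is one way to see why decode speed is tied to memory capacity and bandwidth rather than raw compute.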