Neat. Very similar to tree-based speculation, as they point out, and they also point out how to combine the two.
Speculative decoding: Sample a linear output (next n tokens) from draft model, submit it to a verifier model. At some index the verifier might reject a token and say that no, actually the next token should be this other token instead ("bonus token" in this paper), and that's your output. Or if it accepts the whole draft, you still get a bonus token as the next token past the draft. Then you draft again from that prefix on.
Tree-based speculation: Sample a tree of outputs from draft model, submit whole tree to verifier, pick longest accepted prefix (and its bonus token).
Speculative speculative decoding: Sample a linear output from draft model, then in parallel both verify it with the verifier model, and produce a tree of drafts branching out from different rejection points and different choices of bonus tokens at those points. When the verifier finishes, you might have a new draft ready to submit right away.
Combined: Sample a tree from the draft model, submit the whole tree to the verifier and in parallel also plan out drafts for different rejection points with different bonus tokens anywhere in the tree.
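To make the linear and "speculative speculative" variants above concrete, here is a minimal toy sketch in Python. It is not the paper's implementation. It assumes greedy decoding (real systems use rejection sampling over probabilities), and everything in it is made up for illustration: the `target` and `draft` toy models over integer tokens, the `guess_bonus` candidate function, and the sequential loop that stands in for work that would really run in parallel on separate devices.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[List[int]], int]  # hypothetical interface: context -> greedy next token

def draft_tokens(draft: Model, prefix: List[int], n: int) -> List[int]:
    """Greedily sample a linear draft of n tokens from the draft model."""
    out: List[int] = []
    for _ in range(n):
        out.append(draft(prefix + out))
    return out

def verify(target: Model, prefix: List[int], d: List[int]) -> Tuple[int, int]:
    """Return (accepted_count, bonus_token). A real verifier scores every
    position in one batched forward pass; this toy version loops instead."""
    for i, tok in enumerate(d):
        expected = target(prefix + d[:i])
        if expected != tok:
            return i, expected            # rejection: bonus token replaces d[i]
    return len(d), target(prefix + d)     # full accept: bonus is the token past the draft

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_draft: int, max_new: int) -> List[int]:
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        d = draft_tokens(draft, seq, n_draft)
        k, bonus = verify(target, seq, d)
        seq += d[:k] + [bonus]
    return seq[:len(prompt) + max_new]

def predraft(draft: Model, prefix: List[int], d: List[int],
             guess_bonus: Callable[[List[int]], List[int]],
             n_draft: int) -> Dict[Tuple[int, int], List[int]]:
    """Pre-compute the next draft for each (rejection point, guessed bonus token)."""
    cache: Dict[Tuple[int, int], List[int]] = {}
    for k in range(len(d) + 1):
        ctx = prefix + d[:k]
        for b in guess_bonus(ctx):
            cache[(k, b)] = draft_tokens(draft, ctx + [b], n_draft)
    return cache

def ssd_decode(target: Model, draft: Model, prompt: List[int], n_draft: int,
               max_new: int, guess_bonus: Callable[[List[int]], List[int]]
               ) -> Tuple[List[int], int]:
    seq = list(prompt)
    d = draft_tokens(draft, seq, n_draft)
    hits = 0
    while len(seq) - len(prompt) < max_new:
        # In a real system these two calls would overlap in time.
        cache = predraft(draft, seq, d, guess_bonus, n_draft)
        k, bonus = verify(target, seq, d)
        seq += d[:k] + [bonus]
        if (k, bonus) in cache:           # pre-drafted continuation is ready
            d = cache[(k, bonus)]
            hits += 1
        else:                             # guessed wrong: fall back to drafting now
            d = draft_tokens(draft, seq, n_draft)
    return seq[:len(prompt) + max_new], hits

# Toy models: the target counts upward mod 10; the draft agrees except after a 4.
def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10

def draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def guess_bonus(ctx: List[int]) -> List[int]:
    # Stand-in for "the draft model's top-k alternatives" at this position.
    top = draft(ctx)
    return [top, (top + 5) % 10]

def autoregressive(model: Model, prompt: List[int], max_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(model(seq))
    return seq

if __name__ == "__main__":
    print(speculative_decode(target, draft, [0], 3, 12))
    print(ssd_decode(target, draft, [0], 3, 12, guess_bonus))
```

The thing to check is that both decoders produce exactly what plain autoregressive decoding with the target would, since the draft only changes how fast you get there. The cache keyed by (accepted count, bonus token) is the "draft for different rejection points and bonus tokens" idea in miniature: a hit means the next draft is already waiting when verification finishes.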
> Our implementation is up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding with open source inference engines
Note that a similar idea had already been suggested by Shen et al. (2025) in Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism (https://arxiv.org/abs/2506.01979), but with lower performance.
This is interesting stuff. I wonder if these sorts of tricks are already in use at the big labs.
Incidentally, I would recommend trying to implement speculative decoding yourself if you really want to understand LLM inference internals (that, and KV caching of course). I tried it over the Christmas holidays and it was a wonderful learning experience. (And hard work, especially because I forced myself to do it by hand without coding agent assistance.)
I think this matters more at lower batch sizes (local LLMs and private enterprise deployments, where there won't be enough concurrent users at any given time to fill a big batch), since it moves you from a memory I/O bottleneck toward a compute one.