
Every time you send a request to a model you're already providing all of the context history along with it. To edit the context, just send a different context history. You can send whatever you want as history, it's entirely up to you and entirely arbitrary.

We only think in conversational turns because that's what we've expected a conversation to 'look like'. But that's just a very deeply ingrained convention.

Forget that there is such a thing as 'turns' in a LLM convo for now, imagine that it's all 'one-shot'.

So you ask A, it responds A1.

But when you ask B, and expect B1 - which depends on A and A1 already being in the convo history - consider that you are actually sending all of that again anyhow.

Behind the scenes when you think you're sending just 'B' (next prompt) you're actually sending A + A1 + B aka including the history.
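
For example - a minimal sketch, assuming an OpenAI-style chat completions API (the client setup and model name are just placeholders):

    # Minimal sketch, assuming an OpenAI-style chat completions API.
    # The model name is a placeholder; the point is that every request
    # carries the full history.
    from openai import OpenAI

    client = OpenAI()

    history = [
        {"role": "user", "content": "A"},        # your first prompt
        {"role": "assistant", "content": "A1"},  # the model's first reply
        {"role": "user", "content": "B"},        # the 'next' prompt
    ]

    # What feels like "just sending B" is actually sending A + A1 + B.
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    print(resp.choices[0].message.content)       # B1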

A and A1 are usually 'cached', but that's just an optimization; it's not needed for the simplest way of thinking about this.

Without caching the model would just process all of A + A1 + B and return B1 just the same.

And then A + A1 + B + B1 + C and expect C1 in return.

It just so happens it will cache the state of the convo at your previous turn, so it's optimized - but the key insight is that you can send whatever context you want at any time.

If, after you send A + A1 + B + B1 + C and get C1, you want to then send A + B + C + D and expect D1 (basically sending the prompts with no responses) - you can totally do that. It will have to re-process all of that, aka no cached state, but it will definitely do it for you.
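
As a sketch of that, with the same assumed OpenAI-style API as above - just build whatever history you want and send it:

    # Sketch: resend an edited history - prompts only, no assistant replies.
    # Nothing here matches a cached prefix, so the provider reprocesses it all.
    from openai import OpenAI

    client = OpenAI()

    edited_history = [
        {"role": "user", "content": "A"},
        {"role": "user", "content": "B"},
        {"role": "user", "content": "C"},
        {"role": "user", "content": "D"},
    ]
    # (Some providers insist on alternating roles; if so, merge these into
    # a single user message - the idea is the same.)
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=edited_history)
    print(resp.choices[0].message.content)  # D1, conditioned only on your prompts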

Heck you can send Z + A + X, or A + A1 + X + Y - or whatever you want.

So in that sense, what you are really doing (if you're using the simplest form of the API) is sending 'a bunch of content' and 'expecting a response'. That's it. Everything is actually 'one shot' (prefill => response). It feels conversational, but that's just a structural and operational convention.

So the very simple answer to your question is: send whatever context you want. That's it.



Bigger context makes responses slower.

Context is limited.

You do not want the cloud provider running context compaction for you if you can control it a lot better yourself.

There are even tips on where to put the question, like "send the content first, then ask the question" vs. "ask the question first, then send the content".
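
For instance, a rough sketch of the two orderings (the document text and the question are just placeholders):

    # Sketch of 'content first, then the question' vs. the reverse ordering.
    # Placeholder text; only the ordering within the prompt is the point.
    document = "...long report text..."

    content_first = (
        f"Here is the document:\n\n{document}\n\n"
        "Question: what are the three main risks it identifies?"
    )

    question_first = (
        "What are the three main risks this document identifies?\n\n"
        f"Document:\n\n{document}"
    )

    # Content-first also keeps the long, stable part of the prompt at the
    # front, which is the part prefix caching can reuse across questions.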


When the history is cached, conversations tend not to get slower, because the LLM can 'continue' from a previous state.

So if there was already A + A1 + B + B1 + C + C1 and you ask 'D' ... well, [A->C1] is saved as state. It costs ~10ms to prepare. Then they add 'D' as your question, and that gets processed 'all tokens at once' in bulk - which is fast.

Then, when they generate D1 (the response), they have to do it one token at a time, which is slow - each token has to be produced separately.

Also - even if they had to redo all of [A->C1] 'from scratch' - it's not that slow, because the entire block of tokens can be processed in one pass.

'Prefill' (aka processing A->C1) is fast, which by the way is why it's 10x cheaper.

So prefill is 10x faster than generation, and cache is 10x cheaper than prefill as a very general rule of thumb.
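
A back-of-envelope sketch of those rules of thumb (every number here is made up, just to show the shape of the arithmetic):

    # Back-of-envelope sketch of the 10x rules of thumb above.
    # All numbers are made up for illustration.
    history_tokens = 20_000   # A..C1, the existing conversation
    new_tokens     = 200      # the new prompt D
    output_tokens  = 500      # the response D1

    decode_rate  = 50                 # output tokens/sec, one at a time (made up)
    prefill_rate = decode_rate * 10   # "prefill is ~10x faster than generation"

    # No cache: reprocess the whole history in one pass, then generate.
    t_no_cache = (history_tokens + new_tokens) / prefill_rate + output_tokens / decode_rate

    # Cached: only the new prompt gets prefilled, then generate.
    t_cached = new_tokens / prefill_rate + output_tokens / decode_rate

    print(f"no cache: ~{t_no_cache:.1f}s   cached: ~{t_cached:.1f}s")
    # Generation dominates once the history is cached; on the pricing side,
    # cached input tokens are billed roughly 10x cheaper than fresh prefill.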


That's only the case with the KV cache, and we don't know how, or for how long, providers keep it.


Prefill is 10x faster than generation without caching, and 100x faster with caching - as a very crude measure. So it's not really a matter of 'only the case'; those are just different scenarios. Some hosts are better than others at managing caching, but the better ones provide a decent SLA on that.


This is how I view it as well.

And... and...

This results in a _very_ deep implication, which big companies may not be eager to let you see:

they are context processors

Take it for what it is.


Are you trying to say that they are plagiarists and are training on the input?

We know that already; I don't know why we have to be quiet or hint at it - in fact they have been quite explicit about it.

Or is there some other context to your statement? Anyway that’s my “take that for what you will”.




