
Every time you send a request to a model you're already providing all of the context history along with it. To edit the context, just send a different context history. You can send whatever you want as history, it's entirely up to you and entirely arbitrary.

We only think in conversational turns because that's what we've expected a conversation to 'look like'. But that's just a very deeply ingrained convention.

Forget that there is such a thing as 'turns' in a LLM convo for now, imagine that it's all 'one-shot'.

So you ask A, it responds A1.

But when you ask B, and expect B1 - which depends on A and A1 already being in the convo history - consider that you are actually sending all of that again anyhow.

Behind the scenes when you think you're sending just 'B' (next prompt) you're actually sending A + A1 + B aka including the history.
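
For example - a minimal sketch, assuming an OpenAI-style chat completions API (the client setup and model name are just placeholders):

    # Minimal sketch, assuming an OpenAI-style chat completions API.
    # The model name is a placeholder; the point is that every request
    # carries the full history.
    from openai import OpenAI

    client = OpenAI()

    history = [
        {"role": "user", "content": "A"},        # your first prompt
        {"role": "assistant", "content": "A1"},  # the model's first reply
        {"role": "user", "content": "B"},        # the 'next' prompt
    ]

    # What feels like "just sending B" is actually sending A + A1 + B.
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    print(resp.choices[0].message.content)       # B1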

A and A1 are usually 'cached', but that's just an optimization; it's not needed for the simplest way of thinking about this.

Without caching the model would just process all of A + A1 + B and return B1 just the same.

And then A + A1 + B + B1 + C and expect C1 in return.

It just so happens it will cache the state of the convo at your previous turn, so it's optimized - but the key insight is that you can send whatever context you want at any time.

If, after you send A + A1 + B + B1 + C and get C1, you want to then send A + B + C + D and expect D1 (basically sending the prompts with no responses) - you can totally do that. It will have to re-process all of that, aka no cached state, but it will definitely do it for you.
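
As a sketch of that, with the same assumed OpenAI-style API as above - just build whatever history you want and send it:

    # Sketch: resend an edited history - prompts only, no assistant replies.
    # Nothing here matches a cached prefix, so the provider reprocesses it all.
    from openai import OpenAI

    client = OpenAI()

    edited_history = [
        {"role": "user", "content": "A"},
        {"role": "user", "content": "B"},
        {"role": "user", "content": "C"},
        {"role": "user", "content": "D"},
    ]
    # (Some providers insist on alternating roles; if so, merge these into
    # a single user message - the idea is the same.)
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=edited_history)
    print(resp.choices[0].message.content)  # D1, conditioned only on your prompts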

Heck you can send Z + A + X, or A + A1 + X + Y - or whatever you want.

So in that sense, what you are really doing (if you're using the simplest form of the API) is sending 'a bunch of content' and 'expecting a response'. That's it. Everything is actually 'one shot' (prefill => response). It feels conversational, but that's just a structural and operational convention.

So the very simple answer to your question is: send whatever context you want. That's it.



Bigger context makes responses slower.

Context is limited.

You do not want the cloud provider running context compaction for you if you can control it a lot better yourself.

There are even tips on where to put the question, like "send the content first, then ask the question" vs. "ask the question first, then send the content".
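
For instance, a rough sketch of the two orderings (the document text and the question are just placeholders):

    # Sketch of 'content first, then the question' vs. the reverse ordering.
    # Placeholder text; only the ordering within the prompt is the point.
    document = "...long report text..."

    content_first = (
        f"Here is the document:\n\n{document}\n\n"
        "Question: what are the three main risks it identifies?"
    )

    question_first = (
        "What are the three main risks this document identifies?\n\n"
        f"Document:\n\n{document}"
    )

    # Content-first also keeps the long, stable part of the prompt at the
    # front, which is the part prefix caching can reuse across questions.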


When the history is cached, conversations tend not to get slower, because the LLM can 'continue' from a previous state.

So if there was already A + A1 + B + B1 + C + C1 and you ask 'D' ... well, [A->C1] is saved as state. It costs ~10ms to prepare. Then they add 'D' as your question, and that gets processed 'all tokens at once' in bulk - which is fast.

Then, when they generate D1 (the response), they have to do it one token at a time, which is slow - each token has to be produced separately.

Also - even if they had to redo all of [A->C1] 'from scratch' - it's not that slow, because the entire block of tokens can be processed in one pass.

'Prefill' (aka processing A->C1) is fast, which by the way is why it's 10x cheaper.

So prefill is 10x faster than generation, and cache is 10x cheaper than prefill as a very general rule of thumb.
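
A back-of-envelope sketch of those rules of thumb (every number here is made up, just to show the shape of the arithmetic):

    # Back-of-envelope sketch of the 10x rules of thumb above.
    # All numbers are made up for illustration.
    history_tokens = 20_000   # A..C1, the existing conversation
    new_tokens     = 200      # the new prompt D
    output_tokens  = 500      # the response D1

    decode_rate  = 50                 # output tokens/sec, one at a time (made up)
    prefill_rate = decode_rate * 10   # "prefill is ~10x faster than generation"

    # No cache: reprocess the whole history in one pass, then generate.
    t_no_cache = (history_tokens + new_tokens) / prefill_rate + output_tokens / decode_rate

    # Cached: only the new prompt gets prefilled, then generate.
    t_cached = new_tokens / prefill_rate + output_tokens / decode_rate

    print(f"no cache: ~{t_no_cache:.1f}s   cached: ~{t_cached:.1f}s")
    # Generation dominates once the history is cached; on the pricing side,
    # cached input tokens are billed roughly 10x cheaper than fresh prefill.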


That's only the case with the KV cache, and we don't know how, or for how long, providers keep it.


Prefill is 10x faster than generation without caching, and 100x faster with caching - as a very crude measure. So it's not really a matter of 'only the case'; those are just different scenarios. Some hosts are better than others at managing caching, but the better ones provide a decent SLA on that.


This is how I view it as well.

And... and...

This results in a _very_ deep implication, which big companies may not be eager to let you see:

they are context processors

Take it for what it is.


Are you trying to say that they are plagiarists and are training on the input?

We know that already; I don't know why we have to be quiet or hint at it - in fact they have been quite explicit about it.

Or is there some other context to your statement? Anyway that’s my “take that for what you will”.




