The main constraint on consumer GPUs is the VRAM - you can pretty much always do...

j-pb · on Dec 2, 2024

You can run Models up to 128GB on a MacBook Pro Max. So we're already at a point where you can run all but the biggest frontier models on consumer hardware.

ben_w · on Dec 2, 2024

Given the price tag, I don't think I'd call that "consumer" hardware, but rather "professional" hardware.

But perhaps that's just me…

menaerus · on Dec 2, 2024

Yeah, I also think that the ~5k price is quite hefty. It's difficult for me to imagine that running sizeable LLMs on commodity/consumer hardware will be possible without another breakthrough in the field. The prices of GPUs I wouldn't expect to fall if technology proves its worthiness.

robotresearcher · on Dec 2, 2024

You're predicting the price of computer chips will not fall? They're just about the most price-fally truly useful thing in history.

ben_w · on Dec 2, 2024

They have been to date.

Massive increases in demand due to this stuff being really really useful can cause prices to go up even for existing chips (NVIDIA is basically printing money as they can sell all they can make at for as much money as the buyers can get from the investors). I have vague memories of something like this happening with RAM in the late 90s, but perhaps it was just Mac RAM because the Apple market was always its own weird oddity (the Performa 5200 I bought around then was also available in the second hand listings on one of the magazines for twice what I paid for it).

Likewise prices can go up from global trade wars, e.g. like Trump wants for profit and Biden wants specifically to limit access to compute because AI may be risky.

Likewise hot wars right where the chips are being made, say if North Korea starts fighting South Korea again, or if China goes for Taiwan.

menaerus · on Dec 3, 2024

> I have vague memories of something like this happening with RAM in the late 90

We don't even need to go that far in the history. Crypto hype just few years ago skyrocketed the GPU prices.

menaerus · on Dec 2, 2024

Yes, I am.

ElevenLathe · on Dec 2, 2024

I can imagine a world where "good enough" GPGPUs become embedded in common chipsets the same way "good enough" regular GPUs are embedded now, but we're definitely not there yet. That said, it was only a few years between the VooDoo cards coming to market and Intel integrated graphics showing up.

menaerus · on Dec 2, 2024

We already have something similar in terms of HW accelerators for AI workloads in recent CPU designs but that's not enough.

LLM inference workloads are bound by the compute power, sure, but that's not insurmountable IMO. Much bigger challenge is memory. Not even the bandwidth but just a sheer amount of RAM you need to just load the LLM weights.

Specifically, even a single H100 will hardly suffice to host a mid-sized LLM such as llama3.1-70B. And H100 is ~50k.

If that memory amount requirement is there to stay, and with current LLM transformer architecture it is, then what is really left as an only option for affordable consumer HW are only the smallest and least powerful LLMs. I can't imagine having a built-in GPGPU with 80G of on-die memory. IMHO.

supermatt · on Dec 2, 2024

> more consumer hardware

jorvi · on Dec 2, 2024

AMD Radeon series ≥6800 & ≥7800 have 16GB VRAM too.

8jef · on Dec 2, 2024

Even RX 7600 XT has 16GB

jorvi · on Dec 2, 2024

I wonder if a 7600 XT is a cut-down 7800 XT then, because both normal and XT variants of the 6700 and 7700 only have 12GB VRAM.

Nonetheless, great info. Sounds like it might be the budget inference king!

Numerlor · on Dec 2, 2024

Completely different chips; the VRAM differences are from how GDDR can be used, with either 1 or 2 chips on a single 32bit bus, the configuration with 2 chips is called clamshell. The 7800 XT and 7600 XT have same VRAM but the 7800 XT has a 256 bit memory bus while the 7600 XT has a 128 bit memory bus. Meanwhile the 7700 XT with 12 GB is on a 192 bit memory bus.

The workstation edition of GPUs usually do the clamshell configuration so they can easily double the VRAM and ramp up the price by a couple thousand

concerndc1tizen · on Dec 2, 2024

Does this mean that the model will be half the size?

If a 32B model@4bit normally requires 16 GB VRAM, at half the size, it could be run @8bit with 16 GB VRAM?

Isn't that tradeoff a great improvement? I assume the improved bit precision will more than compensate for the loss related to removal?

int_19h · on Dec 2, 2024

There is some improvement going from 4-bit to 8-bit quantization, but if you have VRAM to spare for that, you usually see more benefit from running a 2x larger model at 4-bit. So in scenarios where an LM already fits the existing VRAM budget, I would expect larger models instead.

The other thing is that VRAM is used not just for the weights, but also for prompt processing, and this last part grows proportionally as you increase the context size. For example, for the aforementioned QwQ-32, with base model size of ~18Gb at 4-bit quantization, the full context length is 32k, and you need ~10Gb extra VRAM on top of weights if you intend to use the entirety of that context. So in practice, while 30b models fit into 24Gb (= a single RTX 3090 or 4090) at 4-bit quantization, you're going to run out of VRAM once you get past 8k context. Thus the other possibility is that VRAM saved by tricks like sparse models can be used to push that further - for many tasks, context size is the limiting factor.

bombela · on Dec 2, 2024

For readability, I recommend reserving "b" for bits, "B" for byte, "p" for parameter.

I assume in your post that "30b" meant 30 billion, or in other words, 30Gp (giga-parameter).

Furthermore is 24Gb of VRAM 24 gigabits (power of 10), or 24 gibibits (power of 2)?

int_19h · on Dec 3, 2024

For readability I'm using the same convention that is generally used for these applications, where if you see "-Nb" after a model name, it always refers to the number of parameters. I have never once seen "p" for "parameter", never mind terms like "giga-parameter". Most certainly if you go searching for models on HuggingFace etc, you'll have to deal with "30b" etc terminology whether you like it or not.

With VRAM, this quite obviously refers to the actual amount that high-end GPUs have, and I even specifically listed which ones I have in mind, so you can just look up their specs if you genuinely don't know the meaning in this context.