It used to be that for locally running GenAI, VRAM per dollar was king, so used NVidia RTX 3090 cards were the undisputed darlings of DIY LLM with 24GB for 600€-800€ or so. Sticking two of these in one PC isn't too difficult despite them using 350W each.
Then Apple introduced Macs with 128 GB and more unified memory at 800GB/s and the ability to load models as large as 70GB (70b FP8) or even larger ones. The M1 Ultra was unable to take full advantage of the excellent RAM speed, but with the M2 and the M3, performance is improving. Just be prepared to spend 5000€ or more for an M3 Ultra. Another alternative would be an EPYC 9005 system with 12x DDR5-6000 RAM for 576GB/s of memory bandwidth, with the LLM (preferably MoE) running on the CPU instead of a GPU.
However today, with the latest, surprisingly good reasoning models like QwQ-32B using up thousands or tens of thousands of tokens in their replies, performance is getting more important than previously and these systems (Macs and even RTX 3090s) might fall out of favor, because waiting for a finished reply will take several minutes or even tens of minutes. Nvidia Ampere and Apple silicon (AFAIK) are also missing FP4 support in hardware, which doesn't help.
For the same reason, AMD Strix Halo with a mere 273GB/s of RAM bandwidth, and perhaps also NVidia Project Digits (speculated to offer similar RAM bandwidth), might just be too slow for reasoning models with more than 50GB or so of active parameters.
On the other hand, if prices for the RTX 5090 remain at 3500€, it will likely remain insignificant for the DIY crowd for that reason alone.
Perhaps AMD will take the crown with a variant of their RDNA4 RX 9070 card with 32GB of VRAM priced at around 1000€? Probably wishful thinking…
There are Chinese modded 4090s with 48 and 96 GB of VRAM that seem like a sweet spot for fast inference of these moderately sized models.
For how much? From where would one obtain them? Are they legit? I have found a lot of Chinese knock-offs for various GPUs for super cheap, but they were faulty (in my experience).
The 96GB 4090 is still at the "blurry shots of nvidia-smi on Chinese TikTok" stage.
Granted the 48 GB 4090s started there too before materializing and even becoming common enough to be on eBay, but this time there are more technical barriers and it's less likely they'll ever show up in meaningful numbers.
We've progressed to the "more detailed photos and screen grabs on Twitter" stage: https://x.com/bdsqlsz/status/1898307273967145350
QwQ-32B on 3x 1080 Ti does work with a 4096-token window @ ~14 eval tokens/s.
But the lower bounds on intermediate CoT tokens needed to get close to PTIME expressibility, plus QwQ-32B's verbosity/repetition in those tokens, eat that up pretty quickly.
In theory the DAG walking that QwQ-32B appears to be doing should require O(|E| log|V|) scratch space. IMHO they need to train to target that more.
It seems fairly trivial to extend the context-loss problem to the public UI. The Ω(n) scratch space may be hard to hit in reality, but they seem almost exponential in scratch space right now, even on tasks that a traditional LLM could answer with just approximate retrieval.
I may just be suffering confirmation bias, but the problem with memory in this case seems to be related to making the intermediate tokens palatable to users IMHO.
I've heard however that the EPYC rigs are in practice getting very low tokens/sec, and the Macs like the Ultras with high memory are getting much better - like by an order of magnitude. So in that sense, the only sensible option now (i.e. "local energy efficient LLM on a budget") is to get the Mac.
The previous time this article was submitted, I did some calculations based on the charts and found[1] that for the NVIDIA 40 and 50-series GPUs, the results are almost entirely explained by memory bandwidth:
Each of the cards except the 5090 gets almost exactly 0.1 token/s per GB/s memory bandwidth.
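Back-of-the-envelope, that scaling is just "every generated token streams the whole set of weights from VRAM once", so tokens/s ≈ bandwidth / model size. A quick sketch (the ~80% efficiency factor is purely an assumption on my part):

    # Rough decode-speed estimate: each generated token reads the full weights from VRAM once.
    def est_tokens_per_s(bandwidth_gb_s, params_b, bytes_per_weight, efficiency=0.8):
        model_gb = params_b * bytes_per_weight         # weight bytes resident in VRAM
        return efficiency * bandwidth_gb_s / model_gb  # upper bound on tokens/s

    # 8B model at 8-bit on the cards from the article:
    for name, bw in [("RTX 3090", 936), ("RTX 4090", 1008), ("RTX 5090", 1792)]:
        print(f"{name}: ~{est_tokens_per_s(bw, 8, 1.0):.0f} tok/s")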
My understanding is that the Macs have soldered memory which allows for much higher memory bandwidth. The M4 has ~400-550 GB/s max depending on configuration[2], while EPYCs seem to have more like 250GB/s max[3].
[1]: https://news.ycombinator.com/item?id=42847284
[2]: https://support.apple.com/en-us/121553
[3]: https://www.servethehome.com/here-is-why-you-should-fully-po...
> EPYCs seem to have more like 250GB/s max
Your link goes to info on the 2022 EPYC CPUs, the current generation can do 576GB/s: https://chipsandcheese.com/p/amds-turin-5th-gen-epyc-launche...
Intel's current 12-channel Xeons should be even faster with MRDIMMs, though I couldn't find a memory-specific benchmark.
Ah shoot, that's what one gets for being in a hurry and on the phone. Saw the date of the article and mention of the EPYC 9004, but forgot that it's the 9005 that's the new series and missed the details.
Thanks for the correction.
edit: found a llama.cpp issue discussing performance bottlenecks on modern dual-socket EPYC here[1]. Also includes single-socket benchmarks, and includes some optimizations. Just thought it was interesting.
[1]: https://github.com/ggml-org/llama.cpp/discussions/11733
> However today, with the latest, surprisingly good reasoning models like QwQ-32B using up thousands or tens of thousands of tokens in their replies, performance is getting more important than previously and these systems (Macs and even RTX 3090s) might fall out of favor, because waiting for a finished reply will take several minutes or even tens of minutes.
3090 is the last gen to have proper nvlink support, which is supported for LLM inference in some frameworks.
Would 1x 5090 be faster than 2x 3090?
I was browsing r/localllm recently and was surprised that some people were purchasing two 3060 12GB and using them in tandem somehow. I actually didn't think this would work at all without nvlink.
> I actually didn't think this would work at all without nvlink.
It will work, but with less memory bandwidth. If using hacked drivers with p2p enabled, it will depend on the PCIe topology. Otherwise, the data will take a longer path.
Depending on the model, it may not be that big of a hit in performance. Personally, I have seen higher perf on nvlinked GPUs with models that can fit on 2x 3090 24GB.
Yes, no one seemed to be bragging about much of a performance boost. I think they were happy being able to run models (at any speed) that needed 24GB without having to buy a 3090.
You just copy data over PCIe, however it is slower than nvlink. There are several ways to utilize multiple GPUs; the main contenders, as I understand it, are pipeline parallelism and so-called tensor parallelism. Think of it as slicing a loaf of bread into regular slices vs along its longer axis.
The former can have higher peak throughput while the latter can have lower latency, though it depends on the details[1].
[1]: https://blog.squeezebits.com/vllm-vs-tensorrtllm-9-paralleli...
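A toy numpy sketch of the bread analogy, in case it helps (nothing here is how vLLM or llama.cpp actually implement it; it just shows what each GPU owns and what has to cross the interconnect):

    import numpy as np

    x = np.random.randn(16)                              # activation vector
    W1, W2 = np.random.randn(16, 16), np.random.randn(16, 16)

    # Pipeline parallelism: GPU0 owns layer 1, GPU1 owns layer 2; only the small
    # activation vector crosses PCIe/NVLink between the two layers.
    y_pipeline = W2 @ (W1 @ x)

    # Tensor parallelism: each GPU owns half the output rows of every layer;
    # partial results have to be gathered after each layer instead.
    h = np.concatenate([W1[:8, :] @ x, W1[8:, :] @ x])
    y_tensor = np.concatenate([W2[:8, :] @ h, W2[8:, :] @ h])

    assert np.allclose(y_pipeline, y_tensor)             # same result, different traffic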
> You just copy data over PCIe
AFAIK this only happens directly over PCIe if using hacked drivers with p2p enabled (I think tinygrad/tinybox provided these drivers initially.)
Otherwise, data goes through system bus/CPU first. `nvidia-smi topo -m` will show how the GPUs are connected.
1x 5090 is not enough VRAM.
Does FP4 support really matter if local inferencing is RAM bandwidth limited anyways?
I think we should put more effort into compressing the models well to be able to use local GPUs better.
Well, FP4 (mostly INT4 these days) means you are moving fewer bits per weight around, so the memory bandwidth that you have performs better.
That doesn't make sense to me. Memory bandwidth refers to the throughput of moving data from memory. Even if FP4 calculations aren't natively supported, you still move 4 bits of data per number from memory and then cast it to a FP8 or FP16 or other higher-precision number, right?
Llama.cpp already moves weights as small as ~1.5-bit to the cores; they are converted in SRAM, which has much higher bandwidth.
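To make the "fewer bits per weight moved" point concrete, here's a simplified sketch of 4-bit weight packing/dequantization (real formats like GGUF's Q4 variants use per-block scales and different layouts; this is just the idea):

    import numpy as np

    def pack_int4(q):                       # q: int values in [-8, 7]
        u = (q + 8).astype(np.uint8)        # shift to [0, 15]
        return (u[0::2] << 4) | u[1::2]     # two 4-bit weights per stored byte

    def unpack_int4(packed, scale):
        hi = (packed >> 4).astype(np.int8) - 8
        lo = (packed & 0x0F).astype(np.int8) - 8
        q = np.empty(packed.size * 2, dtype=np.int8)
        q[0::2], q[1::2] = hi, lo
        return q.astype(np.float16) * scale # dequantize right before the math

    w_q = np.random.randint(-8, 8, size=1024)
    packed = pack_int4(w_q)                 # 512 bytes vs 1024 (int8) or 2048 (fp16)
    w = unpack_int4(packed, scale=0.01)
    assert np.allclose(w, w_q * 0.01, atol=1e-3)  # differs only by fp16 rounding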
FP4 only helps batched inferencing.
The 3090 was always a fallacy without native fp8.
Performance for text generation is memory-limited, so lack of native fp8 support does not matter. You have more than enough compute left over to do the math in whichever floating point format you fancy.
Performance is good enough for non-reasoning models even if they're FP8 or FP4. Check the phoronix article, the difference between the 3090 and 4090 is rather small.
There's weight-only FP8 in vLLM on NVidia Ampere: https://docs.vllm.ai/en/latest/features/quantization/fp8.htm...
How? It's not like Nvidia is some also-ran company for which people did not build custom kernels that combine dequantization and GEMV/GEMM in a single kernel.
What? There are highly optimized Marlin kernels for W8A16 that function very well on a 3090 in both FP8 and INT8 formats.
Sometimes I daydream about a world where GPUs just have the equivalent of LPCAMM and you could put in as much RAM as you can afford and as much as the hardware supports, much like is the case with motherboards, even if something along the way would bottleneck somewhat. It'd really extend the life of some hardware, yet companies don't want that.
That said, it's cool that you can even get an L4 with 24 GB of VRAM that actually performs okay, yet is passively cooled and consumes like 70W, at that point you can throw a bunch of them into a chassis and if you haven't bankrupted yourself by then, they're pretty good.
I did try them out on Scaleway, and the pricing isn't even that exorbitant; using consumer GPUs for LLM use cases doesn't quite hit the same since.
SOCAMM will enter production at the end of the year:
https://www.tomshardware.com/pc-components/dram/nvidia-repor...
The CEO of SK Hynix has confirmed that it is in the works:
https://www.mk.co.kr/en/it/11245259
And big companies have started to analyze it, to figure out how it will fit into their domain:
https://x.com/Jukanlosreve/status/1892916771692421228?t=3ikB...
Saw a video from Bolt Graphics, a startup trying to do that among other things. Supposedly they'll have demos at several conventions later this year, like at Hot Chips.
I think it's an incredibly tall order to get into the GPU game for a startup, but it should be good entertainment if nothing else.
https://www.youtube.com/watch?v=8m-gSSIheno
Sounds like hype
> yet companies don't want that
I think there are quite a few constraints you are glossing over here.
Oh, I definitely am! It's always cool when someone with domain specific knowledge drops by and proceeds to shatter that dream with technical reasons that are cool nonetheless, the same way how LPCAMM2 doesn't work with every platform either and how VRAM has pretty stringent requirements.
That said, it's understandable that companies are sticking with whatever works for now, but occasionally you get an immensely cool project that attempts to do something differently, like Intel's Larrabee did, for example.
The benchmark only touches 8B-class models at 8-bit quantization. It would be interesting to see how it fares with models that use more of the card's VRAM, and under varying quantization and context lengths.
I agree. This benchmark should have compared the largest ~4 bit quantized model that fits into VRAM, which would be somewhere around 32B for RTX 3090/4090/5090.
For text generation, which is the most important metric, the tokens per second will scale almost linearly with memory bandwidth (936 GB/s, 1008 GB/s and 1792 GB/s respectively), but we might see more interesting results when comparing prompt processing, speculative decoding with various models, vLLM vs llama.cpp vs TGI, prompt length, context length, text type/programming language (actually makes a difference with speculative decoding), cache quantization and sampling methods. Results should also be checked for correctness (perplexity or some benchmark like HumanEval etc.) to make sure that results are not garbage.
If anyone from Phoronix is reading this, this post might be a good point to get you started: https://old.reddit.com/r/LocalLLaMA/comments/1h5uq43/llamacp...
At time of writing, Qwen2.5-Coder-32B-Instruct-GGUF with one of the smaller variants for speculative decoding is probably the best local model for most programming tasks, but keep an eye out for any new models. They will probably show up in Bartowski's "Recommended large models" list, which is also a good place to download quantized models: https://huggingface.co/bartowski
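For anyone who hasn't looked at what speculative decoding actually does: the core loop is tiny. This is a greedy sketch with made-up draft/target interfaces; real implementations in llama.cpp and vLLM verify against the full sampling distribution rather than exact-matching tokens like this:

    def speculative_decode(target, draft, prompt_tokens, k=4, max_new=256):
        tokens = list(prompt_tokens)
        produced = 0
        while produced < max_new:
            # 1) the cheap draft model proposes k tokens autoregressively
            guess = []
            for _ in range(k):
                guess.append(draft.greedy_next(tokens + guess))
            # 2) the big target model checks all positions in ONE forward pass,
            #    returning its own prediction for positions 0..k (k+1 values)
            predicted = target.greedy_batch(tokens, guess)
            # 3) keep the longest agreeing prefix, plus one "free" token the
            #    target produced itself at the first disagreement (or at the end)
            n = 0
            while n < k and guess[n] == predicted[n]:
                n += 1
            accepted = guess[:n] + [predicted[n]]
            tokens += accepted
            produced += len(accepted)
        return tokens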
Using aider with local models is a very interesting stress case to add on top of this. Because the support for reasoning models is a bit rough, and they aren't always great at sticking to the edit format, what you end up doing is configuring different models for different tasks (what aider calls "architect mode").
I use ollama for this, and I'm getting useful stuff out of qwq:32b as the architect, qwen2.5-coder:32b as the edit model, and dolphin3:8b as the weak model (which gets used for things like commit messages). Now what that means is that performance swapping these models in and out of the card starts to matter, because they don't all go into VRAM at once; but also using a reasoning model means that you need straight-line tokens per second as well, plus well-tuned context length so as not to starve the architect.
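Just to make the division of labour concrete, here's roughly what that architect/editor split looks like if you drive ollama directly from Python instead of through aider. The task string and file/function names are made up, and aider's real edit-format handling does a lot more than this:

    import ollama  # assumes the ollama Python client and a local ollama server

    def ask(model, prompt):
        # response access may differ slightly between ollama-python versions
        resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]

    task = "Add retry-with-backoff to fetch_data() in client.py"  # hypothetical task
    plan = ask("qwq:32b", f"Plan the code changes for this task, no code yet: {task}")
    edits = ask("qwen2.5-coder:32b", f"Turn this plan into concrete edits:\n{plan}")
    msg = ask("dolphin3:8b", f"Write a one-line commit message for:\n{edits}")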
I haven't investigated whether a speculative decoding setup would actually help here, I've not come across anyone doing that with a reasoner before now but presumably it would work.
It would be good to see a benchmark based on practical aider workflows. I'm not aware of one but it should be a good all-round stress test of a lot of different performance boundaries.
There's been some INT4/NVFP4 gains too https://hanlab.mit.edu/blog/svdquant-nvfp4 https://blackforestlabs.ai/flux-nvidia-blackwell/
If my budget is only ~$1000 should I buy a used 3090 or a new 5080? (for AI, I don't care about gaming)
3090; the 24 GB of VRAM will be more relevant than running smaller models even faster on a lower-capacity card (small models already run fast by nature of there being less to copy from memory and compute on).
3090. Models like flux or wanvideo with fp16 text encoders will push your VRAM usage over 20GB. LLMs with functional context (or workflows with multi-model, not modal, tie in) will all break 16GB.
There's no point to performance without context. At its price point, the 5080 is purely a gaming card.
Do you think you can get a 5080 for a thousand? In the Netherlands they're going for €1600 minimum, more like €1800. So I would totally get a new 5080 for $1000.
Correct me if I'm wrong, but I have the impression that we'd usually expect to see bigger efficiency gains while these are marginal?
If so that would confirm the notion that they've hit a ceiling and pushing against physical limitations.
> I have the impression that we'd usually expect to see bigger efficiency gains while these are marginal?
The 50-series is made using the same manufacturing process ("node") as the 40-series, and there is not a major difference in design.
So the 50-series is more like tweaking an engine that previously topped out at 5000 RPM so it's now topping out at 6000 RPM, without changing anything fundamental. Yes it's making more horsepower but it's using more fuel to do so.
It would be very interesting to see these alongside benchmarks for Apple M4, AMD Halo Strix and other AMD cards.
Comparison for previous generations
https://www.hardware-corner.net/guides/gpu-benchmark-large-l...
When running on Apple silicon you want to use MLX, not llama.cpp as this benchmark does. Performance is much better than what's plotted there and seems to be getting better, right?
Power consumption is almost 10x lower for Apple.
VRAM is more than 10x larger.
Price-wise, for running same-size models, Apple is cheaper.
The upper limit (larger models, longer context) is far higher for Apple (for Nvidia you can easily put in 2x cards; more than that becomes a whole complex setup no ordinary person can do).
Am I missing something, or is Apple simply better for local LLMs right now?
Unless something changed recently, I’m not aware of big perf differences between MLX and llama.cpp on Apple hardware.
I'm under the same impression. Llama.cpp's readme used to start with "Apple Silicon as a First Class Citizen", and IIRC Georgi works on Mac himself.
There is a plateau where you simply need more compute and the M4 cores are not enough, so even if they have enough RAM for the model, the tokens/s is not useful.
For all models that fit in 2x 5090 (2x 32GB) that's not a problem, so you could say that if you hit this problem, RTX is also not an option.
On Apple silicon you can always use MoE models, which work beautifully. On RTX it's kind of a waste, to be honest, to run MoE; you'd be better off running a single model with all of its parameters active, sized to fill the available memory (with enough space for the context).
I'm trying to find out about that as well as I'm considering a local LLM for some heavy prototyping. I don't mind which HW I buy, but it's on a relative budget and energy efficiency is also not a bad thing. Seems the Ultra can do 40 tokens/sec on DeepSeek and nothing even comes close at that price point.
The DeepSeek R1 distilled onto Llama and Qwen base models are also unfortunately called “DeepSeek” by some. Are you sure you’re looking at the right thing?
The OG DeepSeek models are hundreds of GB quantized, nobody is using RTX GPUs to run them anyway…
You are missing something. This is a single stream of inference. You can load up the Nvidia card with at least 16 inference streams and get much higher throughput in tokens/sec.
This is just a single-user chat experience benchmark.
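The intuition, roughly: the weights are read from VRAM once per decode step no matter how many sequences share that step, so aggregate tokens/s scales with batch size until compute (or KV-cache traffic) becomes the bound. A toy model with made-up numbers, ignoring KV-cache growth:

    def aggregate_tok_s(batch, bw_gb_s=1792, model_gb=8, compute_cap_tok_s=4000):
        bandwidth_bound = bw_gb_s / model_gb * batch    # weight reads amortized over the batch
        return min(bandwidth_bound, compute_cap_tok_s)  # crude, assumed compute ceiling

    for b in (1, 4, 16, 64):
        print(b, round(aggregate_tok_s(b)))             # 224, 896, 3584, 4000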
And workstation and server-class cards.
Curious if the author checked whether his card has any missing ROPs.
Why would that be relevant? This isn't rasterization.
They might be, depending on architecture - the ROPs are responsible for taking the output from the shader core and writing it back to memory, so can be used in compute shaders even if all the fancier "Raster Operation" modes aren't really used there. No point having a second write pipeline to memory when there's already one there. But if your usecase doesn't really pressure that side of things then even if some are "missing" it might make zero difference, and my understanding of most ML models is they're heavily read bandwidth biased.
P40s are so slept on. 24GB of VRAM for $150.
I think if they had more than 24GB there would be more interest, but there's the hacks and noise of trying to make server hardware work inside a desktop case, and the FP16 performance is extremely poor compared to a 3090 (183.7 GFLOPS vs 35.58 TFLOPS IIRC).
when you see a deal which is too good to be true, it's probably too good to be true!
It's not too good to be true, it just takes a lot of power to run old GPUs to get slow token speeds!
What are the factors that make the P40 too good to be true?
Convenience, support, and performance.
Ampere is the floor, in practice, which is what effectively makes the "buy a Mac Studio" crowd P40 people with 10x the budget.
P40s are no longer $150. On eBay, they're at least $250 from China, or $400 for US sellers
And the latest AMD cards for reference?
Also, some DeepSeek models would be cool.
Coming I hope, though I wouldn't have huge expectations.
He already did the general compute benchmark of the two 9070 cards here[1], and between poorly optimized drivers and GDDR6's lower memory bandwidth, I wouldn't expect any great scores.
In terms of memory bandwidth they're between a 4070 and a 4070 Ti SUPER, and given that LLMs are very memory-bandwidth constrained as I mentioned in another comment, at best you'd expect the LLM score to end up between the 4070 and the 4070 Ti SUPER.
[1]: https://www.phoronix.com/review/amd-radeon-rx9070-linux-comp...
The 9070 cards are hard to find, and inflated 30% above MSRP right now. Apparently the supply of fuse-crippled 9070 cards was just used to up-sell the 9070 XT (easier to find).
Have a look at the GamersNexus YT rant about it... They make a fair argument in that price range an older model used RTX nvidia card may be a better value.
Depends on your use case, as the RTX 5090's Nvidia AI frame interpolation is dog-crap hype for CGI or CUDA-accelerated ML libraries.
Personally, I would go with the RTX 4090 or even an RTX 3090 with 24GB VRAM for an ML and CGI workstation, as CUDA+OptiX has better software support. For just gaming, the 9070 XT is a better deal when the MSRP is within range. Depends how willing you are to get ripped off by scalper prices right now. lol =3
Is llama.cpp's CUDA implementation decent? (e.g. does it use CUTLASS properly or something more low level)
The implementation's here: https://github.com/ggml-org/llama.cpp/tree/master/ggml/src/g...
Great to see Mr. Larabel@Phoronix both maintaining consistently legit reporting and still having time for one-offs like this, in these times of AI slop with other OG writers either quitting or succumbing to the vortex. Hats off!
TL;DR: performance isn't bad, but perf per watt isn't better than the 4080 or 4090, and can even be significantly lower than the 4090 in certain contexts.
I think that's underselling it. Performance is good, up by a significant margin, and the VRAM boost is well worth it. There's just no efficiency gain to go along with it.
It looks to me like you could think about it as a performance/VRAM/convenience stepping-stone between having one 4090 and having a pair.
Paired 5090s, if such a thing is possible, sounds like a very good way to spend a lot of money very quickly while possibly setting things on fire, and you'd have to have a good reason for that.
They didn't try power limiting the cards.