About what I expected. The Jetson series had the same issues, mostly, at a smaller scale: Deviate from the anointed versions of YOLO, and nothing runs without a lot of hacking. Being beholden to CUDA is both a blessing and a curse, but what I really fear is how long it will take for this to become an unsupported golden brick.
Also, the other reviews I’ve seen point out that inference speed is slower than a 5090 (or on par with a 4090 with some tailwind), so the big difference here (other than core counts) is the large chunk of “unified” memory. Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if RoCm is… well… not all there yet).
Curious to compare this with cloud-based GPU costs, or (if you really want on-prem and fully private) the returns from a more conventional rig.
It's notable how much easier it is to get things working now that the embargo has lifted and other projects have shared their integrations.
I'm running VLLM on it now and it was as simple as:
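Roughly the following; treat it as a sketch, since the exact image tag is whatever the NGC recipe linked below currently lists:

    # Sketch of the NGC recipe: the image path follows the catalog entry linked below,
    # and <tag> stands in for whatever release tag the recipe specifies.
    sudo docker run --gpus all -it --rm \
      -p 8000:8000 \
      nvcr.io/nvidia/vllm:<tag>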
(That recipe from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v... ) And then in the Docker container:
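A minimal version of that step, assuming the container ships vLLM's standard CLI, is just:

    # Start the OpenAI-compatible server on port 8000; with no model argument,
    # recent vLLM builds fall back to a small default model.
    vllm serve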
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load.
I'm curious, does its architecture support all CUDA features out of the box, or is it limited compared to the 5090/6000 Blackwell?
Despite the large memory capacity, its memory bandwidth is very low, so I'd guess decode speed will be very slow. Of course, this design is well suited to the inference needs of MoE models.
Is 128 GB of unified memory enough? I've found that the smaller models are great as toys but useless for anything realistic. Will 128 GB hold any model that you can do actual work with, or query for answers that return useful information?
There are several 70B+ models that are genuinely useful these days.
I'm looking forward to GLM 4.6 Air - I expect that one should be pretty excellent, based on experiments with a quantized version of its predecessor on my Mac. https://simonwillison.net/2025/Jul/29/space-invaders/
The question is: how does prompt processing time on this compare to the M3 Ultra? That one sucks at RAG, even though it can technically handle huge models and long contexts...
I wonder how this compares financially with renting something on the cloud.
This seems to be missing the obligatory pelican on a bicycle.
Here's one I made with it - I didn't include it in the blog post because I had so many experiments running that I lost track of which model I'd used to create it! https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...
That seat post looks fairly unpleasant.
The whole thing feels like a paper launch being propped up by people chasing blog traffic while missing the point.
I'd be pissed if I paid this much for hardware and the performance was this lacklustre while also being kneecapped for training.
When the networking is 25 GB/s and the memory bandwidth is 210 GB/s, you know something is seriously wrong.
I'm hopeful this makes Nvidia take aarch64 seriously for Jetson development. For the past several years Mac-based developers have had to run the flashing tools in unsupported ways, in virtual machines with strange QEMU options.
> even in a Docker container
I should be allowed to do stupid things when I want. Give me an override!
A couple of people have since tipped me off that this works around that:
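The tip, as I understand it, is an environment variable that makes Claude Code believe it's already sandboxed:

    # Reportedly convinces Claude Code the session is sandboxed, so it stops
    # refusing to run with --dangerously-skip-permissions as root.
    IS_SANDBOX=1 claude --dangerously-skip-permissions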
You can run that as root and Claude won't complain.
The reported 119 GB vs. the 128 GB in the spec is just a units mismatch: 128 GB (counting in 10^9 bytes) equals about 119 GiB (counting in 2^30 bytes).
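A quick one-liner to check that arithmetic, assuming python3 is on the box:

    python3 -c 'print(128e9 / 2**30)'   # ~119.2, which the system reports as "119Gi"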
That can't be right because RAM has always been reported in binary units. Only storage and networking use lame decimal units.
Looks like Claude reported it based on this:
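Most likely output along the lines of free -h, which reports totals in binary (Gi) units:

    free -h   # assumed command, not confirmed; the Mem total column is where the 119Gi figure shows up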
That 119Gi is indeed gibibytes, and 119 GiB in GB is 128 GB.
Ugh, that one gets me every time!
As is usual for NVidia: great hardware, and an effing nightmare figuring out how to set up the pile of crap they call software.
If you think their software is bad, try using any other vendor's; it makes Nvidia look amazing. Apple is the only one that comes close.
Although a bit off the GPU topic, I think Apple's Rosetta is the smoothest binary transition I've ever used.
Try using Intel or AMD stuff instead.
And yet CUDA has looked way better than ATi/AMD offerings in the same area, despite ATi/AMD technically being first to deliver GPGPU. (The major difference is that CUDA arrived a year later but supported everything from the G80 up and evolved nicely, while AMD managed to ship multiple platforms with patchy support and total rewrites in between.)
What was the AMD GPGPU called?
Except the performance people are seeing is way below expectations. It seems to be slower than an M4. Which kind of defeats the purpose. It was advertised as 1 Petaflop on your desk.
But maybe this will change? Software issues somehow?
It also runs CUDA, which is useful.
It fits bigger models, and you can stack them.
Plus, apparently some of the early benchmarks were made with Ollama and should be disregarded.
More discussion: https://news.ycombinator.com/item?id=45575127