Finally, in this post, we put our money where our mouth is. We dragged a 2016 enterprise relic out of the closet — NAY, out of the grave, a single Intel Xeon running on agonizingly slow DDR3 RAM with absolutely no GPU to speak of — and forced it to run a cutting-edge, 26-billion-parameter Mixture-of-Experts architecture at reading speed. We did without throwing exotic hardware at the problem. Instead we treated the deployment pipeline as a serious thing, and mapped the architecture directly to physical hardware, tuning memory allocation, and unlocking the absolute limits of CPU cache optimization.
A 10 year old Xeon is all you need - point.free
Or running Gemma 4 on a 2016 Xeon with no GPU, 25 flags, 128 GB of DDR3, and a 25B-parameter MoE.

Very cool post. I'm may go necro up some old servers and mess with brining them back up to speed now.

I wonder if, by pooling resources, something like this could start within the old IT-guard communities, spread, and begin to tip the scales between homespun infrastructure and membership-based services.

How does that then affect the open model ecosystem? From my understanding we still need the data center footprint at that point, but the last mile ( you typing something in ) could* be converted in maybe 50% of use cases?