I think it's a good idea for everyone to download and be able to run an LLM locally, even if your hardware only meets the minimum requirements, as a pseudo-backup of a large chunk of human knowledge.
Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write-up and the contributions to llama.cpp and the community more broadly.
It's fascinating to me that, coming up on a year since Sapphire Rapids became available in the public cloud, developers are still targeting AVX512 when they should be targeting VNNI and AMX.
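For readers unfamiliar with the distinction, here is a hedged sketch of what targeting VNNI means in practice (illustrative, not from the article): a single VNNI instruction fuses the int8 multiply-accumulate that plain AVX512 needs several instructions to emulate.

```cpp
#include <immintrin.h>

// Hedged sketch (not from the article): one AVX512-VNNI instruction,
// vpdpbusd, multiplies 64 u8*s8 pairs and accumulates the partial sums
// into 16 int32 lanes, i.e. the fused multiply-accumulate step that
// int8-quantized inference wants. Build with -mavx512vnni.
__m512i dot_accumulate_u8s8(__m512i acc, __m512i a_u8, __m512i b_s8) {
    // each 32-bit lane of acc gains the sum of four adjacent u8*s8 products
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```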
A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and never develop the capacity to cure themselves of this, they are still more knowledgeable, and capable of a more reasoned response to more questions, despite their capacity for error, than any single human being that has ever lived.
> You don't need a large computer to run a large language model
While running TinyLlama does indeed count as running a language model, I’m skeptical that the capabilities you get from doing so match what most people would consider a baseline requirement for usefulness.
Running a 10-parameter model is also “technically” running an LM, and I can do that by hand with a piece of paper.
That doesn’t mean “you don’t need a computer to run an LM”…
I’m not sure where an LM becomes an LLM, but… I personally think it’s more about capability than parameter count.
I don’t realllly believe you can do a lot of useful LLM work on a Pi.
That's interesting, because I built a simple ANN library, played around with GPU acceleration, and came to a similar conclusion as this article.
To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspected the gain was only marginal because, the way it's set up in my library, it has to load all the values onto the GPU from RAM for each pass of forward and back propagation in each layer during training. I believe there is a way to allocate memory on the GPU itself and keep it there, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).
But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected to see at least 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time which was a relatively deep network. It makes sense since the different layers cannot be parallelized as the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have, the less you can benefit from GPU acceleration... And unfortunately, deep networks also happen to be those which tend to perform best for a lot of use cases.
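A minimal sketch of the structural problem the parent describes (the types and names are hypothetical, not from their library): forward propagation is a strict dependency chain, so a GPU can only parallelize the math within a layer, and a naive port pays the host-to-device transfer once per layer per pass.

```cpp
#include <vector>

// Hypothetical layer type; forward() stands in for one matmul + activation.
struct Layer {
    std::vector<float> forward(const std::vector<float>& in) const {
        return in;  // placeholder compute
    }
};

// Forward propagation is a strict chain: layer i cannot start until
// layer i-1 has finished, so extra depth adds serial steps, and a naive
// GPU port also pays a host-to-device copy here once per layer per pass.
std::vector<float> forward_pass(const std::vector<Layer>& net,
                                std::vector<float> x) {
    for (const Layer& layer : net)
        x = layer.forward(x);  // hard dependency on the previous output
    return x;
}
```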
Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).
This is great work. I've always thought it would be great if running LLMs could be commoditized for regular average-Joe hardware. I had thought that llamafile was like a Dockerfile for llama.cpp, but it looks like I was mistaken?
> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS
If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and using Vulkan or Metal compute shaders directly. Am I correct?
"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.
The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.
Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.
Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
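To make the "it's just a flag away" argument concrete, here is a minimal C++ analogue (C++ because that's what llamafile's kernels use; the function name is illustrative) of what the commenter proposes for the Fortran source: a naive SGEMM-style triple loop that the compiler can vectorize and unroll on its own with flags like `-O3 -march=native -funroll-all-loops`, plus one OpenMP pragma for threading (build with `-fopenmp`):

```cpp
// Naive column-major SGEMM-style kernel (reference-BLAS layout): C = A*B.
// With -O3 -march=native -funroll-all-loops the compiler applies the AVX/FMA
// and unrolling transforms itself; the one pragma adds threading.
void sgemm_naive(int m, int n, int k,
                 const float* A, const float* B, float* C) {
    #pragma omp parallel for
    for (int j = 0; j < n; ++j)          // columns of C, independent work
        for (int i = 0; i < m; ++i) {    // rows of C
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)  // dot product of A row i, B column j
                acc += A[l * m + i] * B[j * k + l];
            C[j * m + i] = acc;
        }
}
```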
Strange title. On my first read I thought the author was arguing that the model is now faster on CPU than on GPU. It would be much nicer if they had titled this something closer to "Performance Improvement for LLaMA on CPU".
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of magnitude faster than Python. That's twenty years of progress per Moore's law.
This is great. I love the idea of measuring performance differences in “years of Moore’s law.”
Twenty years puts the delta in an easy to understand framework.
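The arithmetic behind that framing, assuming the classic two-year doubling period (the Python figure is simply the quoted three orders of magnitude below 47 GFLOPS):

$$\frac{47\ \text{GFLOPS}}{\approx 0.047\ \text{GFLOPS}} \approx 10^{3} \approx 2^{10}, \qquad 10\ \text{doublings} \times 2\ \text{years} = 20\ \text{years}.$$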
> One important thing to know if you're considering buying a Mac Studio is that, like the Windows Executive, XNU does a really good job keeping your desktop stable, and that means protecting your system from you. It takes me 45 seconds on Mac Studio to compile the Cosmo monorepo, due to all these safety features; but if I fork bombed it, I'd be surprised if Netflix skipped a single frame.
Clearly nobody actually tried this, because if you fork bomb XNU the system reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
I contend that most human knowledge is not written down, or, if it is written down, is not publicly available on the internet, and so does not exist in these datasets.
There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.
If you want to download a backup of a large chunk of human knowledge... download wikipedia. It's a similar size to a small LLM and can actually distinguish between real life and fantasy: https://en.wikipedia.org/wiki/Wikipedia:Database_download
If you just want to play around with an LLM though, absolutely.
This project in particular seems to care about the long tail of hardware; note that the very first machine in this post is a box from 2020 with a spinning-rust disk. Granted, adding support for newer extensions is likely also good, but cost/benefit is in play.
I strongly recommend that people run LLMs locally for a different reason.
The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.
This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.
>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references
It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page misses, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if that gives better spatial locality, and so on.
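For illustration, here is a minimal sketch (not the article's kernel) of two tricks from that list: transposing one matrix so the inner loop reads both operands with unit stride, and blocking the loops so each tile stays cache-resident.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative square row-major matmul showing two tricks from the list:
// B is transposed once so the inner loop walks both operands with unit
// stride (spatial locality), and the i/j loops are blocked so each tile
// stays cache-resident (fewer L1/L2/TLB misses). BLK is a tuning knob.
void matmul_blocked(std::size_t n, const float* A, const float* B, float* C) {
    const std::size_t BLK = 64;
    std::vector<float> Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    for (std::size_t ib = 0; ib < n; ib += BLK)
        for (std::size_t jb = 0; jb < n; jb += BLK)
            for (std::size_t i = ib; i < std::min(ib + BLK, n); ++i)
                for (std::size_t j = jb; j < std::min(jb + BLK, n); ++j) {
                    float acc = 0.0f;
                    for (std::size_t l = 0; l < n; ++l)   // both reads unit stride
                        acc += A[i * n + l] * Bt[j * n + l];
                    C[i * n + j] = acc;
                }
}
```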
Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.
TinyLlama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.
I don't believe that is the target for a local LLM... Pretty sure we're talking about client-side computing, where even the newest hardware supports only AVX-512 (and even that sketchily on Intel's side).
Newer models have repeatedly been shown to have performance comparable to larger ones. And the Mixture of Experts architecture makes it possible to train large models that selectively activate only the parts relevant to the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster at processing content retrieved by RAG. Via the same mechanism, they could also call on larger, more powerful models for tasks that exceed their capabilities.
We shouldn't choose LLMs for how many facts they store, but for their capability to process human language. There is some overlap between the two, but an LLM that simply doesn't know something can always be augmented with RAG.
Yes, this is really a phenomenal effort, and it's what open source is about: bringing improvements to so many use cases, so that Intel and AMD chip users can take advantage of their hardware's high-performance capabilities, making even old parts competitive.
Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not like C code on CPUs.
llama.cpp (or rather G. Gerganov et al.) is trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, or whether jart intends to upstream these into llama.cpp, which still seems to be the underlying tech behind llamafile.
I couldn't disagree more: turning the temp to zero is like taking a Monte Carlo method and only using one sample, or a particle filter with only one particle. It takes the entire concept and throws it out the window so you can have predictability.
LLMs need to probabilistically explore the generation domain to converge on a good result for best performance. Similar issue with people benchmarking models by having them output one single token (e.g. yes or no) outright, which prevents any real computation from occurring, so the results are predictably poor.
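For concreteness, a minimal sketch of the mechanism under debate (the function name is illustrative): temperature divides the logits before the softmax, so as temp approaches zero the distribution collapses onto the argmax token, which is the "one sample" degenerate case described above.

```cpp
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <cstddef>

// Temperature sampling sketch: scale logits by 1/T, softmax, then draw.
// As T shrinks toward 0, nearly all probability mass lands on the max
// logit, the deterministic "one sample" case; callers wanting T == 0
// should just take the argmax instead of dividing by zero.
int sample_token(const std::vector<float>& logits, float temperature,
                 std::mt19937& rng) {
    std::vector<float> w(logits.size());
    float max_logit = *std::max_element(logits.begin(), logits.end());
    for (std::size_t i = 0; i < logits.size(); ++i)
        w[i] = std::exp((logits[i] - max_logit) / temperature);  // stable softmax weights
    std::discrete_distribution<int> dist(w.begin(), w.end());    // normalizes the weights
    return dist(rng);
}
```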
I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.
It gave me a new mental model for LLMs: rather than "spicy autocomplete" or whatever, I now think of them as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.
I'm not sure how true that is anymore, from the outside it seems they're at least moving to a CPU/GPU hybrid (which makes a lot of sense), at least judging by new features landing in RenderMan that continues to add more support for GPUs (like XPU).
The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.
DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.
Using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% of peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to the cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high-performance GEMMs.
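As one concrete example of what lies beyond compiler flags, here is a hedged sketch (illustrative, not the article's code) of a register-tiled AVX2/FMA micro-kernel: a strip of C lives in vector registers for the entire k loop, so the hot loop is mostly fused multiply-adds rather than loads and stores. Build with `-mavx2 -mfma`.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of a 4x8 register-tiled micro-kernel (row-major, assumes n >= 8).
// Four C accumulators live in ymm registers for the whole k loop, so each
// iteration costs one B load and four A broadcasts; this degree of
// register reuse is the kind of thing no unroll flag will produce.
void micro_kernel_4x8(std::size_t k, std::size_t n,
                      const float* A, const float* B, float* C) {
    __m256 c0 = _mm256_setzero_ps(), c1 = _mm256_setzero_ps();
    __m256 c2 = _mm256_setzero_ps(), c3 = _mm256_setzero_ps();
    for (std::size_t l = 0; l < k; ++l) {
        __m256 b = _mm256_loadu_ps(B + l * n);  // 8-wide strip of row l of B
        c0 = _mm256_fmadd_ps(_mm256_set1_ps(A[0 * k + l]), b, c0);
        c1 = _mm256_fmadd_ps(_mm256_set1_ps(A[1 * k + l]), b, c1);
        c2 = _mm256_fmadd_ps(_mm256_set1_ps(A[2 * k + l]), b, c2);
        c3 = _mm256_fmadd_ps(_mm256_set1_ps(A[3 * k + l]), b, c3);
    }
    _mm256_storeu_ps(C + 0 * n, c0);  // write the 4x8 tile of C once, at the end
    _mm256_storeu_ps(C + 1 * n, c1);
    _mm256_storeu_ps(C + 2 * n, c2);
    _mm256_storeu_ps(C + 3 * n, c3);
}
```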