I think it's a good idea for everyone to download and be able to run an LLM locally, even if your hardware only meets the minimum requirements, as a pseudo-backup of a large chunk of human knowledge.
Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write-up and the contributions to llama.cpp and the community more broadly.
It's fascinating to me that, coming up on a year since Sapphire Rapids became available in the public cloud, developers are still targeting AVX512 when they should be targeting VNNI and AMX.
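For readers unfamiliar with the distinction, here is a hedged sketch of what targeting VNNI means in practice (illustrative, not from the article): a single VNNI instruction fuses the int8 multiply-accumulate that plain AVX512 needs several instructions to emulate.

```cpp
#include <immintrin.h>

// Hedged sketch (not from the article): one AVX512-VNNI instruction,
// vpdpbusd, multiplies 64 u8*s8 pairs and accumulates the partial sums
// into 16 int32 lanes, i.e. the fused multiply-accumulate step that
// int8-quantized inference wants. Build with -mavx512vnni.
__m512i dot_accumulate_u8s8(__m512i acc, __m512i a_u8, __m512i b_s8) {
    // each 32-bit lane of acc gains the sum of four adjacent u8*s8 products
    return _mm512_dpbusd_epi32(acc, a_u8, b_s8);
}
```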
A way of thinking about what's inside any of the top LLMs right now: even if they never learn another single fact, even if they get ridiculously out of date as a result, even if they are even more riddled with errors and prone to biases than we know them to be, even if they are as prone to hallucinations as we know they are and never develop the capacity to cure themselves of this, they are still more knowledgeable, and capable of a more reasoned response to more questions, despite their capacity for error, than any single human being that has ever lived.
> You don't need a large computer to run a large language model
While running TinyLlama does indeed count as running a language model, I’m skeptical that the capabilities you get from doing so match what most people would consider a baseline requirement for usefulness.
Running a 10-parameter model is also “technically” running an LM, and I can do that by hand with a piece of paper.
That doesn’t mean “you don’t need a computer to run an LM”…
I’m not sure where an LM becomes an LLM, but… I personally think it’s more about capability than parameter count.
I don’t realllly believe you can do a lot of useful LLM work on a Pi.
That's interesting, because I built a simple ANN library, played around with GPU acceleration, and came to a similar conclusion as this article.
To be fair, my ANN library was faster (up to 2x) with GPU acceleration in some scenarios where the ANN was shallow (as opposed to deep, with many hidden layers). I suspected the gain was only marginal because, the way it's set up in my library, it has to load all the values onto the GPU from RAM for each pass of forward and back propagation in each layer during training. I believe there is a way to allocate memory on the GPU itself and keep it there, but it's a lot more challenging to do, especially in a modular, fully portable way (which was one of the goals of my library).
But anyway, even the 2x best-case figure seemed disappointing. In my mind, I expected to see at least 10x speed improvement... And I was surprised that the CPU version was actually slightly faster in the scenario I was testing at the time which was a relatively deep network. It makes sense since the different layers cannot be parallelized as the input of one layer depends on the output of the previous layer... So the more layers you have, the more serial bottlenecks you have, the less you can benefit from GPU acceleration... And unfortunately, deep networks also happen to be those which tend to perform best for a lot of use cases.
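A minimal sketch of the structural problem the parent describes (the types and names are hypothetical, not from their library): forward propagation is a strict dependency chain, so a GPU can only parallelize the math within a layer, and a naive port pays the host-to-device transfer once per layer per pass.

```cpp
#include <vector>

// Hypothetical layer type; forward() stands in for one matmul + activation.
struct Layer {
    std::vector<float> forward(const std::vector<float>& in) const {
        return in;  // placeholder compute
    }
};

// Forward propagation is a strict chain: layer i cannot start until
// layer i-1 has finished, so extra depth adds serial steps, and a naive
// GPU port also pays a host-to-device copy here once per layer per pass.
std::vector<float> forward_pass(const std::vector<Layer>& net,
                                std::vector<float> x) {
    for (const Layer& layer : net)
        x = layer.forward(x);  // hard dependency on the previous output
    return x;
}
```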
Nice to see such speedups for CPUs. Are these changes available as a branch or pull request in llama.cpp itself? I'd like to make use of them in that form if possible (as I'm used to using that).
This is great work. I've always thought it would be great if running LLMs could be commoditized for regular average-Joe hardware. I had thought that llamafile was like a Dockerfile for llama.cpp, but it looks like I was mistaken?
> I learned how to write math kernels by renting Vast VMs and watching Gautham Venkatasubramanian and mrdomino develop CUDA kernels in a tmux session. They've been focusing on solving a much more important challenge for llamafile, which is helping it not have a mandatory dependency on the cuBLAS
If I'm reading this right, they're trying to rewrite cuBLAS within CUDA itself. I'm guessing the next step would be removing the CUDA dependency and using Vulkan or Metal compute shaders directly. Am I correct?
"Here we see that, despite only being twice the price, the 7995WX x86 ISA offers 7x more raw compute power than the M2 Ultra ARM ISA, and nearly the same token generation speed, which is likely thanks to its 384mb L3 cache. When I bought this chip, I had to expand support in llama.cpp for bfloat16 and AVX512 before I could fully test its capabilities. My work means you can now run LLaMA 2.8x faster on Zen4 than you could before."
There is an implication here that the Fortran implementation of `SGEMM` is somehow inadequate. But any modern Fortran compiler will quite easily apply the AVX and FMA optimizations presented here without any additional changes. Both GNU and Intel make these substitutions with the correct flags.
The unrolling optimization is also just another flag away (`-funroll-all-loops`). The Intel Compiler will even do this without prompting. In fact, it appears to only do a modest 2x unroll on my machine, suggesting that the extreme unroll in this article would have been overkill.
Parallelization is certainly a lot to ask of Fortran 77 source, but there is little stopping you from adding OpenMP directives to the `SGEMM` function. In fact, modern Fortran even offers its own parallelization constructs if you're willing to go there.
Which is to say: Let's not belittle this old Fortran 77 function. Yes it is old, and does not even resemble modern Fortran. But the whole point of Fortran is to free the developer from these platform-specific details, and hand the job off to the compiler. If you don't like that approach, then you're welcome to go to C or C++. But this little block of Fortran code is already capable of doing just about everything in this article.
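To make the "it's just a flag away" argument concrete, here is a minimal C++ analogue (C++ because that's what llamafile's kernels use; the function name is illustrative) of what the commenter proposes for the Fortran source: a naive SGEMM-style triple loop that the compiler can vectorize and unroll on its own with flags like `-O3 -march=native -funroll-all-loops`, plus one OpenMP pragma for threading (build with `-fopenmp`):

```cpp
// Naive column-major SGEMM-style kernel (reference-BLAS layout): C = A*B.
// With -O3 -march=native -funroll-all-loops the compiler applies the AVX/FMA
// and unrolling transforms itself; the one pragma adds threading.
void sgemm_naive(int m, int n, int k,
                 const float* A, const float* B, float* C) {
    #pragma omp parallel for
    for (int j = 0; j < n; ++j)          // columns of C, independent work
        for (int i = 0; i < m; ++i) {    // rows of C
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)  // dot product of A row i, B column j
                acc += A[l * m + i] * B[j * k + l];
            C[j * m + i] = acc;
        }
}
```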
Strange title. On my first read I thought the author was arguing that the model is now faster on CPU than on GPU. It would be much nicer if they had titled this something closer to "Performance Improvement for LLaMA on CPU".
As someone who has tried to beat MKL-DNN, and was unsuccessful at doing so even for constrained matrix sizes, I’m curious how they pulled off such a massive improvement.
But as someone who routinely estimates picojoules per flop at $DAY_JOB - there’s simply no way this is energy efficient. That is not even physically possible with a CPU.
> I like to define my subroutines using a modern language like C++, which goes 47 gigaflops. This means C++ is three orders of magnitude faster than Python. That's twenty years of progress per Moore's law.
This is great. I love the idea of measuring performance differences in “years of Moore’s law.”
Twenty years puts the delta in an easy to understand framework.
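The arithmetic behind that framing, assuming the classic two-year doubling period (the Python figure is simply the quoted three orders of magnitude below 47 GFLOPS):

$$\frac{47\ \text{GFLOPS}}{\approx 0.047\ \text{GFLOPS}} \approx 10^{3} \approx 2^{10}, \qquad 10\ \text{doublings} \times 2\ \text{years} = 20\ \text{years}.$$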
> One important thing to know if you're considering buying a Mac Studio is that, like the Windows Executive, XNU does a really good job keeping your desktop stable, and that means protecting your system from you. It takes me 45 seconds on Mac Studio to compile the Cosmo monorepo, due to all these safety features; but if I fork bombed it, I'd be surprised if Netflix skipped a single frame.
Clearly nobody actually tried this, because if you fork bomb XNU the system reliably goes down every single time. There are no "safety features" here, just extra overhead when spawning processes.
I contend that most human knowledge is not written down, or, if it is written down, is not publicly available on the internet, and so does not exist in these datasets.
There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.
If you want to download a backup of a large chunk of human knowledge... download wikipedia. It's a similar size to a small LLM and can actually distinguish between real life and fantasy: https://en.wikipedia.org/wiki/Wikipedia:Database_download
If you just want to play around with an LLM though, absolutely.
This project in particular seems to care about the long tail of hardware; note that the very first machine in this post is a box from 2020 with a spinning-rust disk. Granted, adding support for newer extensions is likely also good, but cost/benefit is in play.
I strongly recommend that people run LLMs locally for a different reason.
The ones you can run on your own machine tend to be bad - really bad. They hallucinate wildly and fail at all sorts of tasks that the larger hosted ones succeed at.
This makes them a fantastic tool for learning more about how LLMs work and what they're useful for. Interacting with a weak-but-functional LLM that runs on your own computer is a great way to get a much more solid mental model for what these things actually are.
>> I believe the trick with CPU math kernels is exploiting instruction level parallelism with fewer memory references
It's the collection of tricks to minimize all sorts of cache misses (L1, L2, TLB, page misses, etc.), improve register reuse, leverage SIMD instructions, transpose one of the matrices if that gives better spatial locality, and so on.
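For illustration, here is a minimal sketch (not the article's kernel) of two tricks from that list: transposing one matrix so the inner loop reads both operands with unit stride, and blocking the loops so each tile stays cache-resident.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative square row-major matmul showing two tricks from the list:
// B is transposed once so the inner loop walks both operands with unit
// stride (spatial locality), and the i/j loops are blocked so each tile
// stays cache-resident (fewer L1/L2/TLB misses). BLK is a tuning knob.
void matmul_blocked(std::size_t n, const float* A, const float* B, float* C) {
    const std::size_t BLK = 64;
    std::vector<float> Bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    for (std::size_t ib = 0; ib < n; ib += BLK)
        for (std::size_t jb = 0; jb < n; jb += BLK)
            for (std::size_t i = ib; i < std::min(ib + BLK, n); ++i)
                for (std::size_t j = jb; j < std::min(jb + BLK, n); ++j) {
                    float acc = 0.0f;
                    for (std::size_t l = 0; l < n; ++l)   // both reads unit stride
                        acc += A[i * n + l] * Bt[j * n + l];
                    C[i * n + j] = acc;
                }
}
```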
Maybe I'm seeing things through a modern lens, but if I were trying to restart civilization and was only left with ChatGPT, I would be enraged and very much not grateful for this.
TinyLlama isn't going to be doing what ChatGPT does, but it still beats the pants off what we had for completion or sentiment analysis 5 years ago. And now a Pi can run it decently fast.
I don't believe that is the target for a local LLM... Pretty sure we're talking about client-side computing, where even the newest hardware supports only AVX-512 (and even that sketchily on Intel's side).
Newer models have repeatedly been shown to have performance comparable to larger ones. And the Mixture of Experts architecture makes it possible to train large models that selectively activate only the parts relevant to the current context, which drastically reduces compute demand. Smaller models can also level the playing field by being faster at processing content retrieved by RAG. Via the same mechanism, they could also call on larger, more powerful models for tasks that exceed their capabilities.
We shouldn't choose LLMs for how many facts they store, but for their capability to process human language. There is some overlap between the two, but an LLM that simply doesn't know something can always be augmented with RAG.
Yes, this is really a phenomenal effort, and it's what open source is about: bringing improvements to so many use cases, so that Intel and AMD chip users can take advantage of their hardware's high-performance capabilities, making even old parts competitive.
Yes, but none of these have performance portability across GPU vendors, so it's probably seen as pointless. You would need an AMD Vulkan shader, an Nvidia one, an Intel one, etc. It's not like C code on CPUs.
llama.cpp (or rather G. Gerganov et al.) is trying to avoid cuBLAS entirely, using its own kernels. Not sure how jart's effort relates, or whether jart intends to upstream these into llama.cpp, which still seems to be the underlying tech behind llamafile.
I couldn't disagree more: turning the temp to zero is like taking a Monte Carlo method and only using one sample, or a particle filter with only one particle. It takes the entire concept and throws it out the window so you can have predictability.
LLMs need to probabilistically explore the generation domain to converge on a good result for best performance. Similar issue with people benchmarking models by having them output one single token (e.g. yes or no) outright, which prevents any real computation from occurring, so the results are predictably poor.
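For concreteness, a minimal sketch of the mechanism under debate (the function name is illustrative): temperature divides the logits before the softmax, so as temp approaches zero the distribution collapses onto the argmax token, which is the "one sample" degenerate case described above.

```cpp
#include <vector>
#include <cmath>
#include <random>
#include <algorithm>
#include <cstddef>

// Temperature sampling sketch: scale logits by 1/T, softmax, then draw.
// As T shrinks toward 0, nearly all probability mass lands on the max
// logit, the deterministic "one sample" case; callers wanting T == 0
// should just take the argmax instead of dividing by zero.
int sample_token(const std::vector<float>& logits, float temperature,
                 std::mt19937& rng) {
    std::vector<float> w(logits.size());
    float max_logit = *std::max_element(logits.begin(), logits.end());
    for (std::size_t i = 0; i < logits.size(); ++i)
        w[i] = std::exp((logits[i] - max_logit) / temperature);  // stable softmax weights
    std::discrete_distribution<int> dist(w.begin(), w.end());    // normalizes the weights
    return dist(rng);
}
```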
I had downloaded some LLMs to run locally just to experiment when a freak hailstorm suddenly left me without internet for over a week. It was really interesting to use a local LLM as a replacement for Google.
It gave me a new mental model for LLMs: rather than "spicy autocomplete" or whatever, I now think of them as "a lossy compressed database of knowledge". Like you ran the internet through JPEG at 30% quality.
I'm not sure how true that is anymore, from the outside it seems they're at least moving to a CPU/GPU hybrid (which makes a lot of sense), at least judging by new features landing in RenderMan that continues to add more support for GPUs (like XPU).
The Fortran implementation is just a reference implementation. The goal of reference BLAS [0] is to provide relatively simple and easy to understand implementations which demonstrate the interface and are intended to give correct results to test against. Perhaps an exceptional Fortran compiler which doesn't yet exist could generate code which rivals hand (or automatically) tuned optimized BLAS libraries like OpenBLAS [1], MKL [2], ATLAS [3], and those based on BLIS [4], but in practice this is not observed.
Justine observed that the threading model for LLaMA makes it impractical to integrate one of these optimized BLAS libraries, so she wrote her own hand-tuned implementations following the same principles they use.
Modern Fortran's only parallel feature is coarrays, which operate at the whole program level.
DO CONCURRENT is a serial construct with an unspecified order of iterations, not a parallel construct. A DO CONCURRENT loop imposes requirements that allow an arbitrary order of iterations but which are not sufficient for safe parallelization.
Using AVX/FMA and unrolling loops does extremely little in the way of compiling to fast (>80% of peak) GEMM code. These are very much intro steps that don't take into account many important ideas related to the cache hierarchy, uop interactions, and even instruction decode time. The Fortran implementation is entirely and unquestionably inadequate for real high-performance GEMMs.
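As one concrete example of what lies beyond compiler flags, here is a hedged sketch (illustrative, not the article's code) of a register-tiled AVX2/FMA micro-kernel: a strip of C lives in vector registers for the entire k loop, so the hot loop is mostly fused multiply-adds rather than loads and stores. Build with `-mavx2 -mfma`.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch of a 4x8 register-tiled micro-kernel (row-major, assumes n >= 8).
// Four C accumulators live in ymm registers for the whole k loop, so each
// iteration costs one B load and four A broadcasts; this degree of
// register reuse is the kind of thing no unroll flag will produce.
void micro_kernel_4x8(std::size_t k, std::size_t n,
                      const float* A, const float* B, float* C) {
    __m256 c0 = _mm256_setzero_ps(), c1 = _mm256_setzero_ps();
    __m256 c2 = _mm256_setzero_ps(), c3 = _mm256_setzero_ps();
    for (std::size_t l = 0; l < k; ++l) {
        __m256 b = _mm256_loadu_ps(B + l * n);  // 8-wide strip of row l of B
        c0 = _mm256_fmadd_ps(_mm256_set1_ps(A[0 * k + l]), b, c0);
        c1 = _mm256_fmadd_ps(_mm256_set1_ps(A[1 * k + l]), b, c1);
        c2 = _mm256_fmadd_ps(_mm256_set1_ps(A[2 * k + l]), b, c2);
        c3 = _mm256_fmadd_ps(_mm256_set1_ps(A[3 * k + l]), b, c3);
    }
    _mm256_storeu_ps(C + 0 * n, c0);  // write the 4x8 tile of C once, at the end
    _mm256_storeu_ps(C + 1 * n, c1);
    _mm256_storeu_ps(C + 2 * n, c2);
    _mm256_storeu_ps(C + 3 * n, c3);
}
```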