+0.30 Are LLMs not getting better? (entropicthoughts.com) S: +0.20
172 points by 4diii 3 days ago | 157 comments on HN | Mild positive · Contested · Low agreement (3 models) · Editorial · v3.7 · 2026-03-15 22:53:36
Summary Free Expression Advocates
This blog post exercises and advocates for freedom of expression by publishing rigorous statistical critique of peer-reviewed LLM research. The author presents data reanalysis challenging prevailing industry narratives about AI progress, demonstrating intellectual freedom to question consensus and publish contrarian analysis. The content engages directly with Article 19 (freedom of opinion and expression) through its unrestricted dissemination of critical inquiry.
Article Heatmap
Preamble and Articles 1–18: No Data. Article 19: +0.26 — Freedom of Expression. Articles 20–30: No Data.
Aggregates
E +0.30 · S +0.20
Weighted Mean +0.26 Unweighted Mean +0.26
Max +0.26 Article 19 Min +0.26 Article 19
Signal 1 No Data 30
Volatility 0.00 (Low)
Negative 0 Channels E: 0.6 S: 0.4
SETL +0.17 Editorial-dominant
FW Ratio 50% 3 facts · 3 inferences
Agreement Low 3 models · spread ±0.170
Evidence 2% coverage
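The headline Weighted Mean is consistent with combining the two channel scores under the channel weights listed above (E: 0.6, S: 0.4). A minimal sketch; the function name is ours, not the site's:

```python
# Combine per-channel scores using the listed channel weights.
def weighted_mean(scores, weights):
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {"E": 0.30, "S": 0.20}   # Editorial and Structural channel scores
weights = {"E": 0.6, "S": 0.4}    # channel weights as shown above
print(round(weighted_mean(scores, weights), 2))  # → 0.26
```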
Theme Radar
Foundation: 0.00 (0 articles) · Security: 0.00 (0 articles) · Legal: 0.00 (0 articles) · Privacy & Movement: 0.00 (0 articles) · Personal: 0.00 (0 articles) · Expression: 0.26 (1 article) · Economic & Social: 0.00 (0 articles) · Cultural: 0.00 (0 articles) · Order & Duties: 0.00 (0 articles)
HN Discussion 19 top-level · 26 replies
mike_hearn 2026-03-12 12:28 UTC link
That's an interesting claim, but I don't see it in my own work. They have got better but it's very hard to quantify. I just find myself editing their work much less these days (currently using GPT 5.4).
boonzeet 2026-03-12 12:29 UTC link
Interesting article, although with so few data points and such a specific time slice it is difficult to draw serious conclusions about the "improvement" of LLMs.

It's notably lacking newer models (4.5 Opus, 4.6 Sonnet) and models from Gemini.

LLMs appear to naturally progress in short leaps followed by longer plateaus, as breakthroughs are developed such as chain-of-thought, mixture-of-experts, sub-agents, etc.

reedf1 2026-03-12 12:30 UTC link
Given that it is the general consensus that a step function occurred with Opus 4.5/4.6 only 3 months ago - it seems like an insane omission.
Flavius 2026-03-12 12:31 UTC link
> This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Because it's not true. They have improved tremendously in the last year, but it looks like they've hit a wall in the last 3 months. Still seeing some improvements but mostly in skills and token use optimization.

jeffnv 2026-03-12 12:32 UTC link
I don't think it's true, but am I alone in wishing it was? My world is disrupted somewhat but so far I don't think we have a thing that upends our way of life completely yet. If it stayed exactly this good I'd be pretty content.
curiouscube 2026-03-12 12:44 UTC link
There is a decent case for this thesis to hold true, especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore; it's an all-in focus on agentic AI, which is mainly complex post-training regimes.

There are good reasons why they don't or can't do simple param upscaling anymore, but still, it makes me bearish on AGI since it's a slow, but massive shift in goal setting.

In practice this still doesn't mean 50 % of white collar can't be automated though.

antisthenes 2026-03-12 12:48 UTC link
They are getting better, but they are also hitting diminishing returns.

There's only so much data to train on, and we are unlikely to see giant leaps in performance as we did in 2023/2024.

2026-27 will be the years of primarily ecosystem/agentic improvements and reducing costs.

idorozin 2026-03-12 12:50 UTC link
My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively.

When you combine models with tool use, planning loops, agents that break tasks into smaller pieces, and persistent context / repos, the practical capability jump is huge.

sunaurus 2026-03-12 12:53 UTC link
I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.
pu_pe 2026-03-12 12:55 UTC link
Benchmaxxing aside, if you are using those tools for programming on a regular basis it should be self-evident that they are improving. I find it very hard to believe that someone using LLMs today vs what was available one year ago (Claude Code released Feb 2025) would have any difficulty answering this question.
wongarsu 2026-03-12 12:57 UTC link
I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the rest matches what you would expect from a logistic model. Improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...
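The logistic reading above is easy to illustrate with made-up numbers (none of these are METR's data): under a logistic curve, generation-over-generation gains shrink but stay positive.

```python
import math

# Hypothetical logistic capability curve; parameters are invented for
# illustration: p(t) = L / (1 + exp(-k * (t - t0))).
def logistic(t, L=0.9, k=1.2, t0=2.0):
    return L / (1 + math.exp(-k * (t - t0)))

generations = [0, 1, 2, 3, 4]
rates = [logistic(t) for t in generations]
gains = [b - a for a, b in zip(rates, rates[1:])]

# Every generation still improves, but late gains are smaller than the
# peak gain: slowed, not stopped.
assert all(g > 0 for g in gains)
assert gains[-1] < max(gains)
```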

utopiah 2026-03-12 12:59 UTC link
I gave up on trying months ago, you can see the timeline on top of https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...

Truth is I'm probably wrong. I should keep on testing... but at the same time, I gave up precisely because I didn't think the trend was fast enough to justify checking so frequently. Now I just read this kind of post and ask around (mainly arguing with comments, asking for genuine examples that should be "surprising", and being disappointed), and that seems to be enough for a proxy.

I should though, as I mentioned in another comment, keep track of failed attempts.

PS: I check solely on self-hosted models (even if not on my machine but least on machines I could setup) because I do NOT trust the scaffolding around proprietary closed sources models. I can't verify that nobody is in the loop.

sd9 2026-03-12 13:01 UTC link
You really can't model these 5 data points with a linear regression or a step function. The models are of different sizes / use cases, and from two different labs. What we've observed generally is that different labs releasing similarly sized models at similar times end up pretty similar.

I think the only reasonable thing to read into is Sonnet 3.5 -> 3.7 -> 4.5. But yeah, you just can't draw a line through this thing.

I will die on the hill that LLMs are getting better, particularly Anthropic's releases since December. But I can't point at a graph to prove that, I'm just drawing on my personal experience. I do use Claude Code though, so I think a large part of the improvement comes from the harness.
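The point about five data points can be made concrete with invented numbers (not the METR data): with a handful of noisy pass rates, the fitted slope's standard error can exceed the slope itself, so "flat" and "improving" are statistically indistinguishable. A minimal ordinary-least-squares sketch:

```python
# Invented pass rates for five releases; purely illustrative.
xs = [0, 1, 2, 3, 4]                 # release index
ys = [0.30, 0.38, 0.29, 0.41, 0.36]  # pass rate

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Standard error of the slope under the usual OLS assumptions.
se = (sum(r * r for r in residuals) / (n - 2) / sxx) ** 0.5

# With so few points the noise swamps the trend.
assert se > abs(slope)
```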

Havoc 2026-03-12 13:02 UTC link
As they become more capable, people's commits will also become more ambitious.

So I’d say fairly flat commit acceptance numbers make sense even in the context of improving LLMs.

aerhardt 2026-03-12 13:04 UTC link
I feel that two things are true at the same time:

1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture in rather small scopes is unsound. People say AI is “good at front end” but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack rather than properly refactoring the markup).

Two forces feel true simultaneously but in permanent tension. I still cannot make up my mind or see the synthesis in the dialectic: where this is truly going, and whether we’re meaningfully moving forward or mostly moving in circles.

orwin 2026-03-12 13:04 UTC link
I think what happened with static image generation is happening with LLMs. The tools around them are becoming better, but the core AI improvements have stalled: the error rate stays the same (though external tools curate the results, so it won't be noticeable if you don't run your own model), and accuracy is still slightly improving, but more and more slowly, never reaching the 'perfect' point. Basically Stable Diffusion in early 2025.
BoppreH 2026-03-12 13:31 UTC link
Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. That also explains why training on more average-quality data (if there's any left) is not producing improvements.

But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support via phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar.

Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric.

I expect future improvements to come from harness improvements, especially around sub agents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.

Incipient 2026-03-12 16:09 UTC link
I feel even if the models are stagnating, the tooling around them, and the integrations and harnesses they have are getting significantly more capable (if not always 'better' - the recent vscode update really handicapped them for some reason). Things like the new agent from booking.com or whatever, if it could integrate with all hotels, activities, mapping tools, flight system, etc could be hugely powerful.

Assuming we get no better than opus 4.6, they're very capable. Even if they make up nonsense 5% of the time!

globular-toast 2026-03-13 06:51 UTC link
I reckon LLM merge rates will go up, but not necessarily due to quality improvements. Instead I think maintainers will just become fatigued. The amount of code I'm expected to review now is way higher than before. And while I'm reviewing you know more is being generated. I'm sure I've let through more crap due to this fatigue attack on me.
nkozyra 2026-03-12 12:33 UTC link
The problem with evals is the underlying rubric will always be either subjective, or a quantitative score based on something that is likely now baked into the training set directly.

You kind of have to go on "feels" for a lot of this.

dwedge 2026-03-12 12:34 UTC link
Without meaning to sound dismissive, because I'm really not intending to, there's also the possibility that you've gotten worse after enough time using them. You're treating yourself as a constant in this, but man cannot walk in the same river twice.
jeremyjh 2026-03-12 12:35 UTC link
This has been the general consensus for about three years now. "Drastic increases in capability have happened the last 3-6 months" have been a constant refrain.

Without any data from the study past September, I think it's not an unreasonable claim, if you want to make an argument based on evidence.

For me personally, I agree with you, I'm really seeing it as well.

Toutouxc 2026-03-12 12:35 UTC link
There's a consensus that SOMETHING changed with Opus 4.5. It might have been the "merge rates" metric, it might have not.

I'm certainly getting faster and cleaner-looking solutions for certain issues on Opus 4.6 than I was 5 months ago, but I'm not sure about the ability to solve (or even weigh in on) the actual hard stuff, i.e. the stuff I'm paid for.

And I'm definitely not sure about the supposed big step between 4.5 and 4.6. I'm literally not seeing any.

postflopclarity 2026-03-12 12:40 UTC link
> but mostly in skills and token use optimization.

I have heard rumors that token use optimization has been a recent focus to try to tidy up the financials of these companies before they IPO. take that with a grain of salt though

cj 2026-03-12 12:40 UTC link
I agree with your sentiment, but I think we've yet to see the full application of the current technology. (Even if LLMs themselves don't improve, there's significant opportunity for people to use it in ways not currently being done)
saulpw 2026-03-12 12:45 UTC link
After only 3 months (!) you can claim a plateau, but not a wall.
AussieWog93 2026-03-12 12:55 UTC link
I'd agree with you on 4.5 to 4.6, but going from gpt-5 or 4.0 to 4.5 was night and day.
roxolotl 2026-03-12 13:05 UTC link
I don't know that that graph shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code that is in a different style from my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.
orwin 2026-03-12 13:08 UTC link
> People say AI is “good at front end”

I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.

GaggiX 2026-03-12 13:12 UTC link
Image quality has improved a lot in recent months thanks to better models. The ability of people to notice these improvements is plateauing because they are not trained to spot artifacts, which are becoming more obscure.
yorwba 2026-03-12 13:14 UTC link
Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
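The all-or-nothing effect described above can be seen in a toy model (step count and accuracies invented here): if any single step failing sinks the run, the end-to-end pass rate is per-step accuracy raised to the number of steps.

```python
# Toy model of an all-or-nothing metric: a run fails if any single step
# fails, so the end-to-end pass rate is acc ** steps.
def pass_rate(per_step_acc, steps=20):
    return per_step_acc ** steps

# A large per-step gain (0.80 -> 0.90) barely registers end-to-end,
# while the final approach to 1.0 produces the apparent "jump".
for acc in (0.80, 0.90, 0.95, 0.99):
    print(acc, round(pass_rate(acc), 3))
```

This is why the comment suggests tracking the smaller components individually: the per-step metric moves smoothly while the end-to-end one looks flat, then jumps.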

Zababa 2026-03-12 13:17 UTC link
I think it is important to try to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one shot PRs?". The answer seems to be yes, but slower than benchmarks suggest, if you look at the pass rate of different versions of Claude Sonnet. A reasonable answer is "you're not supposed to use them by making one shot PRs", but then ideally we would need some kind of standardized test for the ability of models to incorporate feedback and evolve PRs.
jygg4 2026-03-12 13:44 UTC link
The models lose the ability to inject subtle and nuanced stuff as they scale up, is what I’ve observed.
xyzsparetimexyz 2026-03-12 13:49 UTC link
Steet? Do you mean street? They're smarter in the same way a search engine is smarter.
BoumTAC 2026-03-12 14:53 UTC link
It's because they are getting so good it's impossible to recognize them.

Haiku 4.5 is already so good it's ok for 80% (95%?) of dev tasks.

jygg4 2026-03-12 14:57 UTC link
Indeed. Why is this post down voted? There’s always trade-offs taking place, it’s good to call them out.
hrmtst93837 2026-03-12 21:14 UTC link
Focusing on flashy breakthroughs hides the issue that bigger models and merge benchmarks rarely translate to reliability in real codebases. For routine merges, subtle regressions and context quirks matter more than headline progress. Unless evals stress nasty scenarios like multi-file renames with tricky conflicts, the numbers are mostly for show. Progress will plateau until someone tunes for the boring, messy cases that waste dev time.
lich_king 2026-03-12 21:40 UTC link
> In practice this still doesn't mean 50 % of white collar can't be automated though.

Let me ask you this, though: if we wanted to, what percentage of white collar jobs could have been automated or eliminated prior to LLMs?

Meta has nearly 80k employees to basically run two websites and three mobile apps. There were 18k people working at LinkedIn! Many big tech companies are massive job programs with some product on the side. Administrative business partners, program managers, tech writers, "stewards", "champions", "advocates", 10-layer-deep reporting chains... engineers writing cafe menu apps and pet programming languages... a team working on in-house typefaces... the list goes on.

I can see AI producing shifts in the industry by reducing demand for meaningful work, but I doubt the outcome here is mass unemployment. There's an endless supply of bs jobs as long as the money is flowing.

8note 2026-03-12 22:17 UTC link
> But LLMs are hamstrung by their harnesses

Entirely so. I think Anthropic updated something about the compact algorithm recently, and it's gone from working well over long periods to basically garbage whenever a compact happens.

mountainriver 2026-03-12 23:51 UTC link
Yeah same, and all my coworkers feel the same.

Most of us have been coding for ages. I actually find it really odd people keep trying to disprove things that are relatively obvious with LLMs

sumeno 2026-03-13 00:18 UTC link
This has basically been my experience since Sonnet 3.5. I've been working on a personal project on and off with various models since then, and the biggest difference between then and now is that it will do larger chunks of work than it did before. But the quality of the code is not particularly better: I still have to do a lot of cleanup, and it still goes off the rails pretty frequently. I have to do fewer individual prompts, but reviewing takes longer because I also have to mentally process and fix larger chunks of code.

Is it a better user experience now? Yes. Has it boosted my productivity on this project? Absolutely.

But it still needs a ton of hand holding for anything complicated and I still deal with tons of "OK, this bug is fixed now!" followed by manually confirming a bug still exists.

globular-toast 2026-03-13 07:02 UTC link
It's so disrespectful to say an LLM is more intelligent than a person on the street. The LLM has nothing at stake, cares not a sausage about the consequences of what it spits out. People have all kinds of pressures, dependants, and personal issues like health. Our thoughts and actions have real consequences. It's so easy to be intelligent when you're the pretend human that gets switched on for five minutes then switched off again.
leoedin 2026-03-13 10:06 UTC link
This matches my experience too. The models write code that would never pass a review normally. Mega functions, "copy and pasted" code with small changes, deep nested conditionals and loops. All the stuff we've spent a lot of time trying to minimise!

You could argue it's OK because a model can always fix it later. But the problem comes when there's subtle logic bugs and its basically impossible to understand. Or fixing the bug in one place doesn't fix it in the 10 other places almost the same code exists.

I strongly suspect that LLMs, like all technologies, are going to follow an S curve of capability. The question is where in that S curve we are right now.

zx8080 2026-03-13 10:08 UTC link
> People say AI is “good at front end” but I see the worst kind of atrocities there

It's almost universal to say "AI is great at X" where one is not a professional in X. It's because that's how AI is designed: to output tokens according to stats, not logic, not semantics, not meaning. Stats.

SkyPuncher 2026-03-14 03:16 UTC link
4.6 has been a very, very slight regression for me, but the tradeoff is they've added better compaction - and now larger context windows. That's a reasonable tradeoff for me.
Editorial Channel
What the content says
+0.30
Article 19 Freedom of Expression
Medium A: The author exercises freedom of expression to publish scientific critique and analysis.
Editorial
+0.30
SETL
+0.17

Content demonstrates active exercise of freedom of expression: author publishes analytical critique of peer-reviewed research, presents alternative data interpretations, and challenges prevailing claims about LLM capabilities without apparent censorship or restriction.

ND
Preamble Preamble

Content does not directly address human dignity, freedom, or the philosophical foundations of human rights.

ND
Article 1 Freedom, Equality, Brotherhood

Content does not address equality, dignity, or inherent rights of persons.

ND
Article 2 Non-Discrimination

Content does not discuss non-discrimination, protected characteristics, or protected status.

ND
Article 3 Life, Liberty, Security

Content does not address rights to life, liberty, or personal security.

ND
Article 4 No Slavery

Content does not address slavery or servitude.

ND
Article 5 No Torture

Content does not address torture or cruel/inhuman/degrading treatment.

ND
Article 6 Legal Personhood

Content does not address legal personhood or recognition before law.

ND
Article 7 Equality Before Law

Content does not address equal protection under law.

ND
Article 8 Right to Remedy

Content does not address remedies for human rights violations.

ND
Article 9 No Arbitrary Detention

Content does not address arbitrary arrest or detention.

ND
Article 10 Fair Hearing

Content does not address fair trial or impartial hearing.

ND
Article 11 Presumption of Innocence

Content does not address criminal responsibility, presumption of innocence, or ex post facto laws.

ND
Article 12 Privacy

Content does not address privacy, family, home, or correspondence.

ND
Article 13 Freedom of Movement

Content does not address freedom of movement or internal displacement.

ND
Article 14 Asylum

Content does not address asylum or refuge.

ND
Article 15 Nationality

Content does not address nationality or citizenship.

ND
Article 16 Marriage & Family

Content does not address marriage, family, or property rights.

ND
Article 17 Property

Content does not address property rights or ownership.

ND
Article 18 Freedom of Thought

Content does not address freedom of thought, conscience, or religion.

ND
Article 20 Assembly & Association

Content does not address freedom of peaceful assembly or association.

ND
Article 21 Political Participation

Content does not address political participation or democratic processes.

ND
Article 22 Social Security

Content does not address social security or labor rights.

ND
Article 23 Work & Equal Pay

Content does not address work, employment, fair wages, or labor conditions.

ND
Article 24 Rest & Leisure

Content does not address rest, leisure, or working hours.

ND
Article 25 Standard of Living

Content does not address health, nutrition, housing, or social services.

ND
Article 26 Education

Content does not address education or educational access.

ND
Article 27 Cultural Participation

Content does not address cultural participation or intellectual property.

ND
Article 28 Social & International Order

Content does not address social or international order necessary for rights realization.

ND
Article 29 Duties to Community

Content does not address duties or limitations on rights exercise.

ND
Article 30 No Destruction of Rights

Content does not prohibit activity aimed at destroying rights or freedoms.

Structural Channel
What the site does
Element Modifier Affects Note
Legal & Terms
Privacy
No privacy policy or data collection mechanisms observable on URL.
Terms of Service
No terms of service observable on URL.
Identity & Mission
Mission
No explicit mission statement observable; appears to be personal analytical blog.
Editorial Code
No editorial guidelines observable.
Ownership
Author name 'Entropicthoughts' suggests personal/independent authorship.
Access & Distribution
Access Model
Content appears freely accessible; no paywall or login requirement.
Ad/Tracking
No advertising or tracking mechanisms observable on page.
Accessibility
Content renders in plain HTML; no apparent accessibility barriers detected.
+0.20
Article 19 Freedom of Expression
Medium A: The author exercises freedom of expression to publish scientific critique and analysis.
Structural
+0.20
Context Modifier
0.00
SETL
+0.17

Website permits publication of contrarian viewpoints without apparent content moderation, paywall, or login restrictions. Content remains freely accessible.

ND
Preamble Preamble

Website structure offers no observable mechanisms that substantively enable or restrict human rights exercise.

ND
Article 1 Freedom, Equality, Brotherhood

No structural mechanisms observable that relate to equal treatment or dignity affirmation.

ND
Article 2 Non-Discrimination

No observable structural discrimination or protections related to enumerated characteristics.

ND
Article 3 Life, Liberty, Security

No structural mechanisms that affect life, liberty, or security.

ND
Article 4 No Slavery

No structural mechanisms related to labor coercion or servitude.

ND
Article 5 No Torture

No structural mechanisms that facilitate or impede protection from torture.

ND
Article 6 Legal Personhood

No observable structural impact on legal recognition.

ND
Article 7 Equality Before Law

No structural mechanisms observable that facilitate or obstruct equal legal protection.

ND
Article 8 Right to Remedy

No observable structural mechanisms affecting recourse or remedy.

ND
Article 9 No Arbitrary Detention

No structural mechanisms related to arrest or detention.

ND
Article 10 Fair Hearing

No structural mechanisms observable affecting judicial process.

ND
Article 11 Presumption of Innocence

No structural mechanisms related to criminal justice.

ND
Article 12 Privacy

No personal data collection or privacy violations observable on page; domain-level assessment shows no privacy policy but also no data collection.

ND
Article 13 Freedom of Movement

No structural mechanisms affecting freedom of movement.

ND
Article 14 Asylum

No structural mechanisms observable affecting asylum or refuge.

ND
Article 15 Nationality

No structural mechanisms observable affecting citizenship rights.

ND
Article 16 Marriage & Family

No structural mechanisms observable affecting family or property rights.

ND
Article 17 Property

No structural mechanisms observable affecting property rights.

ND
Article 18 Freedom of Thought

No structural mechanisms observable affecting conscience or religion.

ND
Article 20 Assembly & Association

No structural mechanisms observable affecting assembly or association.

ND
Article 21 Political Participation

No structural mechanisms observable affecting political participation.

ND
Article 22 Social Security

No structural mechanisms observable affecting social protection.

ND
Article 23 Work & Equal Pay

No structural mechanisms observable affecting labor rights or employment.

ND
Article 24 Rest & Leisure

No structural mechanisms observable affecting rest or recreation.

ND
Article 25 Standard of Living

No structural mechanisms observable affecting health or welfare rights.

ND
Article 26 Education

Website presents technical content without apparent accessibility barriers, but structure does not substantively address educational rights or access.

ND
Article 27 Cultural Participation

No structural mechanisms observable affecting cultural or intellectual property rights.

ND
Article 28 Social & International Order

No structural mechanisms observable affecting social or international order.

ND
Article 29 Duties to Community

No structural mechanisms observable that enforce or establish community duties.

ND
Article 30 No Destruction of Rights

No structural mechanisms observable that would prevent exercise of Article 30.

Supplementary Signals
How this content communicates, beyond directional lean. Learn more
Epistemic Quality
How well-sourced and evidence-based is this content?
0.79 medium claims
Sources
0.8
Evidence
0.8
Uncertainty
0.8
Purpose
0.8
Propaganda Flags
No manipulative rhetoric detected
0 techniques detected
Emotional Tone
Emotional character: positive/negative, intensity, authority
measured
Valence
-0.2
Arousal
0.4
Dominance
0.5
Transparency
Does the content identify its author and disclose interests?
0.33
✗ Author
More signals: context, framing & audience
Solution Orientation
Does this content offer solutions or only describe problems?
0.28 problem only
Reader Agency
0.4
Stakeholder Voice
Whose perspectives are represented in this content?
0.45 3 perspectives
Speaks: individuals
About: corporation, institution
Temporal Framing
Is this content looking backward, at the present, or forward?
present short term
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
moderate · medium jargon · domain-specific
Longitudinal 597 HN snapshots · 210 evals
Audit Trail 230 entries
2026-03-16 00:48 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-16 00:47 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-16 00:44 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-16 00:44 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-16 00:44 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-16 00:44 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 22:53 eval_success Evaluated: Mild positive (0.26) - -
2026-03-15 22:53 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-15 22:53 eval Evaluated by claude-haiku-4-5-20251001: +0.26 (Mild positive) 10,387 tokens 0.00
2026-03-15 22:49 eval_success Evaluated: Mild positive (0.26) - -
2026-03-15 22:49 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-15 22:49 eval Evaluated by claude-haiku-4-5-20251001: +0.26 (Mild positive) 10,264 tokens
2026-03-15 22:22 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 22:22 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 21:54 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 21:54 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 21:54 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 18:33 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 18:33 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 18:33 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 18:20 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 18:20 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 17:22 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 17:22 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 17:22 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 17:02 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 17:02 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 16:09 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 16:09 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 16:09 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 15:48 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 15:48 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
Entries from 2026-03-15 15:34 back through 2026-03-12 12:59 repeat the same two evaluations at roughly 30–80 minute intervals: llama-4-scout-wai-psq scoring +0.28 (Mild positive) with delta 0.00, and llama-4-scout-wai scoring -0.08 (Neutral) with delta 0.00 and the reasoning "Technical blog post discussing LLM performance with no explicit human rights discussion". Each dip to +0.12 below returns to +0.28 (delta +0.16) on the following run. The exceptions:
2026-03-15 12:30 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-14 19:39 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-14 16:55 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-13 05:32 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-13 04:56 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-13 01:44 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) +0.16
2026-03-13 01:16 eval Evaluated by llama-4-scout-wai: -0.24 (Mild negative) -0.16
2026-03-13 00:09 eval Evaluated by llama-3.3-70b-wai-psq: +0.17 (Mild positive)
2026-03-13 00:05 eval Evaluated by llama-3.3-70b-wai: -0.08 (Neutral) — reasoning: "Technical analysis of LLM improvement"
2026-03-12 21:57 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16