+0.30 Are LLMs not getting better? (entropicthoughts.com) S: +0.20
172 points by 4diii 3 days ago | 157 comments on HN | Mild positive · Contested · Low agreement (3 models) · Editorial · v3.7 · 2026-03-15 22:53:36
Summary Free Expression Advocates
This blog post exercises and advocates for freedom of expression by publishing rigorous statistical critique of peer-reviewed LLM research. The author presents data reanalysis challenging prevailing industry narratives about AI progress, demonstrating intellectual freedom to question consensus and publish contrarian analysis. The content engages directly with Article 19 (freedom of opinion and expression) through its unrestricted dissemination of critical inquiry.
Article Heatmap
Preamble and Articles 1–18: No Data. Article 19: +0.26 — Freedom of Expression. Articles 20–30: No Data.
Aggregates
E +0.30 · S +0.20
Weighted Mean +0.26 Unweighted Mean +0.26
Max +0.26 Article 19 Min +0.26 Article 19
Signal 1 No Data 30
Volatility 0.00 (Low)
Negative 0 Channels E: 0.6 S: 0.4
SETL +0.17 Editorial-dominant
FW Ratio 50% 3 facts · 3 inferences
Agreement Low 3 models · spread ±0.170
Evidence 2% coverage
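The headline Weighted Mean is consistent with combining the two channel scores under the channel weights listed above (E: 0.6, S: 0.4). A minimal sketch; the function name is ours, not the site's:

```python
# Combine per-channel scores using the listed channel weights.
def weighted_mean(scores, weights):
    total = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total

scores = {"E": 0.30, "S": 0.20}   # Editorial and Structural channel scores
weights = {"E": 0.6, "S": 0.4}    # channel weights as shown above
print(round(weighted_mean(scores, weights), 2))  # → 0.26
```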
Theme Radar
Foundation: 0.00 (0 articles) · Security: 0.00 (0 articles) · Legal: 0.00 (0 articles) · Privacy & Movement: 0.00 (0 articles) · Personal: 0.00 (0 articles) · Expression: 0.26 (1 article) · Economic & Social: 0.00 (0 articles) · Cultural: 0.00 (0 articles) · Order & Duties: 0.00 (0 articles)
HN Discussion 19 top-level · 26 replies
mike_hearn 2026-03-12 12:28 UTC link
That's an interesting claim, but I don't see it in my own work. They have got better but it's very hard to quantify. I just find myself editing their work much less these days (currently using GPT 5.4).
boonzeet 2026-03-12 12:29 UTC link
Interesting article, although with so few data points and such a specific time slice it is difficult to draw serious conclusions about the "improvement" of LLMs.

It's notably lacking newer models (4.5 Opus, 4.6 Sonnet) and models from Gemini.

LLMs appear to naturally progress in short leaps followed by longer plateaus, as breakthroughs are developed such as chain-of-thought, mixture-of-experts, sub-agents, etc.

reedf1 2026-03-12 12:30 UTC link
Given that it is the general consensus that a step function occurred with Opus 4.5/4.6 only 3 months ago - it seems like an insane omission.
Flavius 2026-03-12 12:31 UTC link
> This means llms have not improved in their programming abilities for over a year. Isn’t that wild? Why is nobody talking about this?

Because it's not true. They have improved tremendously in the last year, but it looks like they've hit a wall in the last 3 months. Still seeing some improvements but mostly in skills and token use optimization.

jeffnv 2026-03-12 12:32 UTC link
I don't think it's true, but am I alone in wishing it was? My world is disrupted somewhat but so far I don't think we have a thing that upends our way of life completely yet. If it stayed exactly this good I'd be pretty content.
curiouscube 2026-03-12 12:44 UTC link
There is a decent case for this thesis to hold true, especially if we look at the shift in training regimes and benchmarking over the last 1-2 years. Frontier labs don't seem to really push pure size/capability anymore; it's an all-in focus on agentic AI, which is mainly complex post-training regimes.

There are good reasons why they don't or can't do simple param upscaling anymore, but still, it makes me bearish on AGI since it's a slow, but massive shift in goal setting.

In practice this still doesn't mean 50 % of white collar can't be automated though.

antisthenes 2026-03-12 12:48 UTC link
They are getting better, but they are also hitting diminishing returns.

There's only so much data to train on, and we are unlikely to see giant leaps in performance as we did in 2023/2024.

2026-27 will be the years of primarily ecosystem/agentic improvements and reducing costs.

idorozin 2026-03-12 12:50 UTC link
My experience has been that raw “one-shot intelligence” hasn’t improved as dramatically in the last year, but the workflow around the models has improved massively.

When you combine models with tool use, planning loops, agents that break tasks into smaller pieces, and persistent context / repos, the practical capability jump is huge.

sunaurus 2026-03-12 12:53 UTC link
I am pretty convinced that for most types of day to day work, any perceived improvements from the latest Claude models for example were total placebo. In blind tests and with normal tasks, people would probably have no idea if they're using Opus 4.5 or 4.6.
pu_pe 2026-03-12 12:55 UTC link
Benchmaxxing aside, if you are using those tools for programming on a regular basis it should be self-evident that they are improving. I find it very hard to believe that someone using LLMs today vs what was available one year ago (Claude Code released Feb 2025) would have any difficulty answering this question.
wongarsu 2026-03-12 12:57 UTC link
I don't find this very compelling. If you look at the actual graph they are referencing but never showing [1], there is a clear improvement from Sonnet 3.7 -> Opus 4.0 -> Sonnet 4.5. This is just hidden in their graph because they are only looking at the number of PRs that are mergeable with no human feedback whatsoever (a high standard even for humans).

And even if we were to agree that that's a reasonable standard, GPT 5 shouldn't be included. There is only one data point for all OpenAI models, and it is more indicative of the performance of OpenAI models (and the harness used) than of any progression. Once you exclude it, the rest matches what you would expect from a logistic model. Improvements have slowed down, but not stopped.

1: https://metr.org/assets/images/many-swe-bench-passing-prs-wo...
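The logistic reading above is easy to illustrate with made-up numbers (none of these are METR's data): under a logistic curve, generation-over-generation gains shrink but stay positive.

```python
import math

# Hypothetical logistic capability curve; parameters are invented for
# illustration: p(t) = L / (1 + exp(-k * (t - t0))).
def logistic(t, L=0.9, k=1.2, t0=2.0):
    return L / (1 + math.exp(-k * (t - t0)))

generations = [0, 1, 2, 3, 4]
rates = [logistic(t) for t in generations]
gains = [b - a for a, b in zip(rates, rates[1:])]

# Every generation still improves, but late gains are smaller than the
# peak gain: slowed, not stopped.
assert all(g > 0 for g in gains)
assert gains[-1] < max(gains)
```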

utopiah 2026-03-12 12:59 UTC link
I gave up on trying months ago, you can see the timeline on top of https://fabien.benetou.fr/Content/SelfHostingArtificialIntel...

Truth is I'm probably wrong. I should keep on testing... but at the same time, I gave up precisely because I didn't think the trend was fast enough to justify checking so frequently. Now I just read this kind of post and ask around (mainly arguing with comments, asking for genuine examples that should be "surprising", and being disappointed), and that seems to be enough for a proxy.

I should though, as I mentioned in another comment, keep track of failed attempts.

PS: I check solely on self-hosted models (even if not on my machine but least on machines I could setup) because I do NOT trust the scaffolding around proprietary closed sources models. I can't verify that nobody is in the loop.

sd9 2026-03-12 13:01 UTC link
You really can't model these 5 data points with a linear regression or a step function. The models are of different sizes / use cases, and from two different labs. What we've observed generally is that different labs releasing similarly sized models at similar times end up pretty similar.

I think the only reasonable thing to read into is Sonnet 3.5 -> 3.7 -> 4.5. But yeah, you just can't draw a line through this thing.

I will die on the hill that LLMs are getting better, particularly Anthropic's releases since December. But I can't point at a graph to prove that, I'm just drawing on my personal experience. I do use Claude Code though, so I think a large part of the improvement comes from the harness.
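The point about five data points can be made concrete with invented numbers (not the METR data): with a handful of noisy pass rates, the fitted slope's standard error can exceed the slope itself, so "flat" and "improving" are statistically indistinguishable. A minimal ordinary-least-squares sketch:

```python
# Invented pass rates for five releases; purely illustrative.
xs = [0, 1, 2, 3, 4]                 # release index
ys = [0.30, 0.38, 0.29, 0.41, 0.36]  # pass rate

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
residuals = [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Standard error of the slope under the usual OLS assumptions.
se = (sum(r * r for r in residuals) / (n - 2) / sxx) ** 0.5

# With so few points the noise swamps the trend.
assert se > abs(slope)
```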

Havoc 2026-03-12 13:02 UTC link
As they become more capable, people's commits will also become more ambitious.

So I’d say fairly flat commit acceptance numbers make sense even in the context of improving LLMs.

aerhardt 2026-03-12 13:04 UTC link
I feel that two things are true at the same time:

1) Something happened during 2025 that made the models (or crucially, the wrapping terminal-based apps like Claude Code or Codex) much better. I only type in the terminal anymore.

2) The quality of the code is still quite often terrible. Quadruple-nested control flow abounds. Software architecture in rather small scopes is unsound. People say AI is “good at front end” but I see the worst kind of atrocities there (a few days ago Codex 5.3 tried to inject a massive HTML element with a CSS ::before hack rather than properly refactoring the markup).

Two forces feel true simultaneously but in permanent tension. I still cannot make up my mind or see the synthesis in the dialectic: where this is truly going, and whether we’re meaningfully moving forward or mostly moving in circles.

orwin 2026-03-12 13:04 UTC link
I think what happened with static image generation is happening with LLMs. The tools around them are becoming better, but the core AI improvements have stalled: the error rate stays the same (though external tools curate the results, so it won't be noticeable if you don't run your own model), and accuracy is still slightly improving, but more and more slowly, never reaching the 'perfect' point. Basically Stable Diffusion in early 2025.
BoppreH 2026-03-12 13:31 UTC link
Controversial opinion from a casual user, but state-of-the-art LLMs now feel to me more intelligent than the average person on the steet. That also explains why training on more average-quality data (if there's any left) is not producing improvements.

But LLMs are hamstrung by their harnesses. They are doing the equivalent of providing technical support via phone call: little to no context, and limited to a bidirectional stream of words (tokens). The best agent harnesses have the equivalent of vision-impairment accessibility interfaces, and even those are still subpar.

Heck, giving LLMs time to think was once a groundbreaking idea. Yesterday I saw Claude Code editing a file using shell redirects! It's barbaric.

I expect future improvements to come from harness improvements, especially around sub agents/context rollbacks (to work around the non-linear cost of context) and LLM-aligned "accessibility tools". That, or more synthetic training data.

Incipient 2026-03-12 16:09 UTC link
I feel even if the models are stagnating, the tooling around them, and the integrations and harnesses they have are getting significantly more capable (if not always 'better' - the recent vscode update really handicapped them for some reason). Things like the new agent from booking.com or whatever, if it could integrate with all hotels, activities, mapping tools, flight system, etc could be hugely powerful.

Assuming we get no better than opus 4.6, they're very capable. Even if they make up nonsense 5% of the time!

globular-toast 2026-03-13 06:51 UTC link
I reckon LLM merge rates will go up, but not necessarily due to quality improvements. Instead I think maintainers will just become fatigued. The amount of code I'm expected to review now is way higher than before. And while I'm reviewing you know more is being generated. I'm sure I've let through more crap due to this fatigue attack on me.
nkozyra 2026-03-12 12:33 UTC link
The problem with evals is the underlying rubric will always be either subjective, or a quantitative score based on something that is likely now baked into the training set directly.

You kind of have to go on "feels" for a lot of this.

dwedge 2026-03-12 12:34 UTC link
Without meaning to sound dismissive, because I'm really not intending to, there's also the possibility that you've gotten worse after enough time using them. You're treating yourself as a constant in this, but man cannot walk in the same river twice.
jeremyjh 2026-03-12 12:35 UTC link
This has been the general consensus for about three years now. "Drastic increases in capability have happened the last 3-6 months" have been a constant refrain.

Without any data from the study past September, I think it's not an unreasonable claim, if you want to make an argument based on evidence.

For me personally, I agree with you, I'm really seeing it as well.

Toutouxc 2026-03-12 12:35 UTC link
There's a consensus that SOMETHING changed with Opus 4.5. It might have been the "merge rates" metric, it might have not.

I'm certainly getting faster and cleaner-looking solutions for certain issues on Opus 4.6 than I was 5 months ago, but I'm not sure about the ability to solve (or even weigh in on) the actual hard stuff, i.e. the stuff I'm paid for.

And I'm definitely not sure about the supposed big step between 4.5 and 4.6. I'm literally not seeing any.

postflopclarity 2026-03-12 12:40 UTC link
> but mostly in skills and token use optimization.

I have heard rumors that token use optimization has been a recent focus to try to tidy up the financials of these companies before they IPO. take that with a grain of salt though

cj 2026-03-12 12:40 UTC link
I agree with your sentiment, but I think we've yet to see the full application of the current technology. (Even if LLMs themselves don't improve, there's significant opportunity for people to use it in ways not currently being done)
saulpw 2026-03-12 12:45 UTC link
After only 3 months (!) you can claim a plateau, but not a wall.
AussieWog93 2026-03-12 12:55 UTC link
I'd agree with you on 4.5 to 4.6, but going from gpt-5 or 4.0 to 4.5 was night and day.
roxolotl 2026-03-12 13:05 UTC link
I don't know that that graph shows Sonnet 4.5 as worse than 3.7. Maybe the automated grader is finding code breakages in 3.7 and not breaking that out? I'd much prefer to add code that is in a different style from my codebase than code that breaks other code. But even ignoring that, the pass rate is almost identical between the two models.
orwin 2026-03-12 13:08 UTC link
> People say AI is “good at front end”

I only say that because I'm a shit frontend dev. Honestly, I'm not that bad anymore, but I'm still shit, and the AI will probably generate better code than I will.

GaggiX 2026-03-12 13:12 UTC link
Image quality has improved a lot in recent months thanks to better models. The ability of people to notice these improvements is plateauing because they are not trained to spot artifacts, which are becoming more obscure.
yorwba 2026-03-12 13:14 UTC link
Yes, I think this is basically an instance of the "emergent abilities mirage." https://arxiv.org/abs/2304.15004

If you measure completion rate on a task where a single mistake can cause a failure, you won't see noticeable improvements on that metric until all potential sources of error are close to being eliminated, and then if they do get eliminated it causes a sudden large jump in performance.

That's fine if you just want to know whether the current state is good enough on your task of choice, but if you also want to predict future performance, you need to break it down into smaller components and track each of them individually.
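The all-or-nothing effect described above can be seen in a toy model (step count and accuracies invented here): if any single step failing sinks the run, the end-to-end pass rate is per-step accuracy raised to the number of steps.

```python
# Toy model of an all-or-nothing metric: a run fails if any single step
# fails, so the end-to-end pass rate is acc ** steps.
def pass_rate(per_step_acc, steps=20):
    return per_step_acc ** steps

# A large per-step gain (0.80 -> 0.90) barely registers end-to-end,
# while the final approach to 1.0 produces the apparent "jump".
for acc in (0.80, 0.90, 0.95, 0.99):
    print(acc, round(pass_rate(acc), 3))
```

This is why the comment suggests tracking the smaller components individually: the per-step metric moves smoothly while the end-to-end one looks flat, then jumps.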

Zababa 2026-03-12 13:17 UTC link
I think it is important to try to find more rigorous things to test than the general sentiment of the people using the tools, if only because the more benchmarks we have, the more we can improve models without regressions. METR is asking a really interesting question here: "are models improving at making one shot PRs?". The answer seems to be yes, but slower than benchmarks suggest, if you look at the pass rate of different versions of Claude Sonnet. A reasonable answer is "you're not supposed to use them by making one shot PRs", but then ideally we would need some kind of standardized test for the ability of models to incorporate feedback and evolve PRs.
jygg4 2026-03-12 13:44 UTC link
The models lose the ability to inject subtle and nuanced stuff as they scale up, is what I’ve observed.
xyzsparetimexyz 2026-03-12 13:49 UTC link
Steet? Do you mean street? They're smarter in the same way a search engine is smarter.
BoumTAC 2026-03-12 14:53 UTC link
It's because they are getting so good it's impossible to recognize them.

Haiku 4.5 is already so good it's ok for 80% (95%?) of dev tasks.

jygg4 2026-03-12 14:57 UTC link
Indeed. Why is this post down voted? There’s always trade-offs taking place, it’s good to call them out.
hrmtst93837 2026-03-12 21:14 UTC link
Focusing on flashy breakthroughs hides the issue that bigger models and merge benchmarks rarely translate to reliability in real codebases. For routine merges, subtle regressions and context quirks matter more than headline progress. Unless evals stress nasty scenarios like multi-file renames with tricky conflicts, the numbers are mostly for show. Progress will plateau until someone tunes for the boring, messy cases that waste dev time.
lich_king 2026-03-12 21:40 UTC link
> In practice this still doesn't mean 50 % of white collar can't be automated though.

Let me ask you this, though: if we wanted to, what percentage of white collar jobs could have been automated or eliminated prior to LLMs?

Meta has nearly 80k employees to basically run two websites and three mobile apps. There were 18k people working at LinkedIn! Many big tech companies are massive job programs with some product on the side. Administrative business partners, program managers, tech writers, "stewards", "champions", "advocates", 10-layer-deep reporting chains... engineers writing cafe menu apps and pet programming languages... a team working on in-house typefaces... the list goes on.

I can see AI producing shifts in the industry by reducing demand for meaningful work, but I doubt the outcome here is mass unemployment. There's an endless supply of bs jobs as long as the money is flowing.

8note 2026-03-12 22:17 UTC link
> But LLMs are hamstrung by their harnesses

Entirely so. I think Anthropic updated something about the compact algorithm recently, and it's gone from working well over long periods to basically garbage whenever a compact happens.

mountainriver 2026-03-12 23:51 UTC link
Yeah same, and all my coworkers feel the same.

Most of us have been coding for ages. I actually find it really odd people keep trying to disprove things that are relatively obvious with LLMs

sumeno 2026-03-13 00:18 UTC link
This has basically been my experience since Sonnet 3.5. I've been working on a personal project on and off with various models since then, and the biggest difference between then and now is that it will do larger chunks of work than it did before. But the quality of the code is not particularly better: I still have to do a lot of cleanup, and it still goes off the rails pretty frequently. I have to do fewer individual prompts, but reviewing takes longer because I also have to mentally process and fix larger chunks of code.

Is it a better user experience now? Yes. Has it boosted my productivity on this project? Absolutely.

But it still needs a ton of hand holding for anything complicated and I still deal with tons of "OK, this bug is fixed now!" followed by manually confirming a bug still exists.

globular-toast 2026-03-13 07:02 UTC link
It's so disrespectful to say an LLM is more intelligent than a person on the street. The LLM has nothing at stake, cares not a sausage about the consequences of what it spits out. People have all kinds of pressures, dependants, and personal issues like health. Our thoughts and actions have real consequences. It's so easy to be intelligent when you're the pretend human that gets switched on for five minutes then switched off again.
leoedin 2026-03-13 10:06 UTC link
This matches my experience too. The models write code that would never pass a review normally. Mega functions, "copy and pasted" code with small changes, deep nested conditionals and loops. All the stuff we've spent a lot of time trying to minimise!

You could argue it's OK because a model can always fix it later. But the problem comes when there's subtle logic bugs and its basically impossible to understand. Or fixing the bug in one place doesn't fix it in the 10 other places almost the same code exists.

I strongly suspect that LLMs, like all technologies, are going to follow an S curve of capability. The question is where in that S curve we are right now.

zx8080 2026-03-13 10:08 UTC link
> People say AI is “good at front end” but I see the worst kind of atrocities there

It's almost universal to say "AI is great at X" where one is not a professional in X. It's because that's how AI is designed: to output tokens according to stats, not logic, not semantics, not meaning. Stats.

SkyPuncher 2026-03-14 03:16 UTC link
4.6 has been a very, very slight regression for me, but the tradeoff is they've added better compaction - and now larger context windows. That's a reasonable tradeoff for me.
Editorial Channel
What the content says
+0.30
Article 19 Freedom of Expression
Medium A: The author exercises freedom of expression to publish scientific critique and analysis.
Editorial
+0.30
SETL
+0.17

Content demonstrates active exercise of freedom of expression: author publishes analytical critique of peer-reviewed research, presents alternative data interpretations, and challenges prevailing claims about LLM capabilities without apparent censorship or restriction.

ND
Preamble Preamble

Content does not directly address human dignity, freedom, or the philosophical foundations of human rights.

ND
Article 1 Freedom, Equality, Brotherhood

Content does not address equality, dignity, or inherent rights of persons.

ND
Article 2 Non-Discrimination

Content does not discuss non-discrimination, protected characteristics, or protected status.

ND
Article 3 Life, Liberty, Security

Content does not address rights to life, liberty, or personal security.

ND
Article 4 No Slavery

Content does not address slavery or servitude.

ND
Article 5 No Torture

Content does not address torture or cruel/inhuman/degrading treatment.

ND
Article 6 Legal Personhood

Content does not address legal personhood or recognition before law.

ND
Article 7 Equality Before Law

Content does not address equal protection under law.

ND
Article 8 Right to Remedy

Content does not address remedies for human rights violations.

ND
Article 9 No Arbitrary Detention

Content does not address arbitrary arrest or detention.

ND
Article 10 Fair Hearing

Content does not address fair trial or impartial hearing.

ND
Article 11 Presumption of Innocence

Content does not address criminal responsibility, presumption of innocence, or ex post facto laws.

ND
Article 12 Privacy

Content does not address privacy, family, home, or correspondence.

ND
Article 13 Freedom of Movement

Content does not address freedom of movement or internal displacement.

ND
Article 14 Asylum

Content does not address asylum or refuge.

ND
Article 15 Nationality

Content does not address nationality or citizenship.

ND
Article 16 Marriage & Family

Content does not address marriage, family, or property rights.

ND
Article 17 Property

Content does not address property rights or ownership.

ND
Article 18 Freedom of Thought

Content does not address freedom of thought, conscience, or religion.

ND
Article 20 Assembly & Association

Content does not address freedom of peaceful assembly or association.

ND
Article 21 Political Participation

Content does not address political participation or democratic processes.

ND
Article 22 Social Security

Content does not address social security or labor rights.

ND
Article 23 Work & Equal Pay

Content does not address work, employment, fair wages, or labor conditions.

ND
Article 24 Rest & Leisure

Content does not address rest, leisure, or working hours.

ND
Article 25 Standard of Living

Content does not address health, nutrition, housing, or social services.

ND
Article 26 Education

Content does not address education or educational access.

ND
Article 27 Cultural Participation

Content does not address cultural participation or intellectual property.

ND
Article 28 Social & International Order

Content does not address social or international order necessary for rights realization.

ND
Article 29 Duties to Community

Content does not address duties or limitations on rights exercise.

ND
Article 30 No Destruction of Rights

Content does not prohibit activity aimed at destroying rights or freedoms.

Structural Channel
What the site does
Element Modifier Affects Note
Legal & Terms
Privacy
No privacy policy or data collection mechanisms observable on URL.
Terms of Service
No terms of service observable on URL.
Identity & Mission
Mission
No explicit mission statement observable; appears to be personal analytical blog.
Editorial Code
No editorial guidelines observable.
Ownership
Author name 'Entropicthoughts' suggests personal/independent authorship.
Access & Distribution
Access Model
Content appears freely accessible; no paywall or login requirement.
Ad/Tracking
No advertising or tracking mechanisms observable on page.
Accessibility
Content renders in plain HTML; no apparent accessibility barriers detected.
+0.20
Article 19 Freedom of Expression
Medium A: The author exercises freedom of expression to publish scientific critique and analysis.
Structural
+0.20
Context Modifier
0.00
SETL
+0.17

Website permits publication of contrarian viewpoints without apparent content moderation, paywall, or login restrictions. Content remains freely accessible.

ND
Preamble Preamble

Website structure offers no observable mechanisms that substantively enable or restrict human rights exercise.

ND
Article 1 Freedom, Equality, Brotherhood

No structural mechanisms observable that relate to equal treatment or dignity affirmation.

ND
Article 2 Non-Discrimination

No observable structural discrimination or protections related to enumerated characteristics.

ND
Article 3 Life, Liberty, Security

No structural mechanisms that affect life, liberty, or security.

ND
Article 4 No Slavery

No structural mechanisms related to labor coercion or servitude.

ND
Article 5 No Torture

No structural mechanisms that facilitate or impede protection from torture.

ND
Article 6 Legal Personhood

No observable structural impact on legal recognition.

ND
Article 7 Equality Before Law

No structural mechanisms observable that facilitate or obstruct equal legal protection.

ND
Article 8 Right to Remedy

No observable structural mechanisms affecting recourse or remedy.

ND
Article 9 No Arbitrary Detention

No structural mechanisms related to arrest or detention.

ND
Article 10 Fair Hearing

No structural mechanisms observable affecting judicial process.

ND
Article 11 Presumption of Innocence

No structural mechanisms related to criminal justice.

ND
Article 12 Privacy

No personal data collection or privacy violations observable on page; domain-level assessment shows no privacy policy but also no data collection.

ND
Article 13 Freedom of Movement

No structural mechanisms affecting freedom of movement.

ND
Article 14 Asylum

No structural mechanisms observable affecting asylum or refuge.

ND
Article 15 Nationality

No structural mechanisms observable affecting citizenship rights.

ND
Article 16 Marriage & Family

No structural mechanisms observable affecting family or property rights.

ND
Article 17 Property

No structural mechanisms observable affecting property rights.

ND
Article 18 Freedom of Thought

No structural mechanisms observable affecting conscience or religion.

ND
Article 20 Assembly & Association

No structural mechanisms observable affecting assembly or association.

ND
Article 21 Political Participation

No structural mechanisms observable affecting political participation.

ND
Article 22 Social Security

No structural mechanisms observable affecting social protection.

ND
Article 23 Work & Equal Pay

No structural mechanisms observable affecting labor rights or employment.

ND
Article 24 Rest & Leisure

No structural mechanisms observable affecting rest or recreation.

ND
Article 25 Standard of Living

No structural mechanisms observable affecting health or welfare rights.

ND
Article 26 Education

Website presents technical content without apparent accessibility barriers, but structure does not substantively address educational rights or access.

ND
Article 27 Cultural Participation

No structural mechanisms observable affecting cultural or intellectual property rights.

ND
Article 28 Social & International Order

No structural mechanisms observable affecting social or international order.

ND
Article 29 Duties to Community

No structural mechanisms observable that enforce or establish community duties.

ND
Article 30 No Destruction of Rights

No structural mechanisms observable that would prevent exercise of Article 30.

Supplementary Signals
How this content communicates, beyond directional lean. Learn more
Epistemic Quality
How well-sourced and evidence-based is this content?
0.79 medium claims
Sources
0.8
Evidence
0.8
Uncertainty
0.8
Purpose
0.8
Propaganda Flags
No manipulative rhetoric detected
0 techniques detected
Emotional Tone
Emotional character: positive/negative, intensity, authority
measured
Valence
-0.2
Arousal
0.4
Dominance
0.5
Transparency
Does the content identify its author and disclose interests?
0.33
✗ Author
More signals: context, framing & audience
Solution Orientation
Does this content offer solutions or only describe problems?
0.28 problem only
Reader Agency
0.4
Stakeholder Voice
Whose perspectives are represented in this content?
0.45 3 perspectives
Speaks: individuals
About: corporation, institution
Temporal Framing
Is this content looking backward, at the present, or forward?
present short term
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
moderate · medium jargon · domain-specific
Longitudinal 597 HN snapshots · 210 evals
Audit Trail 230 entries
2026-03-16 00:48 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-16 00:47 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-16 00:44 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-16 00:44 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-16 00:44 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-16 00:44 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 22:53 eval_success Evaluated: Mild positive (0.26) - -
2026-03-15 22:53 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-15 22:53 eval Evaluated by claude-haiku-4-5-20251001: +0.26 (Mild positive) 10,387 tokens 0.00
2026-03-15 22:49 eval_success Evaluated: Mild positive (0.26) - -
2026-03-15 22:49 model_divergence Cross-model spread 0.34 exceeds threshold (2 models) - -
2026-03-15 22:49 eval Evaluated by claude-haiku-4-5-20251001: +0.26 (Mild positive) 10,264 tokens
2026-03-15 22:22 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 22:22 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 21:54 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 21:54 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 21:54 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 18:33 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 18:33 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 18:33 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 18:20 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 18:20 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 17:22 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 17:22 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 17:22 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 17:02 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 17:02 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
2026-03-15 16:09 eval_success Lite evaluated: Neutral (-0.08) - -
2026-03-15 16:09 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) 0.00
reasoning
Technical blog post discussing LLM performance with no explicit human rights discussion
2026-03-15 16:09 rater_validation_warn Lite validation warnings for model llama-4-scout-wai: 1W 0R - -
2026-03-15 15:48 eval_success PSQ evaluated: g-PSQ=0.280 (3 dims) - -
2026-03-15 15:48 eval Evaluated by llama-4-scout-wai-psq: +0.28 (Mild positive) 0.00
Entries from 2026-03-15 15:34 back through 2026-03-12 12:59 repeat the same two evaluations at roughly 30–80 minute intervals: llama-4-scout-wai-psq scoring +0.28 (Mild positive) with delta 0.00, and llama-4-scout-wai scoring -0.08 (Neutral) with delta 0.00 and the reasoning "Technical blog post discussing LLM performance with no explicit human rights discussion". Each dip to +0.12 below returns to +0.28 (delta +0.16) on the following run. The exceptions:
2026-03-15 12:30 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-14 19:39 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-14 16:55 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-13 05:32 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) 0.00
2026-03-13 04:56 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16
2026-03-13 01:44 eval Evaluated by llama-4-scout-wai: -0.08 (Neutral) +0.16
2026-03-13 01:16 eval Evaluated by llama-4-scout-wai: -0.24 (Mild negative) -0.16
2026-03-13 00:09 eval Evaluated by llama-3.3-70b-wai-psq: +0.17 (Mild positive)
2026-03-13 00:05 eval Evaluated by llama-3.3-70b-wai: -0.08 (Neutral) — reasoning: "Technical analysis of LLM improvement"
2026-03-12 21:57 eval Evaluated by llama-4-scout-wai-psq: +0.12 (Mild positive) -0.16