We gave terabytes of CI logs to an LLM (www.mendral.com) E: -0.01 · S: +0.08
225 points by shad42 2 days ago | 107 comments on HN | Neutral Editorial · v3.7 · 2026-02-28 12:37:42
Summary Labor Automation & Privacy Neutral
A technical blog post from Mendral describing their LLM-powered CI debugging system architecture, focused on SQL querying, data compression, and operational patterns. While strong on technical education and public information sharing, the article conspicuously omits discussion of privacy implications in large-scale metadata collection (5.31 TiB uncompressed) and labor considerations in workplace automation, representing gaps by silence rather than explicit opposition to rights.
Article Heatmap
Preamble: -0.05
Article 1: ND — Freedom, Equality, Brotherhood
Article 2: ND — Non-Discrimination
Article 3: ND — Life, Liberty, Security
Article 4: ND — No Slavery
Article 5: ND — No Torture
Article 6: ND — Legal Personhood
Article 7: ND — Equality Before Law
Article 8: ND — Right to Remedy
Article 9: ND — No Arbitrary Detention
Article 10: ND — Fair Hearing
Article 11: ND — Presumption of Innocence
Article 12: -0.20 — Privacy
Article 13: ND — Freedom of Movement
Article 14: ND — Asylum
Article 15: ND — Nationality
Article 16: ND — Marriage & Family
Article 17: ND — Property
Article 18: ND — Freedom of Thought
Article 19: +0.10 — Freedom of Expression
Article 20: ND — Assembly & Association
Article 21: ND — Political Participation
Article 22: ND — Social Security
Article 23: -0.15 — Work & Equal Pay
Article 24: ND — Rest & Leisure
Article 25: ND — Standard of Living
Article 26: +0.13 — Education
Article 27: +0.08 — Cultural Participation
Article 28: ND — Social & International Order
Article 29: ND — Duties to Community
Article 30: ND — No Destruction of Rights
Aggregates
Editorial Mean -0.01 Structural Mean +0.08
Weighted Mean -0.01 Unweighted Mean -0.01
Max +0.13 Article 26 Min -0.20 Article 12
Signal 6 No Data 25
Volatility 0.13 (Medium)
Negative 3 Channels E: 0.6 S: 0.4
SETL +0.05 Editorial-dominant
FW Ratio 50% 11 facts · 11 inferences
Evidence 10% coverage
5 Medium · 1 Low · 25 No Data
Theme Radar
Foundation: -0.05 (1 article) Security: 0.00 (0 articles) Legal: 0.00 (0 articles) Privacy & Movement: -0.20 (1 article) Personal: 0.00 (0 articles) Expression: 0.10 (1 article) Economic & Social: -0.15 (1 article) Cultural: 0.11 (2 articles) Order & Duties: 0.00 (0 articles)
HN Discussion 19 top-level · 23 replies
verdverm 2026-02-27 15:53 UTC link
This is one of those HN posts you share internally in the hopes you can work this into your sprint
sollewitt 2026-02-27 16:05 UTC link
But does it work? I’ve used LLMs for log analysis and they have been prone to hallucinating causes: depending on the logs, the distance between cause and effect can be larger than the context window; usually multiple failures are happening at once for things to go badly wrong; and plenty of benign issues throw scary-sounding errors.
dbreunig 2026-02-27 16:10 UTC link
Check out “Recursive Language Models”, or RLMs.

I believe this method works well because it turns a long-context problem (hard for LLMs) into a coding and reasoning problem (much better!). You’re leveraging the last 18 months of coding RL by changing your scaffold.

Yizahi 2026-02-27 16:33 UTC link
We have an ongoing effort in parsing logs for our autotests to speed up debugging. It is very hard to do, mainly because there is a metric ton of false positives, or plain old noise, even in the info logs. Tracing the culprit can also be tricky, since an error in container A can be caused by the actual failure in container B, which may in turn depend on something else entirely, including hardware problems.

Basically, a surefire way to train an LLM to parse logs and detect real issues depends almost entirely on the readability and precision of the logging. And if the logging is good enough, then humans can debug faster and more reliably too :) . Unfortunately, the people reading logs and the people writing them rarely overlap in practice, and so the issue remains.

sathish316 2026-02-27 16:34 UTC link
SQL is the best exploratory interface for LLMs. But most of the observability data we have today (metrics, logs, traces) is hidden behind layers of semantics and custom syntax that make it hard for an agent to translate an explore or debug intent into the actual query language.

Large-scale data like metrics, logs, and traces is optimised for storage and access patterns, and OLAP/SQL systems may not be the best way to store or retrieve it. This is one of the reasons I’ve been working on a Text2SQL / Intent2SQL engine for observability data, to let an agent explore the schema, semantics, and syntax of any metrics or logs data. It is open sourced as the Codd Text2SQL engine - https://github.com/sathish316/codd_query_engine/

It is far from done; it currently works for Prometheus, Loki, and Splunk in a few scenarios, and is open to OSS contributions. You can see it in action, used by Claude Code to debug with metrics and logs queries:

Metric analyzer and Log analyzer skills for Claude Code - https://github.com/sathish316/precogs_sre_oncall_skills/tree...

p0w3n3d 2026-02-27 17:09 UTC link
That's contrary to my experience. Logs contain a lot of noise and unnecessary information, especially in Java, so it's best to prepare them before feeding them to an LLM. Not to mention the wasted tokens...
buryat 2026-02-27 17:37 UTC link
I just wrote a tool for reducing logs for LLM analysis (https://github.com/ascii766164696D/log-mcp)

Lots of logs contain uninteresting information that easily pollutes the context. Instead, my approach uses a TF-IDF classifier plus a BERT model on GPU to further classify log lines and reduce the number that then get fed to an LLM. The total size of the models is 50MB, and the classifier is written in Rust, so it achieves >1M lines/sec. And it finds interesting cases that can be missed by simple grepping.

I trained it on ~90GB of logs and provide scripts to retrain the models (https://github.com/ascii766164696D/log-mcp/tree/main/scripts)

It's meant to be used with the Claude Code CLI, so it can use these tools instead of trying to read the log files directly.
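A minimal, pure-Python sketch of the underlying idea (not the repo's actual TF-IDF + BERT models): score each line by the rarity of its tokens, so boilerplate repeated on every line scores low and unusual errors bubble up.

```python
import math
import re
from collections import Counter

def tokenize(line):
    # Lowercase word tokens only; numbers and punctuation are dropped so
    # timestamps and request IDs don't dominate the vocabulary.
    return re.findall(r"[a-z]+", line.lower())

def score_lines(lines):
    """Score each log line by the mean inverse document frequency of its
    tokens: lines full of rare tokens (unusual errors) score high, while
    text repeated on every line scores low."""
    n = len(lines)
    token_sets = [set(tokenize(l)) for l in lines]
    df = Counter()
    for toks in token_sets:
        df.update(toks)
    scores = []
    for toks in token_sets:
        if not toks:
            scores.append(0.0)
            continue
        idf = [math.log(n / df[t]) for t in toks]
        scores.append(sum(idf) / len(idf))
    return scores

logs = [
    "INFO request handled ok",
    "INFO request handled ok",
    "INFO request handled ok",
    "ERROR segfault in worker during teardown",
]
scores = score_lines(logs)
interesting = [l for l, s in zip(logs, scores) if s == max(scores)]
print(interesting)  # the one unusual line
```

A real pipeline would train the classifier on labeled data rather than rely on rarity alone, but the shape of the filtering step is the same: cheap scoring first, LLM only on the survivors.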

pphysch 2026-02-27 17:40 UTC link
"Logs" is doing some heavy lifting here. There's a very non-trivial step in deciding that a particular subset and schema of log messages deserves to be in its own columnar data table. It's a big optimization decision that adds complexity to your logging stack. For a narrow SaaS product that is probably a no-brainer.

I would like to see this approach compared to a more minimal approach with say, VictoriaLogs where the LLM is taught to use LogsQL, but overall it's a more "out of the box" architecture.

PaulHoule 2026-02-27 18:12 UTC link
My first take is that you could have 10 TB of logs with just a few unique lines that are actually interesting. So I am not thinking "Wow, what impressive big data you have there" but rather "if you have an accuracy of 1-10^-6 you are still overwhelmed with false positives" or "I hope your daddy is paying for your tokens"
tehjoker 2026-02-27 18:16 UTC link
Interesting article, but no investigation success rate is quoted. The engineering is interesting, but it's hard to know if there was any point without some kind of measure of the usefulness.
iririririr 2026-02-27 19:03 UTC link
Am I reading correctly that the compression is just relational records? i.e. omit the PR title and just point to it?
the_arun 2026-02-27 19:11 UTC link
The article doesn't mention which LLM they used or the total cost. If they used ChatGPT or the like, the token cost alone should be very expensive, right?
esafak 2026-02-27 19:35 UTC link
Forgive me if this is tangential to the debate, but I am trying to understand Mendral's value proposition. Is it that you save users time in setting up observability for CI? Otherwise could you not simply use gh to fetch the logs, their observability system's API or MCP, and cross check both against the code? Or is there a machine learning system that analyzes these inputs beyond merely retrieving context for the LLM? Good luck!
_boffin_ 2026-02-27 20:53 UTC link
Excited to go through this!
TKAB 2026-02-27 21:19 UTC link
That post reads as fully LLM-generated. It's basically a boastful list of numbers that are supposed to sound impressive. If there's a coherent story, it's well hidden.
gabeh 2026-02-27 21:58 UTC link
SQL has always been my favorite "loaded gun" API. If you have a control plane of RLS + role-based auth and you've got a data dictionary, it is trivial to get to a data-explorer chat interaction with an LLM doing the heavy lifting.
Jerry2 2026-02-28 02:03 UTC link
Which LLM did they use? I'd really like to learn some details behind how they setup the whole pipeline. Anyone know? Thanks!
Noumenon72 2026-02-28 02:06 UTC link
Are you affiliated with Inngest at all? Great ad for them.
rurban 2026-02-28 09:43 UTC link
I used lots of Prolog rules to analyze log files of a complicated distributed system with 20 realtime components, to find problems and root causes. Worked really well. In 2008 or so

I can't believe LLMs are that useful. Whenever a component changes or adds a log line, you edit one rule. With an LLM you need weeks of new logs and then weeks to retrain. And a high budget for the H100s

koakuma-chan 2026-02-27 16:16 UTC link
This seems really weird to me. Isn't that just using LLMs in a specific way? Why come up with a new name "RLM" instead of saying "LLM"? Nothing changes about the model.
verdverm 2026-02-27 16:27 UTC link
It can, like all the other tasks; it's not magic, and you need to make the agent's job easier by giving it good instructions, tools, and environments. That's exactly what makes humans' lives easier too.

This post is a case study that shows one way to do this for a specific task. We found an RCA for a long-standing problem with our dev boxes this week using AI. I fed Gemini Deep Research a few logs and our tech stack, and it came back with an explanation of the underlying interactions, debugging commands, and the most likely fix. It was spot on; GDR is one of the best debugging tools for problems where you don't have full understanding.

If you are curious, and perhaps a PSA, the issue was that Docker and Tailscale were competing on IP table updates, and in rare circumstances (one dev, once every few weeks), Docker DNS would get borked. The fix is to ignore Docker managed interfaces in NetworkManager so Tailscale stops trying to do things with them.

shad42 2026-02-27 16:32 UTC link
Mendral co-founder here. We built this infra so our agent can detect CI issues like flaky tests and fix them. Observing logs is useful for detecting anomalies, but we also use them to confirm a fix after the agent opens a PR (we have long coding sessions that verify a fix and re-run the CI if needed, all in the same agent loop).

So yes it works, we have customers in production.

aluzzardi 2026-02-27 16:36 UTC link
Post author here.

Yes, it works really well.

1) The latest models are radically better at this. We noticed a massive improvement in quality starting with Sonnet 4.5

2) The context issue is real. We solve this by using sub agents that read through logs and return only relevant bits to the parent agent’s context

shad42 2026-02-27 17:01 UTC link
Yeah, that sounds very similar to what we went through while building this agent. We're focused on CI logs for now because we wanted something that works really well for things like flaky tests, but we're planning to expand the context to infrastructure logs very soon.
testbjjl 2026-02-27 17:01 UTC link
> SQL is the best exploratory interface for LLMs

Any qualifiers here from your experience or documentation?

shad42 2026-02-27 17:26 UTC link
LLMs are better now at pulling in context (as opposed to feeding everything you can into the prompt). So you can expose enough query primitives to the LLM that it's able to filter out the noise.

I don't think implementing filtering at log ingestion is the right approach, because you don't know what is noise at that stage. We spent more time thinking about the schema and indexes to make sure complex queries perform at scale.
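The "query primitives" idea can be sketched with a toy in-memory table; `sqlite3` here stands in for whatever store is actually used, and the table and primitive names are hypothetical:

```python
import sqlite3

# Instead of dumping raw logs into the prompt, the agent is handed a small
# set of parameterized queries it can call against an indexed log table,
# so noise filtering happens at query time rather than at ingestion.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (job TEXT, level TEXT, msg TEXT)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    ("build-1", "INFO", "compiling"),
    ("build-1", "ERROR", "test_foo flaked"),
    ("build-2", "INFO", "compiling"),
])

PRIMITIVES = {
    # name -> parameterized SQL; the agent supplies only parameters,
    # never raw strings interpolated into the query text.
    "errors_for_job": "SELECT msg FROM logs WHERE job = ? AND level = 'ERROR'",
    "count_by_level": "SELECT level, COUNT(*) FROM logs GROUP BY level",
}

def run(name, *params):
    """Execute a named primitive with the given parameters."""
    return con.execute(PRIMITIVES[name], params).fetchall()

errors = run("errors_for_job", "build-1")
print(errors)  # only the relevant rows reach the model's context
```

(The post itself lets the agent write raw SQL; a fixed primitive set is the more conservative variant of the same interface.)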

kburman 2026-02-27 17:33 UTC link
Honestly, with recent models, these types of tasks are very much possible. Now it mostly depends on whether you are using the model correctly or not.
aluzzardi 2026-02-27 17:55 UTC link
Mendral co-founder here and author of the post.

This is an interesting approach. I definitely agree with the problem statement: if the LLM has to filter by error/fatal because of context window constraints, it will miss crucial information.

We took a different approach: we have a main agent (opus 4.6) dispatching "log research" jobs to sub agents (haiku 4.5 which is fast/cheap). The sub agent reads a whole bunch of logs and returns only the relevant parts to the parent agent.

This is exactly how coding agents (e.g. Claude Code) do it as well, except instead of having sub agents use grep/read/tail, ours use plain SQL.
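The main-agent/sub-agent split described here can be sketched as follows; `call_planner` and `call_reader` are hypothetical stand-ins for the real Opus and Haiku API calls:

```python
def call_planner(incident):
    # An expensive planner model would produce focused research tasks;
    # here we fake one task per affected service.
    return [f"search logs of {svc} around {incident['time']}"
            for svc in incident["services"]]

def call_reader(task, logs):
    # A cheap reader model would scan raw lines; we fake it with a filter.
    # Crucially, it returns a short summary, never the raw logs.
    relevant = [l for l in logs if "ERROR" in l]
    return f"{task}: {len(relevant)} error line(s)"

def investigate(incident, logs_by_service):
    """Dispatch one reader per planner task; only the summaries re-enter
    the main agent's context window."""
    tasks = call_planner(incident)
    return [call_reader(task, logs_by_service[svc])
            for task, svc in zip(tasks, incident["services"])]

incident = {"time": "12:04", "services": ["api", "worker"]}
logs_by_service = {
    "api": ["INFO ok", "ERROR timeout calling db"],
    "worker": ["INFO ok"],
}
summaries = investigate(incident, logs_by_service)
print(summaries)
```

The point of the pattern is the asymmetry: the bulk of the tokens are read by the cheap model, while the expensive model only ever sees distilled summaries.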

mr-karan 2026-02-27 17:57 UTC link
Agreed on SQL being the best exploratory interface for agents. I've been building Logchef[1], an open-source log viewer for ClickHouse, and found the same thing — when you give an LLM the table schema, it writes surprisingly good ClickHouse SQL. I support both a simpler DSL (LogchefQL, compiles to type-aware SQL on the backend) and raw SQL, and honestly raw SQL wins for the agent use case — more flexible, more training data in the corpus.

I took this a few steps further beyond the web UI's AI assistant. There's an MCP server[2] so any AI assistant (Claude Desktop, Cursor, etc.) can discover your log sources, introspect schemas, and query directly. And a Rust CLI[3] with syntax highlighting and `--output jsonl` for piping — which means you can write a skill[4] that teaches the agent to triage incidents by running `logchef query` and `logchef sql` in a structured investigation workflow (count → group → sample → pivot on trace_id).

The interesting bit is this ends up very similar to what OP describes — an agent that iteratively queries logs to narrow down root cause — except it's composable pieces you self-host rather than an integrated product.

[1] https://github.com/mr-karan/logchef

[2] https://github.com/mr-karan/logchef-mcp

[3] https://logchef.app/integration/cli/

[4] https://github.com/mr-karan/logchef/tree/main/.agents/skills...

ManuelKiessling 2026-02-27 17:58 UTC link
https://github.com/dx-tooling/platform-problem-monitoring-co... could have a useful approach, too: it finds patterns in log lines and gives you a summary in the sense of "these 500 lines are all technically different, but they are all saying the same thing".
jcgrillo 2026-02-27 18:02 UTC link
Do you think it could do anything interesting with a highly compressed representation? CLP can apparently achieve a 169x compression ratio:

https://github.com/y-scope/clp

https://www.uber.com/blog/reducing-logging-cost-by-two-order...

hinkley 2026-02-27 18:20 UTC link
I think there are too many expectations around what logging is for, and getting everyone on the same page is difficult.

Meanwhile stats have fewer expectations, and moving signal out of the logs into stats is a much much smaller battle to win. It can’t tell you everything, but what it can tell you is easier to make unambiguous.

Over time I got people to stop pulling up Splunk as an automatic reflex and start pulling up Grafana instead for triage.

masterj 2026-02-27 18:23 UTC link
> There's a very non-trivial step in deciding that a particular subset and schema of log messages deserves to be in its own columnar data table.

IIUC this is addressed with the ClickHouse JSON type which can promote individual fields in unstructured data into its own column: https://clickhouse.com/blog/a-new-powerful-json-data-type-fo...

Parquet is getting a VARIANT data type which can do the same thing (called "shredding") but in a standards-based way: https://parquet.apache.org/blog/2026/02/27/variant-type-in-a...

jcgrillo 2026-02-27 18:27 UTC link
Yeah this is my experience with logs data. You only actually care about O(10) lines per query, usually related by some correlation ID. Or, instead of searching you're summarizing by counting things. In that case, actually counting is important ;).

In this piece though--and maybe I need to read it again--I was under the impression that the LLM's "interface" to the logs data is queries against clickhouse. So long as the queries return sensibly limited results, and it doesn't go wild with the queries, that could address both concerns?

aluzzardi 2026-02-27 18:48 UTC link
Mendral co-founder and post author here.

I agree with your statement and explained in a few other comments how we're doing this.

tldr:

- Something happens that needs investigating

- Main (Opus) agent makes focused plan and spawns sub agents (Haiku)

- They use ClickHouse queries to grab only relevant pieces of logs and return summaries/patterns

This is what you would do manually: you're not going to read through 10 TB of logs when something happens; you make a plan, open a few tabs and start doing narrow, focused searches.

aluzzardi 2026-02-27 19:33 UTC link
There are 2 layers of compression:

- ZSTD (actual data compression)

- De-duplication (i.e. what you're saying)

Although AFAIK it's not "just point to it" but rather storing sorted data and being able to say "the next 2M rows have the same PR title"
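A toy illustration of the sorted-data effect described above, with zlib standing in for ZSTD: sorting rows so repeated values sit in long runs makes generic compression dramatically better than the same data in arbitrary order.

```python
import random
import zlib

# Simulated low-cardinality column: a PR title shared by many log lines.
titles = ["fix flaky test", "bump deps", "refactor ci"] * 2000

random.seed(0)
shuffled = titles[:]
random.shuffle(shuffled)  # arrival order: values interleaved

blob_shuffled = "\n".join(shuffled).encode()
blob_sorted = "\n".join(sorted(titles)).encode()  # values in long runs

# Sorted runs compress to little more than a run-length encoding,
# while the shuffled version must spend bits on every transition.
print(len(zlib.compress(blob_shuffled)), len(zlib.compress(blob_sorted)))
```

This is essentially what a columnar store's sort key buys you: the compressor sees "the next 2M rows have the same PR title" as one long match.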

shad42 2026-02-27 20:07 UTC link
There is a cost associated with each investigation (that the Mendral agent is doing), and we spend time tuning the orchestration between agents. Yes, it's expensive, but we're making money on top of what it costs us. So far we've been able to take the cost down while increasing the relevance of each root cause analysis.

We're writing another post about that specifically; we'll publish it sometime next week

shad42 2026-02-27 20:08 UTC link
Mendral is replacing a human platform engineer. It debugs the CI logs, looks at the associated commit, looks at the implementation of the tests, etc. It then proposes fixes and takes care of opening a PR.

We wrote about how this works for PostHog: https://www.mendral.com/blog/ci-at-scale

shad42 2026-02-27 21:36 UTC link
We did not want to make the post engineering-focused, but we have 18 companies in production today (we wrote about PostHog in the blog). At some point we should post some case studies. The metric we track for usefulness is our monthly revenue :)
shad42 2026-02-27 22:59 UTC link
100% and LLMs have tons of related training data
aluzzardi 2026-02-28 04:25 UTC link
It started with Sonnet 4.0 as a single agent and now it’s a mix of Opus 4.6 and Haiku 4.5 agents.

Opus plans the investigation and orchestrates the searches.

Haiku is the one actually querying ClickHouse and returning relevant bits

shad42 2026-02-28 05:04 UTC link
In some ways: we use their product and they use Mendral
Editorial Channel
What the content says
+0.15
Article 26 Education
Medium Advocacy
Editorial
+0.15
SETL
+0.09

Content provides extensive technical education accessible to anyone: denormalization rationale, compression algorithms, query optimization patterns, rate-limiting strategies, and durable execution design. Detailed explanations enable readers to learn and apply these principles.

+0.10
Article 19 Freedom of Expression
Medium Advocacy
Editorial
+0.10
SETL
0.00

Content shares technical knowledge publicly, consistent with free expression and information dissemination. Authors openly publish engineering insights, metrics, and architectural patterns.

+0.10
Article 27 Cultural Participation
Medium Advocacy
Editorial
+0.10
SETL
+0.07

Authors openly share technical and intellectual culture (engineering practices, architectural approaches, decade of CI systems experience). Work is attributed by name to Andrea Luzzardi and Mendral.

-0.05
Preamble Preamble
Low
Editorial
-0.05
SETL
ND

Content does not engage with human dignity, freedom, or universal rights themes. Focus is purely technical product architecture.

-0.15
Article 23 Work & Equal Pay
Medium Framing
Editorial
-0.15
SETL
ND

Content celebrates automation of engineering investigation work ('an always-on AI DevOps engineer that diagnoses CI failures, catches flaky tests, and fixes them') without discussing labor implications, worker displacement, fair wages, or job impact of workplace automation.

-0.20
Article 12 Privacy
Medium Framing
Editorial
-0.20
SETL
ND

Content celebrates massive data collection (5.31 TiB uncompressed, 48 metadata columns per log line, 300M+ daily lines) without discussing privacy governance, consent, or privacy safeguards. The framing emphasizes technical achievement while omitting privacy rights considerations.

ND
Article 1 Freedom, Equality, Brotherhood

No engagement.

ND
Article 2 Non-Discrimination

No engagement.

ND
Article 3 Life, Liberty, Security

No engagement.

ND
Article 4 No Slavery

No engagement.

ND
Article 5 No Torture

No engagement.

ND
Article 6 Legal Personhood

No engagement.

ND
Article 7 Equality Before Law

No engagement.

ND
Article 8 Right to Remedy

No engagement.

ND
Article 9 No Arbitrary Detention

No engagement.

ND
Article 10 Fair Hearing

No engagement.

ND
Article 11 Presumption of Innocence

No engagement.

ND
Article 13 Freedom of Movement

No engagement.

ND
Article 14 Asylum

No engagement.

ND
Article 15 Nationality

No engagement.

ND
Article 16 Marriage & Family

No engagement.

ND
Article 17 Property

No engagement.

ND
Article 18 Freedom of Thought

No engagement.

ND
Article 20 Assembly & Association

No engagement.

ND
Article 21 Political Participation

No engagement.

ND
Article 22 Social Security

No engagement.

ND
Article 24 Rest & Leisure

No engagement.

ND
Article 25 Standard of Living

No engagement.

ND
Article 28 Social & International Order

No engagement.

ND
Article 29 Duties to Community

No engagement.

ND
Article 30 No Destruction of Rights

No engagement.

Structural Channel
What the site does
Element Modifier Affects Note
+0.10
Article 19 Freedom of Expression
Medium Advocacy
Structural
+0.10
Context Modifier
ND
SETL
0.00

Blog is publicly accessible without paywalls or authentication, enabling free information access.

+0.10
Article 26 Education
Medium Advocacy
Structural
+0.10
Context Modifier
ND
SETL
+0.09

Blog format and open access enable free technical education without barriers or prerequisites.

+0.05
Article 27 Cultural Participation
Medium Advocacy
Structural
+0.05
Context Modifier
ND
SETL
+0.07

Blog format with clear author attribution supports participation in and contribution to technical culture.

ND
Preamble Preamble
Low

N/A

ND
Article 1 Freedom, Equality, Brotherhood

N/A

ND
Article 2 Non-Discrimination

N/A

ND
Article 3 Life, Liberty, Security

N/A

ND
Article 4 No Slavery

N/A

ND
Article 5 No Torture

N/A

ND
Article 6 Legal Personhood

N/A

ND
Article 7 Equality Before Law

N/A

ND
Article 8 Right to Remedy

N/A

ND
Article 9 No Arbitrary Detention

N/A

ND
Article 10 Fair Hearing

N/A

ND
Article 11 Presumption of Innocence

N/A

ND
Article 12 Privacy
Medium Framing

N/A

ND
Article 13 Freedom of Movement

N/A

ND
Article 14 Asylum

N/A

ND
Article 15 Nationality

N/A

ND
Article 16 Marriage & Family

N/A

ND
Article 17 Property

N/A

ND
Article 18 Freedom of Thought

N/A

ND
Article 20 Assembly & Association

N/A

ND
Article 21 Political Participation

N/A

ND
Article 22 Social Security

N/A

ND
Article 23 Work & Equal Pay
Medium Framing

N/A

ND
Article 24 Rest & Leisure

N/A

ND
Article 25 Standard of Living

N/A

ND
Article 28 Social & International Order

N/A

ND
Article 29 Duties to Community

N/A

ND
Article 30 No Destruction of Rights

N/A

Supplementary Signals
How this content communicates, beyond directional lean. Learn more
Epistemic Quality
How well-sourced and evidence-based is this content?
0.65 medium claims
Sources
0.7
Evidence
0.8
Uncertainty
0.5
Purpose
0.6
Propaganda Flags
2 manipulative rhetoric techniques found
2 techniques detected
loaded language
Framing investigation as mystery-solving adventure: 'our agent traced a flaky test...following a trail from job metadata to raw log output...the agent follows a trail, query after query, as it narrows in on a root cause.' Appeals to reader's sense of detective work and discovery.
appeal to authority
Authors invoke decade of experience: 'We spent a decade building and scaling CI systems at Docker and Dagger' establishes credibility through tenure at well-known infrastructure companies.
Emotional Tone
Emotional character: positive/negative, intensity, authority
measured
Valence
+0.4
Arousal
0.3
Dominance
0.5
Transparency
Does the content identify its author and disclose interests?
0.50
✓ Author ✗ Conflicts ✗ Funding
More signals: context, framing & audience
Solution Orientation
Does this content offer solutions or only describe problems?
0.70 solution oriented
Reader Agency
0.5
Stakeholder Voice
Whose perspectives are represented in this content?
0.20 1 perspective
Speaks: corporation
About: individuals
Temporal Framing
Is this content looking backward, at the present, or forward?
present immediate
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
technical high jargon domain specific
Longitudinal 764 HN snapshots · 59 evals
+1 0 −1 HN
Audit Trail 79 entries
2026-03-02 10:13 eval_success Evaluated: Mild positive (0.19) - -
2026-03-02 10:13 eval Evaluated by deepseek-v3.2: +0.19 (Mild positive) 11,583 tokens +0.19
2026-03-02 09:59 eval_success Evaluated: Neutral (0.00) - -
2026-03-02 09:59 eval Evaluated by deepseek-v3.2: 0.00 (Neutral) 11,771 tokens -0.23
2026-03-02 09:59 rater_validation_warn Validation warnings for model deepseek-v3.2: 0W 31R - -
2026-03-02 01:02 dlq_auto_replay DLQ auto-replay: message 98030 re-enqueued - -
2026-03-01 00:03 dlq_auto_replay DLQ auto-replay: message 97995 re-enqueued - -
2026-02-28 23:04 dlq Dead-lettered after 1 attempts: We gave terabytes of CI logs to an LLM - -
2026-02-28 23:04 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 23:03 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 22:51 dlq Dead-lettered after 1 attempts: We gave terabytes of CI logs to an LLM - -
2026-02-28 22:51 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 22:29 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 22:06 dlq Dead-lettered after 1 attempts: We gave terabytes of CI logs to an LLM - -
2026-02-28 22:06 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 22:02 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 21:09 dlq Dead-lettered after 1 attempts: We gave terabytes of CI logs to an LLM - -
2026-02-28 21:09 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 20:56 eval_failure Evaluation failed: AbortError: The operation was aborted - -
2026-02-28 16:39 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 16:39 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 15:17 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 15:17 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 15:14 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 15:14 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 15:12 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 13:15 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 12:50 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 12:47 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 12:37 eval Evaluated by claude-haiku-4-5-20251001: -0.01 (Neutral) -0.16
2026-02-28 12:35 eval Evaluated by deepseek-v3.2: +0.23 (Mild positive) 11,097 tokens +0.22
2026-02-28 12:10 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 11:23 eval Evaluated by claude-haiku-4-5-20251001: +0.15 (Mild positive) +0.12
2026-02-28 10:23 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 10:21 eval Evaluated by deepseek-v3.2: +0.01 (Neutral) 12,163 tokens -0.09
2026-02-28 10:15 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 10:01 eval Evaluated by claude-haiku-4-5-20251001: +0.03 (Neutral) +0.10
2026-02-28 10:01 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 09:56 eval Evaluated by claude-haiku-4-5-20251001: -0.07 (Neutral)
2026-02-28 09:36 eval Evaluated by deepseek-v3.2: +0.10 (Mild positive) 11,739 tokens -0.14
2026-02-28 08:56 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 08:28 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 07:54 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 07:35 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 07:30 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 07:27 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 06:35 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 06:30 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 06:27 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 06:16 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 06:03 eval Evaluated by deepseek-v3.2: +0.24 (Mild positive) 11,362 tokens +0.23
2026-02-28 05:51 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 05:38 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 05:37 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 05:27 eval Evaluated by deepseek-v3.2: +0.01 (Neutral) 11,452 tokens
2026-02-28 05:20 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 04:54 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 04:53 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 04:40 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 04:29 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 04:24 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 04:05 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 03:38 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 03:35 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 03:29 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 03:17 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 02:55 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 02:26 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 02:26 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 02:14 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 02:09 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 01:57 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 01:46 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 01:37 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog with no rights stance
2026-02-28 01:30 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 01:22 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 01:13 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
reasoning
Tech blog with no rights stance
2026-02-28 01:01 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED, neutral tech blog post
2026-02-28 00:55 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral)
reasoning
ED, neutral tech blog post