+0.37 Show HN: Steerling-8B, a language model that can explain any token it generates (www.guidelabs.ai, S: +0.36)
324 points by adebayoj 6 days ago | 90 comments on HN | Moderate positive · Contested Editorial · v3.7 · 2026-02-26 04:38:59
Summary Interpretability & Scientific Access Advocates
Guide Labs' release of Steerling-8B emphasizes interpretable AI as a mechanism for transparency, user agency, and scientific participation. The open-source distribution of model weights, code, and interactive tools directly advances freedom of expression (Article 19), scientific participation (Article 27), education (Article 26), and broader information access (Article 19). The model's design—enabling tracing of any output to input context, concepts, and training data—strengthens user capacity for critical examination of AI reasoning and supports data privacy awareness (Article 12). Overall, the content advocates for democratizing access to advanced AI technology and scientific understanding, though it does not explicitly address social welfare, labor, or other rights domains.
Article Heatmap
Preamble: +0.38 — Preamble
Article 1: +0.28 — Freedom, Equality, Brotherhood
Article 2: No Data — Non-Discrimination
Article 3: No Data — Life, Liberty, Security
Article 4: No Data — No Slavery
Article 5: No Data — No Torture
Article 6: No Data — Legal Personhood
Article 7: No Data — Equality Before Law
Article 8: No Data — Right to Remedy
Article 9: No Data — No Arbitrary Detention
Article 10: No Data — Fair Hearing
Article 11: No Data — Presumption of Innocence
Article 12: +0.37 — Privacy
Article 13: +0.47 — Freedom of Movement
Article 14: +0.28 — Asylum
Article 15: No Data — Nationality
Article 16: No Data — Marriage & Family
Article 17: +0.33 — Property
Article 18: +0.38 — Freedom of Thought
Article 19: +0.84 — Freedom of Expression
Article 20: +0.27 — Assembly & Association
Article 21: +0.33 — Political Participation
Article 22: +0.28 — Social Security
Article 23: No Data — Work & Equal Pay
Article 24: No Data — Rest & Leisure
Article 25: No Data — Standard of Living
Article 26: +0.62 — Education
Article 27: +0.87 — Cultural Participation
Article 28: +0.23 — Social & International Order
Article 29: +0.28 — Duties to Community
Article 30: No Data — No Destruction of Rights
Negative Neutral Positive No Data
Aggregates
Editorial Mean +0.37 Structural Mean +0.36
Weighted Mean +0.46 Unweighted Mean +0.41
Max +0.87 Article 27 Min +0.23 Article 28
Signal 15 No Data 16
Volatility 0.20 (Medium)
Negative 0 Channels E: 0.6 S: 0.4
SETL +0.01 Editorial-dominant
FW Ratio 59% 41 facts · 29 inferences
Evidence 25% coverage
2 High · 8 Medium · 5 Low · 16 No Data
Theme Radar
Foundation: 0.33 (2 articles) · Security: 0.00 (0 articles) · Legal: 0.00 (0 articles) · Privacy & Movement: 0.37 (3 articles) · Personal: 0.35 (2 articles) · Expression: 0.48 (3 articles) · Economic & Social: 0.28 (1 article) · Cultural: 0.74 (2 articles) · Order & Duties: 0.26 (2 articles)
HN Discussion 19 top-level · 21 replies
pbmango 2026-02-24 03:24 UTC link
This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it to be either solved or too out of reach to bother stopping and thinking about.
brendanashworth 2026-02-24 03:25 UTC link
Is there a reason people don't use SHAP [1] to interpret language models more often? The in-context attribution of outputs seems very similar.

[1] https://shap.readthedocs.io/en/latest/

great_psy 2026-02-24 03:49 UTC link
Maybe I’m not creative enough to see the potential, but what value does this bring?

Given the example I saw about CRISPR, what does this model give over a different, non-explaining model in the output? Does it really make me more confident in the output if I know the data came from Arxiv or Wikipedia?

I find the LLM outputs are subtly wrong, not obviously wrong.

gormen 2026-02-24 05:09 UTC link
Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them. Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.
in-silico 2026-02-24 06:30 UTC link
Either I'm missing something or this is way overstated.

Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.

They also use a loss that aligns the SAE's activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.

1: https://thezvi.substack.com/p/the-most-forbidden-technique

7777777phil 2026-02-24 08:21 UTC link
If this decomposition actually holds, it's the first model where you could show a regulator why it produced a given output.
potato-peeler 2026-02-24 09:43 UTC link
Looks very interesting. Is there a published paper/article on your algorithm? Would like to take a stab at implementing this on my own.

I could find this [0], but not sure if that represents the entire system? (Apologies, I am not that well versed in ML)

[0] - https://www.guidelabs.ai/post/scaling-interpretable-models-8...

andy12_ 2026-02-24 09:53 UTC link
This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
deepdarkforest 2026-02-24 11:58 UTC link
Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.

How granular can you get the source data attribution? Down to, let's say, individual Wikipedia topics? Probably not URLs?

Would be interested to see this scale to 30/70B.

whinvik 2026-02-24 12:09 UTC link
Looks very interesting. Can you comment on why you think this model can give comparable performance with less training data?
crimsonnoodle58 2026-02-24 12:12 UTC link
So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?
pu_pe 2026-02-24 12:24 UTC link
Looks neat and original, congrats!

I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.

Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?

ZeroAurora 2026-02-24 13:00 UTC link
Always happy to see improvements on explainable LLMs. Congrats!
rippeltippel 2026-02-24 13:03 UTC link
msteffen 2026-02-24 14:53 UTC link
In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.

It made me cautiously optimistic that all of Anthropic’s work on alignment, which they did for AI safety, is actually the cause of Claude Code’s comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I’m suddenly really interested in experiments like this.

kamranjon 2026-02-24 15:02 UTC link
I'm really interested in using this but wonder if the unique architecture means that it will not be able to be converted to a GGUF and used by ollama or llama.cpp? I certainly would understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local ai server (basically just ollama + tailscale) and see how it works as a regular model.
schopra909 2026-02-24 15:19 UTC link
This is very cool. Side note, I really dig the JavaScript animations on the causal block diffusion blog post. Made the concept immediately clear
killerstorm 2026-02-24 17:18 UTC link
This seems to be too coarse-grained to be useful: all sciency content will be "analytical" and associate with sources like ArXiv.

But there might be bad, malicious articles on ArXiv, so it doesn't really say anything about veracity.

Perhaps this might help to detect some problems like prompt injection - but then it might be more interesting to see those examples.

audunw 2026-02-25 06:55 UTC link
The one big thing missing from LLMs is the ability to express how confident it is in the truth of what it’s saying.

Perhaps this could be a step in that direction, if we can associate the attribution with likelihood of being true. E.g., Arxiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I'm guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?

voidhorse 2026-02-24 03:57 UTC link
It makes the black box slightly more transparent. Knowing more in this regard allows us to be more precise—you go from prompt-tweak witchcraft and divination to something more like science and precise method.
dwohnitmok 2026-02-24 04:40 UTC link
SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.

SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.

It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits" if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network. Neurons in the deeper layers often do qualitatively different things than earlier layers, and the ways they compose can completely confuse SHAP.

It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.

I don't know the specifics of what this particular model's approach is.

But SHAP unfortunately does not work for LLMs at all.
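The coalition-sampling idea described above can be made concrete with a small permutation-sampling estimator of Shapley values over input tokens. This is an illustrative sketch only; `toy_predict` is an invented stand-in for a real scorer (e.g. the log-probability of a target output with the excluded tokens masked):

```python
import random

def sampled_shapley(predict, tokens, n_samples=200, seed=0):
    """Estimate each token's Shapley value by sampling random permutations.

    `predict` maps a subset of tokens (a list) to a scalar score; each
    token is credited with its marginal contribution when added to the
    coalition of tokens that precede it in a random ordering.
    """
    rng = random.Random(seed)
    values = {t: 0.0 for t in tokens}
    for _ in range(n_samples):
        order = tokens[:]
        rng.shuffle(order)
        included, prev = [], predict([])
        for tok in order:
            included.append(tok)
            score = predict(included)
            values[tok] += score - prev  # marginal contribution
            prev = score
    return {t: v / n_samples for t, v in values.items()}

# Toy scorer: the score is 1.0 only when "hot" and "valve" co-occur,
# so those two tokens split the credit and "the" gets none.
def toy_predict(subset):
    return 1.0 if {"hot", "valve"} <= set(subset) else 0.0

attr = sampled_shapley(toy_predict, ["hot", "valve", "the"])
```

The exponential blow-up the parent describes is visible here: exact Shapley values would require every subset, while sampling trades that for estimator variance that grows quickly with input size.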

adebayoj 2026-02-24 07:56 UTC link
op here, I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for that token (or group of tokens). In fact, we can answer the question of what training data caused that output to be generated, too! We force this as a constraint in the architecture and in the loss function when you train the model. In fact, you can get the high-level reasons for a model's answer on complex problems.
adebayoj 2026-02-24 08:22 UTC link
Most interpretability techniques have yet to be shown to be useful for everyday model pipelines. However, the field is working hard to change this.
adebayoj 2026-02-24 08:33 UTC link
It does :) We constrained the model to do exactly this during training: https://www.guidelabs.ai/post/scaling-interpretable-models-8....
adebayoj 2026-02-24 08:51 UTC link
You are missing a few things, but you got some things right.

1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised and applied post hoc. Here, we supervise 33k of the concepts with concepts that we carefully curated. We then have an unsupervised component (similar to a top-k SAE) that we constrain to be independent from the supervised concepts. We don't do any of this post hoc, by the way; this is a key constraint. I'll get back to this. We train that unsupervised layer along with the model during pre-training.

2) Are the concepts or features causally influential for the output? We directly use the combination of the concepts for the lm head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. It is not just causally related, it is constrained to do so.

3) Other points: we also make it so that you can trace the model outputs to the training data. This is an underrated interpretability knob. You know where, and what data, caused your model to learn a particular feature.

This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs:
- If you train an SAE twice, on the same data + model, you'll get two different sets of features.
- In fact, there is no reason why the model should pick features that are causally influential for the output.
- ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model.
Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.

Happy to more concretely justify the above. Great observations!
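For intuition, the closed-form attribution described in point 2 can be sketched in a few lines of numpy. This is not Steerling's actual code, just a toy illustration of the property that when logits are a linear function of concept activations, each concept's contribution to a logit is exact and the contributions sum to the logit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, vocab = 6, 10

# Concept activations for one generated position (stand-in for the
# constrained supervised + unsupervised layer; here just random,
# non-negative values).
c = np.maximum(rng.normal(size=n_concepts), 0.0)

# Linear LM head: logits are a linear function of concept activations.
W = rng.normal(size=(vocab, n_concepts))
logits = W @ c

# Closed-form attribution: concept j's contribution to token t's logit
# is exactly W[t, j] * c[j], and the contributions sum to the logit.
token = int(np.argmax(logits))
contrib = W[token] * c
top_concept = int(np.argmax(np.abs(contrib)))
```

With a nonlinearity before the head (as the comment mentions), the decomposition above holds at whatever point the map to logits is linear; the toy keeps the head purely linear to make the identity exact.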

yorwba 2026-02-24 09:14 UTC link
I doubt that a regulator would be satisfied by the kinds of explanations this provides and the interventions it enables.

Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.

The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?

If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.

adebayoj 2026-02-24 10:00 UTC link
You are exactly right, it is guiding the model, during training, with concepts and the dictionary. This is important because dictionary learning for interpretability (post hoc) is not currently reliable: https://www.arxiv.org/abs/2602.14111
adebayoj 2026-02-24 10:01 UTC link
Yes, that is the post that has the most up to date details of the model architecture. Take a look at this: https://github.com/guidelabs/steerling. It has the scaffolding for what you need :)
adebayoj 2026-02-24 12:13 UTC link
Down to the very exact text chunk in a document! Check this out for an idea of what smaller versions of this style of model can do: https://www.guidelabs.ai/post/prism/. We'll have more to say soon about it. We can trace any generation to 11B chunks (not documents, but actual chunks in the training data).
theMMaI 2026-02-24 12:19 UTC link
Only if there's a commercial incentive to do so methinks. Just one of the things where I expect a legal catch-up is needed to get companies to do the right thing.
adebayoj 2026-02-24 12:21 UTC link
You got it exactly right :) And you can update the ATTRIBUTION.md to have it NOT rely on open-source projects that have been compromised. Imagine asking Claude Code to write a package/function in the style of a codebase that you care about, or forcing it to ALWAYS rely on some internal packages that you care about. The possibilities are endless when you insert such knobs into models.
adebayoj 2026-02-24 12:40 UTC link
Great questions. We weren't quite explicit about the training data attribution process; we'll discuss it in more detail in future work. We can track down which parts of the training data were interpolated to create that sentence. For those training data sentences, we then compare the concepts between the generated text and the training text.

We can attribute to exact sentences and chunks in the training data. For the first release, we are sharing only concept similarities. Over the coming weeks, we'll share and discuss how you can actually map to the exact training sentence and chunk with the model.

For a technical overview of how some of these models work, check this link out: https://www.guidelabs.ai/post/prism/
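A rough, hypothetical sketch of what "concept similarity" attribution could look like. This is invented for illustration; `attribute_by_concepts`, the source labels, and the percentage aggregation are assumptions, not Guide Labs' published method:

```python
import numpy as np

def attribute_by_concepts(gen_vec, chunk_vecs, sources):
    """Score training chunks by cosine similarity of concept activations,
    then aggregate the (non-negative) scores per source as percentages."""
    gen = gen_vec / np.linalg.norm(gen_vec)
    sims = {}
    for vec, src in zip(chunk_vecs, sources):
        s = max(float(gen @ (vec / np.linalg.norm(vec))), 0.0)
        sims[src] = sims.get(src, 0.0) + s
    total = sum(sims.values()) or 1.0
    return {src: 100.0 * s / total for src, s in sims.items()}

rng = np.random.default_rng(1)
gen_vec = rng.normal(size=8)
chunk_vecs = [gen_vec + 0.1 * rng.normal(size=8),  # near-duplicate chunk
              rng.normal(size=8),
              rng.normal(size=8)]
shares = attribute_by_concepts(gen_vec, chunk_vecs,
                               ["wikipedia", "arxiv", "arxiv"])
```

This would yield per-source percentages of the kind quoted upthread ("24% Wikipedia, 23% Arxiv"); the actual Steerling mechanism traces to exact chunks rather than a similarity heuristic.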

KingOfCoders 2026-02-24 13:30 UTC link
Not as long as all developers add an ATTRIBUTION.md citing all open source projects whose source they read, all companies they worked for and trained them, and all Stack Overflow answers they have used to write the code.
adebayoj 2026-02-24 13:36 UTC link
We train the model with `explanations`. Most training asks the model to predict the next token or group of tokens. Our training says, predict the next group of tokens (causal diffusion), but also these tokens should be about {sports/art/coding/etc}. So in addition to token supervision, the model gets concept level supervision. The model is forced to more quickly learn these high level concepts.
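The described training objective (next-token prediction plus concept-level supervision) can be sketched as a weighted sum of two cross-entropy losses. A minimal illustration; the `alpha` weighting and the shapes are assumptions, not details from the post:

```python
import numpy as np

def cross_entropy(logits, target):
    # Standard log-softmax cross-entropy for one example.
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def combined_loss(token_logits, target_token,
                  concept_logits, target_concept, alpha=0.5):
    """Token-prediction loss plus a concept-classification loss
    (e.g. "these tokens should be about sports/art/coding")."""
    return (cross_entropy(token_logits, target_token)
            + alpha * cross_entropy(concept_logits, target_concept))

# One toy position: 3-token vocabulary, 2 candidate concepts.
token_logits = np.array([0.1, 2.0, -1.0])
concept_logits = np.array([1.5, 0.0])
loss = combined_loss(token_logits, 1, concept_logits, 0)
```

The concept term is what gives the model gradient signal toward the curated concept labels during pre-training, rather than hoping they emerge unsupervised.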
abcd_f 2026-02-24 13:51 UTC link
TC still exists, huh?
monocasa 2026-02-24 15:10 UTC link
Not immediately, but it's not a much larger amount of work for llama than a new foundational model which typically has a tweaked compute graph.
rao-v 2026-02-24 15:13 UTC link
+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!

Having said that, I worry that you run into Illusion of Consciousness issues where the model changes attribution from “sandbagging” to “unctuous” when you control its response, because the response is generated outside of the attribution modules (I don’t quite understand how cleanly everything flows through the concept modules and the residual). Either way this is a sophisticated problem to have. Would love to see if this can be trained to parity with modern 8B models.

idiotsecant 2026-02-24 15:38 UTC link
I wonder the opposite, if actual AGI would need to be less aligned. Alignment is basically the process of pruning interesting behavior out of the model to make a product.
jzig 2026-02-24 15:48 UTC link
What does alignment even mean? What is being aligned and what is it aligning to?
Macuyiko 2026-02-24 17:48 UTC link
The input attribution part is interesting though, but I do wonder to which extent that is just assigning some sort of SHAP values to the input tokens, in which case it should be pretty portable to any kind of model.
Editorial Channel
What the content says
+0.55
Article 19 Freedom of Expression
High Advocacy Framing
Editorial
+0.55
SETL
-0.25

Content strongly advocates for freedom of expression by releasing a powerful, 8B-parameter model with full transparency and no content restrictions, enabling any user to generate and express ideas without centralized editorial control.

+0.55
Article 27 Cultural Participation
High Advocacy Framing
Editorial
+0.55
SETL
-0.17

Content strongly advocates for participation in scientific advancement and cultural life by releasing interpretable model that enables scientific community to understand AI internals and contribute to knowledge about model behavior.

+0.45
Article 13 Freedom of Movement
Medium Framing Advocacy
Editorial
+0.45
SETL
-0.16

Content advocates for freedom of movement within information and concept spaces. Open-source model enables users to deploy and use the model across jurisdictions and contexts without centralized gatekeeping.

+0.45
Article 26 Education
Medium Advocacy Framing
Editorial
+0.45
SETL
-0.16

Content advocates for education and participation in scientific advancement by releasing interpretable AI technology that enables anyone to learn how large language models work and contribute to AI research.

+0.40
Preamble Preamble
Medium Advocacy Framing
Editorial
+0.40
SETL
+0.14

Content emphasizes human dignity through interpretability and transparency in AI systems, treating humans as capable of understanding and controlling AI behavior. Advocates for knowledge accessibility and scientific shared understanding.

+0.40
Article 18 Freedom of Thought
Medium Advocacy
Editorial
+0.40
SETL
+0.14

Content advocates for freedom of thought and belief by designing AI systems that make their reasoning transparent and auditable, enabling users to verify and potentially object to model outputs.

+0.35
Article 12 Privacy
Medium Framing
Editorial
+0.35
SETL
-0.14

Content demonstrates commitment to privacy by making training data provenance transparent and traceable, allowing users to understand what sources influenced model outputs.

+0.35
Article 17 Property
Medium Framing
Editorial
+0.35
SETL
+0.13

Content demonstrates commitment to property rights and data ownership transparency by making training data sources explicitly traceable, enabling users to understand intellectual property inputs.

+0.35
Article 21 Political Participation
Medium Advocacy
Editorial
+0.35
SETL
+0.13

Content advocates for participation in scientific decision-making by releasing detailed technical information about model architecture, performance metrics, and interpretability mechanisms, enabling users to evaluate and contribute to AI development.

+0.30
Article 1 Freedom, Equality, Brotherhood
Medium Framing
Editorial
+0.30
SETL
+0.12

Content implicitly affirms equal dignity by framing interpretability as a universal capability applicable to all users regardless of technical background, supporting equal participation in AI governance.

+0.30
Article 14 Asylum
Low Framing
Editorial
+0.30
SETL
+0.12

Content implicitly supports asylum and protection by providing transparent, open tools that any person can access and use, regardless of national origin or status.

+0.30
Article 22 Social Security
Low Framing
Editorial
+0.30
SETL
+0.12

Content implicitly supports social and economic rights through open-source release enabling anyone to participate in AI development and knowledge creation.

+0.30
Article 29 Duties to Community
Low Framing
Editorial
+0.30
SETL
+0.12

Content implicitly supports community responsibility by releasing interpretable AI that enables users to understand and verify model behavior, placing interpretability responsibility on both developer and user.

+0.25
Article 20 Assembly & Association
Low Framing
Editorial
+0.25
SETL
-0.12

Content implicitly supports freedom of peaceful assembly by providing transparent tools that enable collaborative development and shared scientific understanding around AI interpretability.

+0.25
Article 28 Social & International Order
Low Framing
Editorial
+0.25
SETL
+0.11

Content implicitly supports social order through transparent, interpretable AI that reduces risk of harmful model behavior going undetected or uncontrolled.

ND
Article 2 Non-Discrimination
ND

No observable content addressing discrimination or specific protected characteristics.

ND
Article 3 Life, Liberty, Security
ND

No content explicitly addressing right to life or security of person.

ND
Article 4 No Slavery
ND

No observable content related to slavery or servitude.

ND
Article 5 No Torture
ND

No content addressing torture or cruel treatment.

ND
Article 6 Legal Personhood
ND

No content addressing legal personhood or capacity.

ND
Article 7 Equality Before Law
ND

No content addressing equal protection or justice.

ND
Article 8 Right to Remedy
ND

No content addressing remedy for rights violations.

ND
Article 9 No Arbitrary Detention
ND

No observable content addressing arbitrary arrest or detention.

ND
Article 10 Fair Hearing
ND

No content directly addressing fair trial or judicial independence.

ND
Article 11 Presumption of Innocence
ND

No content addressing criminal liability or presumption of innocence.

ND
Article 15 Nationality
ND

No content addressing nationality or citizenship rights.

ND
Article 16 Marriage & Family
ND

No content addressing marriage, family, or related rights.

ND
Article 23 Work & Equal Pay
ND

No content directly addressing labor rights, wages, or working conditions.

ND
Article 24 Rest & Leisure
ND

No observable content addressing rest, leisure, or working time.

ND
Article 25 Standard of Living
ND

No content addressing healthcare, food, or living standards.

ND
Article 30 No Destruction of Rights
ND

No content observable that could be interpreted as violating or misapplying other UDHR provisions.

Structural Channel
What the site does
Element Modifier Affects Note
Legal & Terms
Privacy
No privacy policy or data handling disclosure observable on provided content.
Terms of Service
No terms of service or user agreement observable on provided content.
Identity & Mission
Mission +0.20
Article 27
Organization's mission emphasizes interpretability and transparency in AI systems, with open-source code and model weights released publicly, advancing shared scientific understanding.
Editorial Code
No editorial standards or corrections policy observable on provided content.
Ownership
Guide Labs identified as publisher/organization; private entity status not confirmed from provided content.
Access & Distribution
Access Model +0.25
Article 19 Article 27
Model weights available on HuggingFace, code on GitHub, and package on PyPI—all standard open-source distribution channels supporting broad access and participation.
Ad/Tracking
No advertising or tracking mechanisms observable in provided content.
Accessibility +0.15
Article 26
Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility. No alt-text provided for technical visualizations or chart images.
+0.65
Article 19 Freedom of Expression
High Advocacy Framing
Structural
+0.65
Context Modifier
+0.25
SETL
-0.25

Open-source base model with no built-in content filters, no mandatory safety fine-tuning, and open distribution channels structurally maximize expressive capability. Release includes code and weights enabling infinite instantiation.

+0.60
Article 27 Cultural Participation
High Advocacy Framing
Structural
+0.60
Context Modifier
+0.30
SETL
-0.17

Content strongly advocates for participation in scientific advancement and cultural life by releasing interpretable model that enables scientific community to understand AI internals and contribute to knowledge about model behavior.

+0.50
Article 13 Freedom of Movement
Medium Framing Advocacy
Structural
+0.50
Context Modifier
0.00
SETL
-0.16

Open-source distribution through multiple platforms (HuggingFace, GitHub, PyPI) removes geographic barriers to model access and use. No licensing restrictions observable.

+0.50
Article 26 Education
Medium Advocacy Framing
Structural
+0.50
Context Modifier
+0.15
SETL
-0.16

Interactive model explorer provides hands-on educational experience. Open-source code and model support learning and skill development. No paywalls or access restrictions.

+0.40
Article 12 Privacy
Medium Framing
Structural
+0.40
Context Modifier
0.00
SETL
-0.14

Interactive interface enables users to view training data attribution for any generated chunk, supporting privacy awareness and data source transparency.

+0.35
Preamble Preamble
Medium Advocacy Framing
Structural
+0.35
Context Modifier
0.00
SETL
+0.14

Open-source release of model weights, code, and packages on public platforms (HuggingFace, GitHub, PyPI) structurally enables broad participation in scientific understanding and AI development.

+0.35
Article 18 Freedom of Thought
Medium Advocacy
Structural
+0.35
Context Modifier
0.00
SETL
+0.14

Interactive interface allows users to inspect concept attributions and training data sources, structurally supporting scrutiny of model beliefs and reasoning.

+0.30
Article 17 Property
Medium Framing
Structural
+0.30
Context Modifier
0.00
SETL
+0.13

Training data attribution information is provided to all users through the interactive interface, supporting informed engagement with property/data lineage.

+0.30
Article 20 Assembly & Association
Low Framing
Structural
+0.30
Context Modifier
0.00
SETL
-0.12

GitHub code release and open-source model enable collaborative community development and group participation in AI research.

+0.30
Article 21 Political Participation
Medium Advocacy
Structural
+0.30
Context Modifier
0.00
SETL
+0.13

Interactive explorer and promised 'deep dives' invite public evaluation and scrutiny of model capabilities, supporting participatory understanding of AI systems.

+0.25
Article 1 Freedom, Equality, Brotherhood
Medium Framing
Structural
+0.25
Context Modifier
0.00
SETL
+0.12

Interactive explorer with keyboard navigation and semantic HTML supports equal access to the model's reasoning process.

+0.25
Article 14 Asylum
Low Framing
Structural
+0.25
Context Modifier
0.00
SETL
+0.12

No geographic, identity, or status barriers to model access observable.

+0.25
Article 22 Social Security
Low Framing
Structural
+0.25
Context Modifier
0.00
SETL
+0.12

Open-source model and code reduce barriers to participating in AI research, which can enable economic participation without requiring proprietary access.

+0.25
Article 29 Duties to Community
Low Framing
Structural
+0.25
Context Modifier
0.00
SETL
+0.12

Interactive explorer invites user responsibility in examining and understanding model outputs rather than accepting them uncritically.

+0.20
Article 28 Social & International Order
Low Framing
Structural
+0.20
Context Modifier
0.00
SETL
+0.11

Concept steering capability and training data provenance enable interventions to prevent harmful outputs, supporting social stability.

ND
Article 2 Non-Discrimination
ND

No discriminatory design patterns observable in release or access model.

ND
Article 3 Life, Liberty, Security
ND

Content does not directly engage structural security considerations.

ND
Article 4 No Slavery
ND

Not applicable to technical product release.

ND
Article 5 No Torture
ND

Not applicable to AI model release.

ND
Article 6 Legal Personhood
ND

Not applicable to product announcement.

ND
Article 7 Equality Before Law
ND

Not directly engaged in this release announcement.

ND
Article 8 Right to Remedy
ND

Not applicable to product release.

ND
Article 9 No Arbitrary Detention
ND

Not relevant to technical content.

ND
Article 10 Fair Hearing
ND

Not applicable to AI product announcement.

ND
Article 11 Presumption of Innocence

Not relevant to this content type.

ND
Article 15 Nationality

Not applicable to AI model release.

ND
Article 16 Marriage & Family

Not relevant to product announcement.

ND
Article 23 Work & Equal Pay

Not applicable to product release.

ND
Article 24 Rest & Leisure

Not relevant to AI model release.

ND
Article 25 Standard of Living

Not applicable to this content type.

ND
Article 30 No Destruction of Rights

No design patterns that restrict rights protections.

Supplementary Signals
How this content communicates, beyond directional lean.
Epistemic Quality
How well-sourced and evidence-based is this content?
0.68 · medium claims
Sources
0.7
Evidence
0.7
Uncertainty
0.6
Purpose
0.8
Propaganda Flags
2 manipulative rhetoric techniques found
appeal to authority
Reference to academic paper 'Scaling Interpretable Models to 8B' and performance benchmarks against established models (LLaMA2-7B, Deepseek-7B) without full citations or links.
exaggeration
Claim of 'first interpretable language model' at 8B scale when other interpretability approaches exist; framing as unprecedented breakthrough.
Emotional Tone
Emotional character: positive/negative, intensity, authority
celebratory
Valence
+0.7
Arousal
0.6
Dominance
0.6
Transparency
Does the content identify its author and disclose interests?
0.50
✓ Author
More signals: context, framing & audience
Solution Orientation
Does this content offer solutions or only describe problems?
0.75 · solution oriented
Reader Agency
0.8
Stakeholder Voice
Whose perspectives are represented in this content?
0.45 · 2 perspectives
Speaks: institution, individuals
About: researchers, scientific_community
Temporal Framing
Is this content looking backward, at the present, or forward?
present · immediate
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
technical · high jargon · domain-specific
Longitudinal · 7 evals
Audit Trail 27 entries
2026-02-28 14:33 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 14:33 model_divergence Cross-model spread 0.61 exceeds threshold (4 models) - -
2026-02-28 14:33 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech blog post
2026-02-28 14:28 model_divergence Cross-model spread 0.61 exceeds threshold (4 models) - -
2026-02-28 14:28 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 14:28 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
reasoning
Tech blog post
2026-02-26 22:41 eval_success Lite evaluated: Mild positive (0.10) - -
2026-02-26 22:41 eval Evaluated by llama-4-scout-wai: +0.10 (Mild positive)
2026-02-26 20:07 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 20:05 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 20:04 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 20:03 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 20:03 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai - -
2026-02-26 20:03 eval_failure Evaluation failed: Error: Unknown model in registry: llama-4-scout-wai - -
2026-02-26 20:02 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 17:31 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 17:29 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 17:28 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 17:27 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 09:33 eval_success Evaluated: Neutral (0.61) - -
2026-02-26 09:33 eval Evaluated by deepseek-v3.2: +0.61 (Neutral) 14,082 tokens
2026-02-26 08:56 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 08:56 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 08:55 dlq Dead-lettered after 1 attempts: Show HN: Steerling-8B, a language model that can explain any token it generates - -
2026-02-26 04:39 eval Evaluated by claude-haiku-4-5-20251001: +0.46 (Moderate positive) 17,273 tokens +0.04
2026-02-26 04:31 eval Evaluated by claude-haiku-4-5-20251001: +0.43 (Moderate positive) 16,721 tokens -0.04
2026-02-26 04:24 eval Evaluated by claude-haiku-4-5-20251001: +0.46 (Moderate positive) 16,303 tokens
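The model_divergence entries in this trail flag a cross-model spread of 0.61 over 4 models. A minimal sketch of how that figure could be reproduced from the per-model scores recorded above, assuming spread is simply the maximum minus the minimum score (the exact formula is not documented here, and the score values are taken from the eval entries in this trail):

```python
# Per-model scores as recorded in the audit trail above.
scores = {
    "llama-3.3-70b-wai": 0.00,
    "llama-4-scout-wai": 0.10,
    "deepseek-v3.2": 0.61,
    "claude-haiku-4-5": 0.46,
}

# Assumed definition: spread = max score - min score across models.
spread = max(scores.values()) - min(scores.values())
print(f"cross-model spread: {spread:.2f}")  # prints "cross-model spread: 0.61"
```

With this definition, the 0.61 threshold breach is driven entirely by the gap between the llama-3.3-70b-wai score (0.00) and the deepseek-v3.2 score (+0.61).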