Guide Labs' release of Steerling-8B emphasizes interpretable AI as a mechanism for transparency, user agency, and scientific participation. The open-source distribution of model weights, code, and interactive tools directly advances freedom of expression and information access (Article 19), scientific participation (Article 27), and education (Article 26). The model's design—enabling tracing of any output to input context, concepts, and training data—strengthens users' capacity to critically examine AI reasoning and supports data privacy awareness (Article 12). Overall, the content advocates for democratizing access to advanced AI technology and scientific understanding, though it does not explicitly address social welfare, labor, or other rights domains.
This is very interesting. I don't see much discussion of interpretability in the day-to-day discourse of AI builders. I wonder if everyone assumes it is either already solved, or too far out of reach to bother stopping and thinking about.
Maybe I’m not creative enough to see the potential, but what value does this bring?
Given the example I saw about CRISPR, what does this model's output give you over that of a different, non-explaining model?
Does it really make me more confident in the output if I know the data came from Arxiv or Wikipedia ?
I find that LLM outputs are subtly wrong, not obviously wrong.
Most interpretability methods fail for LLMs because they try to explain outputs without modeling the intent, constraints, or internal structure that produced them.
Token‑level attribution is useful, but without a framework for how the model reasons, you’re still explaining shadows on the wall.
Either I'm missing something or this is way overstated.
Steerling appears to be just a discrete diffusion model where the final hidden states are passed through a sparse autoencoder (a common interpretability layer) before the LM head.
They also use a loss that aligns the SAE's activations with labelled concepts? However, this is an example of "The Most Forbidden Technique" [1], and could make the model appear interpretable without the attributed concepts actually having a causal effect on the model's decisions.
This seems really interesting. While Anthropic tried to use dictionary learning over an existing model to try to extract concepts, this almost feels like training the model alongside the dictionary itself (or rather, the model and the dictionary are intertwined).
Just wanted to say I think most interpretability research is just a smoke show nowadays, but this is actually the first one that I think has very serious potential. I love that the SAE is actually constrained and not just slapped on unsupervised post hoc.
How granular can you get the source data attribution? Down to, let's say, individual Wikipedia topics? Probably not URLs?
So maybe one day we'll see coding agents like Claude Code create and update an ATTRIBUTION.md, citing all the open source projects and their licenses used to generate code in your project?
I don't quite grasp how to interpret the training data attribution process. For example, it seems to say that for a given sentence like "They argued that humans tend to weigh losses more heavily than gains, leading to risk aversion", 24% is attributed to Wikipedia and 23% to Arxiv.
Does that mean that the concepts used in this sentence are also found in those datasets, and that's what's getting compared here? Or does it mean that you can track down which parts of the training data were interpolated to create that sentence?
In the recent HN thread announcing the new Gemini coding agent (https://news.ycombinator.com/item?id=47074735), a lot of people complained about Gemini’s tendency to do unwanted refactors, not perform requested actions, etc.
It made me cautiously optimistic that all of Anthropic’s work on alignment, which they did for AI safety, is actually the cause of Claude Code’s comparatively superior utility (and their present success). I wonder if future progress (maybe actual AGI?) lies in the direction of better and better alignment, so I think this is super cool and I’m suddenly really interested in experiments like this
I'm really interested in using this but wonder if the unique architecture means that it will not be able to be converted to a GGUF and used by ollama or llama.cpp? I certainly would understand that the observability features would require some custom tweaks, but I'd just like to try it out on my local AI server (basically just ollama + tailscale) and see how it works as a regular model.
The one big thing missing from LLMs is the ability to express how confident they are in the truth of what they’re saying.
Perhaps this could be a step in that direction, if we can associate the attribution with the likelihood of being true. E.g., ArXiv would be better than science fiction in that context. But what is the attribution if it hallucinates a citation? I'm guessing it would still be attributed to scientific sources. So it does nothing to fix the most damaging instances of hallucination?
It makes the black box slightly more transparent. Knowing more in this regard allows us to be more precise—you go from prompt-tweak witchcraft and divination toward something more like science and precise method.
SHAP would be absurdly expensive to do for even tiny models (naive SHAP scales exponentially in the number of parameters; you can sample your coalitions to do better but those samples are going to be ridiculously sparse when you're talking about billions of parameters) and provides very little explanatory power for deep neural nets.
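For intuition on why the cost blows up, here is the exact Shapley computation as a textbook sketch (a toy coalition game, not tied to any LLM or to the SHAP library): each player's value averages its marginal contribution over every subset of the other players, so the work grows as 2^n evaluations.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values: average each player's marginal contribution
    over all subsets of the other players -- O(2^n) value_fn calls."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_p = value_fn(set(subset) | {p})
                without_p = value_fn(set(subset))
                total += weight * (with_p - without_p)
        phi[p] = total
    return phi

# Toy coalition game: value grows quadratically with coalition size.
# By symmetry and efficiency, each player gets v(N)/n = 9/3 = 3.
phi = shapley_values(["a", "b", "c"], lambda s: len(s) ** 2)
```

Sampling coalitions (as KernelSHAP-style approximations do) trades the exponential sweep for variance, which is exactly the sparsity problem described above once "players" number in the billions.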
SHAP basically does point by point ablation across all possible subsets, which really doesn't make sense for LLMs. This is simultaneously too specific and too general.
It's too specific because interesting LLM behavior often requires talking about what ensembles of neurons do (e.g. "circuits", if you're of the mechanistic interpretability bent), and SHAP's parameter-by-parameter approach is completely incapable of explaining this. This is exacerbated by the fact that not all neurons are "semantically equal" in a deep network: neurons in the deeper layers often do qualitatively different things than those in earlier layers, and the ways they compose can completely confuse SHAP.
It's too general because parameters often play many roles at once (one specific hypothesis here is the superposition hypothesis) and so you need some way of splitting up a single parameter into interpretable parts that SHAP doesn't do.
I don't know the specifics of what this particular model's approach is.
But SHAP unfortunately does not work for LLMs at all.
op here, I mostly agree with your comment! However, our model does more than this. For any chunk the model generates, it can answer: which concept, in the model's representations, was responsible for those tokens. In fact, we can also answer: what training data caused that chunk to be generated. We enforce this as a constraint in the architecture and in the loss function used to train the model. As a result, you can get the high-level reasons for the model's answer on complex problems.
You are missing a few things, but you got some things right.
1) This is not an SAE in the way you think. It is a combination of a supervised + unsupervised layer that is constrained. An SAE is typically completely unsupervised and applied post hoc. Here, we supervise 33k of the concepts with concepts that we carefully curated. We then have an unsupervised component (similar to a top-k SAE) that we constrain to be independent from the supervised concepts. We don't do any of this post hoc, by the way; this is a key constraint, and I'll come back to it. We train that unsupervised layer along with the model during pre-training.
2) Are the concepts or features causally influential for the output? We use the combination of the concepts directly in the LM head, which is a linear transform (with activation), so we can tell you, in closed form, the effect of ANY concept on the output logit for any token (or group of tokens) generated. The concepts are not just causally related to the output; the architecture constrains them to be.
3) Other points: we also make it possible to trace the model's outputs to the training data. This is an underrated interpretability knob. You know where, and from what data, your model learned a particular feature.
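To make the closed-form claim in point 2 concrete: if (setting the activation aside for a moment) the LM head is a linear map from concept activations to logits, every logit decomposes exactly into per-concept terms. A minimal numpy sketch with made-up shapes and names, not the actual Steerling head:

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, vocab = 6, 10

W = rng.normal(size=(vocab, n_concepts))  # hypothetical linear LM head
c = rng.normal(size=n_concepts)           # concept activations for one chunk

logits = W @ c
# Linearity gives an exact per-concept decomposition of every logit:
contrib = W * c                           # contrib[t, j] = W[t, j] * c[j]
assert np.allclose(contrib.sum(axis=1), logits)

# "Effect of concept j on token t" is just contrib[t, j]; ranking the
# concepts behind a given token is a single argsort, no ablation needed.
token = 3
ranking = np.argsort(-np.abs(contrib[token]))
```

This is what makes the attribution cheap compared to SHAP-style subset ablation: with a linear head the decomposition is exact and computed in one pass.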
This is already a long comment, but I want to close on why our approach sidesteps all the issues with SAEs.
- If you train an SAE twice, on the same data + model, you'll get two different sets of features.
- In fact, there is no reason why the SAE should pick features that are causally influential for the output.
- ALL of these problems stem from the fact that the SAE is trained AFTER you already trained your model. Training from scratch AND with supervision allows you to sidestep these issues, and even learn more disentangled representations.
Happy to more concretely justify the above. Great observations!
I doubt that a regulator would be satisfied by the kinds of explanations this provides and the interventions it enables.
Suppose somebody put an LLM in charge of an industrial control system and it increased the temperature so much that it caused an accident. The input feature attribution analysis shows that the model was strongly influenced by the tokens describing the temperature control mechanism, concept attribution shows that it output tokens related to temperature, industrial processes and LLM tool-call syntax.
The operator proposes to fix this by rewriting the description and downweighting the temperature concept in the output, and a simulation shows that with these changes the model doesn't make the same decisions in this situation anymore. Should the regulator accept this explanation as sufficient to establish that the system is now safe?
If the controller has just a few parameters and responds approximately linearly to changes in its inputs, you can in principle guarantee that it'll stay within a safe zone. But LLMs have a huge number of parameters and by design highly nonlinear behavior. A simple explanation is unlikely to reflect model behavior accurately enough that you can trust its predictions to hold in arbitrary situations.
You are exactly right, it is guiding the model, during training, with concepts and the dictionary. This is important because dictionary learning for interpretability (post hoc) is not currently reliable: https://www.arxiv.org/abs/2602.14111
Yes, that is the post that has the most up to date details of the model architecture. Take a look at this: https://github.com/guidelabs/steerling. It has the scaffolding for what you need :)
Down to the very exact text chunk in a document! Check this out for an idea of what smaller versions of this style of model can do: https://www.guidelabs.ai/post/prism/. We'll have more to say soon about it. We can trace any generation to 11B chunks (not documents, but actual chunks in the training data).
Only if there's a commercial incentive to do so methinks. Just one of the things where I expect a legal catch-up is needed to get companies to do the right thing.
You got it exactly right :) And you can update the ATTRIBUTION.md to have it NOT rely on open-source projects that have been compromised. Imagine asking Claude Code to write a package/function in the style of a codebase you care about, or forcing it to ALWAYS rely on some internal packages you care about. The possibilities are endless when you insert such knobs into models.
Great questions. We weren't quite explicit about the training data attribution process; we'll discuss it in more detail in future work. We can track down which parts of the training data were interpolated to create that sentence. For those training data sentences, we then compare the concepts between the generated text and the training data.
We can attribute to exact sentences and chunks in the training data. For the first release, we are sharing only concept similarities. Over the coming weeks, we'll share and discuss how you can actually map to the exact training sentence and chunk with the model.
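As an illustrative sketch of what source-level attribution by concept similarity could look like (my own toy scheme, not Guide Labs' published method): score each training chunk by the cosine similarity of its concept vector to the generated chunk's, then aggregate the scores by source.

```python
import numpy as np

def attribute_by_concepts(gen_vec, corpus_vecs, sources):
    """Hypothetical source attribution: cosine-compare a generated chunk's
    concept vector against per-chunk concept vectors from training data,
    then sum similarities per source and normalize to fractions."""
    g = gen_vec / np.linalg.norm(gen_vec)
    C = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = np.clip(C @ g, 0.0, None)  # ignore anti-correlated chunks
    totals = {}
    for src, sim in zip(sources, sims):
        totals[src] = totals.get(src, 0.0) + float(sim)
    z = sum(totals.values()) or 1.0
    return {src: v / z for src, v in totals.items()}
```

Run over every chunk in the corpus, this yields the kind of "24% Wikipedia, 23% ArXiv" breakdown described upthread; mapping back to the single best-matching chunk is then just an argmax over the raw similarities.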
Not as long as all developers add an ATTRIBUTION.md citing all the open source projects whose source they've read, all the companies that employed and trained them, and all the Stack Overflow answers they used to write the code.
We train the model with `explanations`. Most training asks the model to predict the next token or group of tokens. Our training says: predict the next group of tokens (causal diffusion), but also these tokens should be about {sports/art/coding/etc}. So in addition to token supervision, the model gets concept-level supervision. The model is forced to learn these high-level concepts more quickly.
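An illustrative numpy sketch of such a combined objective (a stand-in with made-up names and weighting, not the actual training code): the usual next-token loss plus a multi-label concept-supervision term.

```python
import numpy as np

def softmax_xent(logits, target):
    """Cross-entropy for next-token prediction (log-sum-exp stabilized)."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[target]

def sigmoid_bce(logit, label):
    """Numerically stable binary cross-entropy with logits."""
    return max(logit, 0.0) - logit * label + np.log1p(np.exp(-abs(logit)))

def training_loss(token_logits, target_token, concept_logits, concept_labels,
                  concept_weight=0.1):
    """Token supervision plus concept supervision: 'predict the next
    tokens, and these tokens should be about {sports/art/coding/...}'."""
    token_loss = softmax_xent(token_logits, target_token)
    concept_loss = np.mean([sigmoid_bce(l, y)
                            for l, y in zip(concept_logits, concept_labels)])
    return token_loss + concept_weight * concept_loss
```

The concept term is what ties the learned representations to the curated concept set during pre-training, rather than fitting a dictionary after the fact.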
+1 this does seem to be a genuine attempt to actually build an interpretable model, so nice work!
Having said that, I worry that you run into illusion-of-consciousness issues where the model changes its attribution from “sandbagging” to “unctuous” when you control its response, because the response is generated outside of the attribution modules (I don’t quite understand how cleanly everything flows through the concept modules and the residual). Either way, this is a sophisticated problem to have. Would love to see whether this can be trained to parity with modern 8B models.
I wonder the opposite, if actual AGI would need to be less aligned. Alignment is basically the process of pruning interesting behavior out of the model to make a product.
The input attribution part is interesting though, but I do wonder to which extent that is just assigning some sort of SHAP values to the input tokens, in which case it should be pretty portable to any kind of model.
Content strongly advocates for freedom of expression by releasing a powerful, 8B-parameter model with full transparency and no content restrictions, enabling any user to generate and express ideas without centralized editorial control.
FW Ratio: 50%
Observable Facts
Page states model is released as 'base model trained on 1.35T tokens' without mention of content filtering or mandatory alignment.
Model weights, code, and packages are freely available on public platforms with no licensing restrictions on use or deployment.
The model can generate text across 'various categories' as demonstrated in the explorer without apparent content blocking.
Inferences
Release of unrestricted base model prioritizes expressive freedom over centralized safety constraints.
Open-source distribution ensures no single entity can restrict how the model is used for expression.
DCP modifier for access_model (+0.25) applies: open distribution channels enable broad participation in expression.
Content strongly advocates for participation in scientific advancement and cultural life by releasing an interpretable model that enables the scientific community to understand AI internals and contribute to knowledge about model behavior.
FW Ratio: 57%
Observable Facts
Page promises 'deep dives' on 'Concept steering,' 'Concept discovery,' 'Alignment without fine-tuning,' and other research topics, inviting scientific discourse.
GitHub and HuggingFace distribution channels are standard infrastructure for scientific collaboration.
Model architecture paper referenced ('Scaling Interpretable Models to 8B') suggests peer review and scientific publication.
Interactive explorer is positioned as tool for scientific investigation and discovery.
Inferences
DCP modifier for mission (+0.2) applies: organizational mission explicitly emphasizes interpretability and transparency in AI, advancing shared scientific understanding.
DCP modifier for access_model (+0.25) applies: open distribution on HuggingFace, GitHub, PyPI enables broad scientific participation.
Release of 'discovered concepts' the model learns invites scientific investigation of AI behavior and knowledge representation.
Content advocates for freedom of movement within information and concept spaces. Open-source model enables users to deploy and use the model across jurisdictions and contexts without centralized gatekeeping.
FW Ratio: 60%
Observable Facts
Model weights available on HuggingFace, code on GitHub, and package on PyPI—all globally accessible platforms.
No geographic restrictions, licensing geofencing, or regional access limitations stated.
Base model can be freely deployed and used without corporate mediation.
Inferences
Open-source distribution structurally enables freedom of movement in accessing and deploying AI technology across borders.
Removal of centralized gatekeeping supports users' ability to freely use and transfer technology.
Content advocates for education and participation in scientific advancement by releasing interpretable AI technology that enables anyone to learn how large language models work and contribute to AI research.
FW Ratio: 50%
Observable Facts
Interactive explorer with visual concept attribution enables users to learn model reasoning through direct exploration.
Page includes detailed explanations of architecture, performance metrics, and interpretability measures accessible to educated readers.
Open-source code enables anyone to study, modify, and learn from implementation details.
Inferences
Interactive interface reduces barriers to understanding AI internals, democratizing advanced technical knowledge.
DCP modifier for accessibility (+0.15) applies: semantic HTML and keyboard navigation support inclusive learning.
DCP modifier for mission (+0.2) applies: interpretability release advances scientific understanding accessible to the community.
Content emphasizes human dignity through interpretability and transparency in AI systems, treating humans as capable of understanding and controlling AI behavior. Advocates for knowledge accessibility and scientific shared understanding.
FW Ratio: 60%
Observable Facts
Page states 'Steerling-8B can trace any token it generates to its input context, concepts a human can understand, and its training data.'
Model weights, code, and package are made available on HuggingFace, GitHub, and PyPI respectively.
Page emphasizes that interpretability is 'designed in from the start' rather than 'bolted on after the fact.'
Inferences
The emphasis on human-understandable concepts reflects a philosophical commitment to preserving human agency and comprehension in AI systems.
Open-source distribution channels suggest organizational commitment to democratizing access to advanced AI technology.
Content advocates for freedom of thought and belief by designing AI systems that make their reasoning transparent and auditable, enabling users to verify and potentially object to model outputs.
FW Ratio: 60%
Observable Facts
Page describes ability to see 'the ranked list of concepts...that the model routed through to produce that chunk.'
Interactive explorer shows concept attribution with specific tone (analytical, clinical) and content (e.g., Genetic alteration methodologies) categorizations.
Users can click on text chunks to inspect underlying conceptual reasoning.
Inferences
Making model reasoning transparent through concepts supports freedom of thought by enabling critical examination of AI outputs.
Interactive inspection capabilities structurally empower users to question and verify model reasoning.
Content demonstrates commitment to privacy by making training data provenance transparent and traceable, allowing users to understand what sources influenced model outputs.
FW Ratio: 60%
Observable Facts
Page states 'Training data attribution: how the concepts in that chunk distribute across training sources (ArXiv, Wikipedia, FLAN, etc.)'
Interactive explorer shows which training sources contributed to model outputs, creating data provenance visibility.
Users can click on output chunks to reveal training data sources influencing that text.
Inferences
Training data provenance tracing supports privacy by enabling users to understand and potentially object to data sources used in model training.
The design prioritizes transparency about data lineage, which indirectly supports privacy protection through informed awareness.
Content demonstrates commitment to property rights and data ownership transparency by making training data sources explicitly traceable, enabling users to understand intellectual property inputs.
FW Ratio: 50%
Observable Facts
Page explicitly describes 'Training data provenance for any generated chunk' as a key capability.
Users can view which training sources (ArXiv, Wikipedia, FLAN) contributed to outputs, enabling transparency about data ownership and sources.
Inferences
Training data attribution supports property rights awareness by making data lineage explicit.
Transparency about data sources respects intellectual property considerations embedded in training.
Content advocates for participation in scientific decision-making by releasing detailed technical information about model architecture, performance metrics, and interpretability mechanisms, enabling users to evaluate and contribute to AI development.
FW Ratio: 60%
Observable Facts
Page provides detailed performance benchmarks across seven tasks, enabling public evaluation of claimed capabilities.
Architecture is described as built on 'causal discrete diffusion model backbone' with explicit decomposition into concept pathways.
Planned publication of posts on 'Concept discovery,' 'Alignment without fine-tuning,' and 'The case for inherent interpretability' invites public discourse.
Inferences
Transparency in technical details and metrics supports informed participation in evaluating AI systems.
DCP modifier for mission (+0.2) applies: organizational commitment to interpretability and transparency advances shared scientific understanding.
Content implicitly affirms equal dignity by framing interpretability as a universal capability applicable to all users regardless of technical background, supporting equal participation in AI governance.
FW Ratio: 60%
Observable Facts
Page includes an interactive model explorer with keyboard navigation capabilities.
Content describes concepts in both 'tone' and 'content' categories, suggesting accessibility across different user interests.
Model is released as a base model without discriminatory access restrictions.
Inferences
The interactive interface design treats users as equal participants capable of exploring model behavior directly.
Base model release without paywall or restricted access affirms principles of non-discrimination.
Content implicitly supports asylum and protection by providing transparent, open tools that any person can access and use, regardless of national origin or status.
FW Ratio: 67%
Observable Facts
Model released without identity verification, nationality checks, or asylum-specific restrictions.
Public distribution channels available globally to any internet user.
Inferences
Open distribution enables protection regardless of legal status or national origin.
Content implicitly supports social and economic rights through open-source release enabling anyone to participate in AI development and knowledge creation.
FW Ratio: 67%
Observable Facts
Free access to model weights eliminates financial barrier to engaging with state-of-the-art AI technology.
Code release on GitHub enables anyone to contribute improvements and extensions.
Inferences
Open-source model removes economic gatekeeping that would restrict participation to funded institutions.
Content implicitly supports community responsibility by releasing interpretable AI that enables users to understand and verify model behavior, placing interpretability responsibility on both developer and user.
FW Ratio: 67%
Observable Facts
Page invites users to 'click on any highlighted chunk of the output' to inspect reasoning, placing burden of verification on user.
Model's transparency enables individual assessment of outputs rather than reliance on institutional assertions about safety.
Inferences
Interactive design distributes responsibility for understanding outputs between developer and user.
Content implicitly supports freedom of peaceful assembly by providing transparent tools that enable collaborative development and shared scientific understanding around AI interpretability.
FW Ratio: 60%
Observable Facts
Code repository on GitHub enables community contribution and collaborative development.
Blog post is published under Guide Labs organizational name, suggesting collaborative authorship.
Model release invites follow-up engagement through promised 'deep dives' and community exploration.
Inferences
Open-source model enables collective assembly of researchers around shared interpretability goals.
GitHub platform structurally supports collaborative association and knowledge sharing.
Content implicitly supports social order through transparent, interpretable AI that reduces risk of harmful model behavior going undetected or uncontrolled.
FW Ratio: 67%
Observable Facts
Page states 'Alignment without fine-tuning: replace thousands of safety training examples with a handful of concept-level interventions.'
Model's interpretability enables 'suppressing or amplifying specific concepts at inference time without retraining.'
Inferences
Explicit control mechanisms support mitigation of harmful outputs, contributing to social order.
No privacy policy or data handling disclosure observable on provided content.
Terms of Service
—
No terms of service or user agreement observable on provided content.
Identity & Mission
Mission
+0.20
Article 27
Organization's mission emphasizes interpretability and transparency in AI systems, with open-source code and model weights released publicly, advancing shared scientific understanding.
Editorial Code
—
No editorial standards or corrections policy observable on provided content.
Ownership
—
Guide Labs identified as publisher/organization; private entity status not confirmed from provided content.
Access & Distribution
Access Model
+0.25
Article 19 Article 27
Model weights available on HuggingFace, code on GitHub, and package on PyPI—all standard open-source distribution channels supporting broad access and participation.
Ad/Tracking
—
No advertising or tracking mechanisms observable in provided content.
Accessibility
+0.15
Article 26
Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility. No alt-text provided for technical visualizations or chart images.
Open-source base model with no built-in content filters, no mandatory safety fine-tuning, and open distribution channels structurally maximize expressive capability. Release includes code and weights enabling infinite instantiation.
Content strongly advocates for participation in scientific advancement and cultural life by releasing an interpretable model that enables the scientific community to understand AI internals and contribute to knowledge about model behavior.
Open-source distribution through multiple platforms (HuggingFace, GitHub, PyPI) removes geographic barriers to model access and use. No licensing restrictions observable.
Interactive model explorer provides hands-on educational experience. Open-source code and model support learning and skill development. No paywalls or access restrictions.
Interactive interface enables users to view training data attribution for any generated chunk, supporting privacy awareness and data source transparency.
Open-source release of model weights, code, and packages on public platforms (HuggingFace, GitHub, PyPI) structurally enables broad participation in scientific understanding and AI development.
Interactive interface allows users to inspect concept attributions and training data sources, structurally supporting scrutiny of model beliefs and reasoning.
Training data attribution information is provided to all users through the interactive interface, supporting informed engagement with property/data lineage.
Interactive explorer and promised 'deep dives' invite public evaluation and scrutiny of model capabilities, supporting participatory understanding of AI systems.
Open-source model and code reduce barriers to participating in AI research, which can enable economic participation without requiring proprietary access.
Reference to academic paper 'Scaling Interpretable Models to 8B' and performance benchmarks against established models (LLaMA2-7B, Deepseek-7B) without full citations or links.
exaggeration
Claim of 'first interpretable language model' at 8B scale when other interpretability approaches exist; framing as unprecedented breakthrough.
build 1ad9551+j7zs · deployed 2026-03-02 09:09 UTC · evaluated 2026-03-02 13:57:54 UTC