+0.29 Steering interpretable language models with concept algebra (www.guidelabs.ai, S: +0.52)
75 points by luulinh90s 4 days ago | 8 comments on HN | Moderate positive Contested Editorial · v3.7 · 2026-02-28 09:54:05 0
Summary AI Transparency & Scientific Access Advocates
This research announcement describes Steerling-8B, an interpretable language model enabling direct concept-level control at inference time. The work engages primarily with rights to freedom of expression (Article 19), education (Article 26), and participation in scientific/cultural life (Article 27) through open-source distribution of model weights, code, and comprehensive technical documentation. The content advocates for democratizing AI transparency and development capabilities, emphasizing that enabling reliable AI control requires architectural design that makes concepts human-understandable and globally accessible.
Article Heatmap
Preamble: +0.35 (Preamble)
Article 1: +0.15 (Freedom, Equality, Brotherhood)
Article 2: ND (Non-Discrimination)
Article 3: ND (Life, Liberty, Security)
Article 4: ND (No Slavery)
Article 5: ND (No Torture)
Article 6: +0.10 (Legal Personhood)
Article 7: ND (Equality Before Law)
Article 8: ND (Right to Remedy)
Article 9: ND (No Arbitrary Detention)
Article 10: ND (Fair Hearing)
Article 11: ND (Presumption of Innocence)
Article 12: +0.05 (Privacy)
Article 13: ND (Freedom of Movement)
Article 14: ND (Asylum)
Article 15: ND (Nationality)
Article 16: ND (Marriage & Family)
Article 17: ND (Property)
Article 18: +0.35 (Freedom of Thought)
Article 19: +0.59 (Freedom of Expression)
Article 20: ND (Assembly & Association)
Article 21: ND (Political Participation)
Article 22: ND (Social Security)
Article 23: +0.10 (Work & Equal Pay)
Article 24: ND (Rest & Leisure)
Article 25: ND (Standard of Living)
Article 26: +0.38 (Education)
Article 27: +0.58 (Cultural Participation)
Article 28: +0.20 (Social & International Order)
Article 29: +0.35 (Duties to Community)
Article 30: ND (No Destruction of Rights)
Aggregates
Editorial Mean: +0.29
Structural Mean: +0.52
Weighted Mean: +0.35
Unweighted Mean: +0.29
Max: +0.59 (Article 19)
Min: +0.05 (Article 12)
Signal: 11
No Data: 20
Volatility: 0.18 (Medium)
Negative: 0
Channels: E 0.6, S 0.4
SETL: +0.02 (Editorial-dominant)
FW Ratio: 55% (23 facts · 19 inferences)
Evidence: 20% coverage
3H 4M 4L 20 ND
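Several of the aggregates above can be checked against the per-article scores listed on this page. The sketch below is a consistency check, not the site's scoring code: the lists are copied from the Editorial Channel, Structural Channel, and Article Heatmap sections, and treating Volatility as a population standard deviation is an assumption that happens to match the reported 0.18. The Weighted Mean (+0.35) is not reproduced because the page does not state its weights.

```python
from statistics import mean, pstdev

# Scores copied from this page (assumed to be the inputs to the aggregates).
editorial = [0.60, 0.55, 0.40, 0.35, 0.35, 0.35, 0.20, 0.15, 0.10, 0.10, 0.05]
structural = [0.65, 0.55, 0.35]
heatmap = [0.35, 0.15, 0.10, 0.05, 0.35, 0.59, 0.10, 0.38, 0.58, 0.20, 0.35]

TOTAL_PROVISIONS = 31  # Preamble plus Articles 1-30

summary = {
    "editorial_mean": round(mean(editorial), 2),
    "structural_mean": round(mean(structural), 2),
    "unweighted_mean": round(mean(heatmap), 2),
    "max": max(heatmap),
    "min": min(heatmap),
    "signal": len(heatmap),
    "no_data": TOTAL_PROVISIONS - len(heatmap),
    # Assumption: volatility is the population standard deviation of the scores.
    "volatility": round(pstdev(heatmap), 2),
    # FW Ratio read as facts / (facts + inferences), in percent.
    "fw_ratio_pct": round(100 * 23 / (23 + 19)),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Under these assumptions every recomputed value agrees with the figure reported above.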
Theme Radar
Foundation: 0.25 (2 articles)
Security: 0.00 (0 articles)
Legal: 0.10 (1 article)
Privacy & Movement: 0.05 (1 article)
Personal: 0.35 (1 article)
Expression: 0.59 (1 article)
Economic & Social: 0.10 (1 article)
Cultural: 0.48 (2 articles)
Order & Duties: 0.28 (2 articles)
HN Discussion 2 top-level · 3 replies
giang_at_glai 2026-02-26 06:11 UTC link
Author here.

This post shows “concept algebra” on a language model: inject, suppress, and compose human-understandable concepts at inference time (no retraining, no prompt engineering).

There’s an interactive demo on the post.

Would love feedback on: (1) what steering tasks you’d benchmark, (2) failure cases you’d want to see, (3) whether this kind of compositional control is useful in real products.

Related: https://news.ycombinator.com/item?id=47131225

AIorNot 2026-02-27 07:12 UTC link
How good would this steering be for function calling as part of an agent, to keep the agent on task or to act as a guardrail?
anon291 2026-02-26 20:44 UTC link
I would personally like some quantification of how good this is compared to just replacing the system prompt of an off-the-shelf 8B parameter language model.

The suppression bit is very powerful. I would like to see a quantification of how often a steered 'normal' language model will mention things you asked it to suppress vs. how often this one does.

didgeoridoo 2026-02-27 03:56 UTC link
Hi! Have you published the concept dictionary yet? I’m looking into using Steerling to investigate how different moral scenarios elicit various responses in LLMs (using Haidt MFT concepts mostly), and my first few inference runs have been hamstrung by not having a canonical mapping of concepts to IDs. Thanks!
luulinh90s 2026-02-27 08:17 UTC link
We haven’t benchmarked our steering for scaffolding function-calling in an agent loop yet (and the model we are using is just a base model), so I can’t give a quantitative claim. But concept-based steering should be a good fit for keeping the agent on task and enforcing behavioral guardrails around tool use.

In practice, you can treat concepts as soft/hard constraints to bias the agent toward: (1) calling tools only when needed, (2) selecting the right tool/function, or (3) using the correct argument schema.
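
The soft/hard distinction in the reply above can be illustrated with a toy sketch. This is not the Steerling API: the concept names, the `steer` and `tool_scores` helpers, and the linear readout are all hypothetical, chosen only to show a soft constraint as an additive bias on a concept activation and a hard constraint as pinning the activation to a fixed value.

```python
from typing import Dict, Optional

Concepts = Dict[str, float]

def steer(activations: Concepts,
          soft: Optional[Concepts] = None,
          hard: Optional[Concepts] = None) -> Concepts:
    """Return a steered copy of the concept activations (hypothetical helper).

    soft: concept -> bias added to the current activation.
    hard: concept -> value the activation is pinned to (applied last).
    """
    steered = dict(activations)
    for name, bias in (soft or {}).items():
        steered[name] = steered.get(name, 0.0) + bias
    for name, value in (hard or {}).items():
        steered[name] = value
    return steered

def tool_scores(activations: Concepts,
                readout: Dict[str, Concepts]) -> Dict[str, float]:
    """Score each tool as a weighted sum of concept activations (toy readout)."""
    return {
        tool: sum(w * activations.get(c, 0.0) for c, w in weights.items())
        for tool, weights in readout.items()
    }

# Made-up activations and readout weights, for illustration only.
activations = {"needs_lookup": 0.25, "small_talk": 0.75}
readout = {
    "search_tool": {"needs_lookup": 1.0},
    "no_tool": {"small_talk": 1.0},
}

# Compose two edits: softly inject one concept, hard-suppress another.
steered = steer(activations,
                soft={"needs_lookup": 0.5},
                hard={"small_talk": 0.0})
scores = tool_scores(steered, readout)
chosen = max(scores, key=scores.get)
print(steered, scores, chosen)
```

Applying hard constraints after soft ones means a pinned concept cannot be nudged back by a bias, which is one plausible reading of "hard" in an agent guardrail setting.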

Editorial Channel
What the content says
+0.60
Article 27 Cultural Participation
High Practice Framing
Editorial
+0.60
SETL
+0.17

The post emphasizes participation in scientific understanding: advancing 'shared scientific understanding' through transparent research design. Complete methodology and evaluation framework disclosed, enabling global scientific community validation and extension. DCP mission modifier (+0.2) applies—organization emphasizes 'interpretability and transparency in AI systems' and 'advancing shared scientific understanding.'

+0.55
Article 19 Freedom of Expression
High Practice Advocacy
Editorial
+0.55
SETL
-0.25

The post advocates for open-source distribution and transparency as superior to closed alternatives ('fundamentally different from prompt engineering, RLHF, or post-hoc methods'). Strong editorial emphasis on enabling freedom of expression through shared technical knowledge.

+0.40
Article 26 Education
High Coverage Practice
Editorial
+0.40
SETL
+0.14

The post provides detailed technical documentation including architecture explanation ('an architectural bottleneck that forces every prediction through human-interpretable concepts'), methodology, code examples, and quantitative evaluation framework. Comprehensive educational content enables learning about AI interpretability and model design.

+0.35
Preamble Preamble
Medium Advocacy Framing
Editorial
+0.35
SETL
ND

The post emphasizes 'human-understandable concepts' and rational, transparent design as the foundation for AI systems, aligning with UDHR Preamble's appeal to 'reason and conscience.' The work promotes human dignity by making AI systems interpretable and controllable rather than opaque.

+0.35
Article 18 Freedom of Thought
Medium Advocacy Practice
Editorial
+0.35
SETL
ND

The post describes enabling users to 'add, remove, and compose human-understandable concepts' to control AI behavior, supporting freedom of thought and conscience. Users can suppress unwanted model outputs, protecting intellectual freedom from unwanted AI responses.

+0.35
Article 29 Duties to Community
Medium Practice Advocacy
Editorial
+0.35
SETL
ND

The DCP notes the organization's mission emphasizes 'advancing shared scientific understanding.' The work fulfills community duty by freely publishing research and code rather than restricting knowledge to proprietary advantage.

+0.20
Article 28 Social & International Order
Medium Advocacy
Editorial
+0.20
SETL
ND

The post describes steering applications for safety-critical domains: 'content moderation that must suppress toxicity yet preserve fluency' and 'health assistant that needs to provide medical guidance.' Interpretable, controllable AI contributes to just and safe systems.

+0.15
Article 1 Freedom, Equality, Brotherhood
Low Advocacy
Editorial
+0.15
SETL
ND

The post describes enabling 'reliable, composable, fine-grained control' available to any user, implicitly suggesting equal access to AI control mechanisms. Open-source distribution supports equality of access.

+0.10
Article 6 Legal Personhood
Low Framing
Editorial
+0.10
SETL
ND

The post explains how the concept module makes AI decision-making 'human-interpretable,' relating to recognizing entities worthy of understanding. Making internal AI processes externally visible relates distantly to recognition.

+0.10
Article 23 Work & Equal Pay
Low Advocacy
Editorial
+0.10
SETL
ND

The page advertises research positions (Careers link) and open-source contribution opportunities. Open-source model enables distributed research participation beyond formal employment.

+0.05
Article 12 Privacy
Low Practice
Editorial
+0.05
SETL
ND

The post focuses on accessing and modifying internal AI representations for human benefit. This differs from human privacy protection; it concerns AI transparency rather than human data privacy. Mildly positive as it emphasizes user agency.

ND
Article 2 Non-Discrimination

Not addressed in this content

ND
Article 3 Life, Liberty, Security

Not addressed in this content

ND
Article 4 No Slavery

Not addressed in this content

ND
Article 5 No Torture

Not addressed in this content

ND
Article 7 Equality Before Law

Not addressed in this content

ND
Article 8 Right to Remedy

Not addressed in this content

ND
Article 9 No Arbitrary Detention

Not addressed in this content

ND
Article 10 Fair Hearing

Not addressed in this content

ND
Article 11 Presumption of Innocence

Not addressed in this content

ND
Article 13 Freedom of Movement

Not addressed in this content

ND
Article 14 Asylum

Not addressed in this content

ND
Article 15 Nationality

Not addressed in this content

ND
Article 16 Marriage & Family

Not addressed in this content

ND
Article 17 Property

Not addressed in this content

ND
Article 20 Assembly & Association

Not addressed in this content

ND
Article 21 Political Participation

Not addressed in this content

ND
Article 22 Social Security

Not addressed in this content

ND
Article 24 Rest & Leisure

Not addressed in this content

ND
Article 25 Standard of Living

Not addressed in this content

ND
Article 30 No Destruction of Rights

Not addressed in this content

Structural Channel
What the site does
Element | Modifier | Affects | Note
Legal & Terms
Privacy | - | - | No privacy policy or data handling disclosure observable on provided content.
Terms of Service | - | - | No terms of service or user agreement observable on provided content.
Identity & Mission
Mission | +0.20 | Article 27 | Organization's mission emphasizes interpretability and transparency in AI systems, with open-source code and model weights released publicly, advancing shared scientific understanding.
Editorial Code | - | - | No editorial standards or corrections policy observable on provided content.
Ownership | - | - | Guide Labs identified as publisher/organization; private entity status not confirmed from provided content.
Access & Distribution
Access Model | +0.25 | Article 19, Article 27 | Model weights available on HuggingFace, code on GitHub, and package on PyPI, all standard open-source distribution channels supporting broad access and participation.
Ad/Tracking | - | - | No advertising or tracking mechanisms observable in provided content.
Accessibility | +0.15 | Article 26 | Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility. No alt-text provided for technical visualizations or chart images.
+0.65
Article 19 Freedom of Expression
High Practice Advocacy
Structural
+0.65
Context Modifier
ND
SETL
-0.25

The domain provides actionable open-source distribution: explicit hyperlinks to HuggingFace model weights, GitHub source code, and PyPI package. All distributed without indicated paywalls or access restrictions, enabling global freedom of expression and research access. DCP access_model modifier (+0.25) applies—'Model weights available on HuggingFace, code on GitHub, and package on PyPI—all standard open-source distribution channels supporting broad access and participation.'

+0.55
Article 27 Cultural Participation
High Practice Framing
Structural
+0.55
Context Modifier
ND
SETL
+0.17

The domain provides mechanisms for active scientific participation: open-source code on GitHub, model weights on HuggingFace, methodology fully disclosed. Release of complete implementation details enables global research community to validate, reproduce, extend, and build upon the work. DCP access_model modifier (+0.25) applies—'code on GitHub... model weights available on HuggingFace... supporting broad access and participation.'

+0.35
Article 26 Education
High Coverage Practice
Structural
+0.35
Context Modifier
ND
SETL
+0.14

The site provides multiple learning channels: HuggingFace for interactive model exploration, GitHub for code study, PyPI for installation, and blog for conceptual foundation. DCP accessibility modifier (+0.15) applies—'Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility'—enabling access across different ability levels and technical backgrounds.
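
For the three articles scored in both channels, the Article Heatmap values are consistent with a 0.6/0.4 blend of the Editorial and Structural scores, matching the "Channels E: 0.6 S: 0.4" line under Aggregates. The sketch below assumes that blend and assumes that an article with no Structural score (ND) simply keeps its Editorial score; the `combine` function is hypothetical, not the site's code.

```python
from typing import Optional

# Channel weights as listed under Aggregates.
EDITORIAL_WEIGHT = 0.6
STRUCTURAL_WEIGHT = 0.4

def combine(editorial: float, structural: Optional[float]) -> float:
    """Blend the channel scores; fall back to editorial when structural is ND."""
    if structural is None:
        return round(editorial, 2)
    blended = EDITORIAL_WEIGHT * editorial + STRUCTURAL_WEIGHT * structural
    return round(blended, 2)

# (editorial, structural) pairs copied from the two channel sections.
scores = {
    "Article 19": (0.55, 0.65),
    "Article 27": (0.60, 0.55),
    "Article 26": (0.40, 0.35),
    "Article 18": (0.35, None),  # Structural channel is ND for this article
}

combined = {article: combine(e, s) for article, (e, s) in scores.items()}
for article, value in combined.items():
    print(f"{article}: {value:+.2f}")
```

The results agree with the heatmap entries for Articles 19, 27, 26, and 18 (+0.59, +0.58, +0.38, +0.35).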

ND
Preamble Preamble
Medium Advocacy Framing

Not applicable at Preamble level.

ND
Article 1 Freedom, Equality, Brotherhood
Low Advocacy

Not applicable

ND
Article 2 Non-Discrimination

Not applicable

ND
Article 3 Life, Liberty, Security

Not applicable

ND
Article 4 No Slavery

Not applicable

ND
Article 5 No Torture

Not applicable

ND
Article 6 Legal Personhood
Low Framing

Not applicable

ND
Article 7 Equality Before Law

Not applicable

ND
Article 8 Right to Remedy

Not applicable

ND
Article 9 No Arbitrary Detention

Not applicable

ND
Article 10 Fair Hearing

Not applicable

ND
Article 11 Presumption of Innocence

Not applicable

ND
Article 12 Privacy
Low Practice

Not applicable

ND
Article 13 Freedom of Movement

Not applicable

ND
Article 14 Asylum

Not applicable

ND
Article 15 Nationality

Not applicable

ND
Article 16 Marriage & Family

Not applicable

ND
Article 17 Property

Not applicable

ND
Article 18 Freedom of Thought
Medium Advocacy Practice

Not applicable

ND
Article 20 Assembly & Association

Not applicable

ND
Article 21 Political Participation

Not applicable

ND
Article 22 Social Security

Not applicable

ND
Article 23 Work & Equal Pay
Low Advocacy

Not applicable

ND
Article 24 Rest & Leisure

Not applicable

ND
Article 25 Standard of Living

Not applicable

ND
Article 28 Social & International Order
Medium Advocacy

Not applicable

ND
Article 29 Duties to Community
Medium Practice Advocacy

Not applicable

ND
Article 30 No Destruction of Rights

Not applicable

Supplementary Signals
How this content communicates, beyond directional lean.
Epistemic Quality
How well-sourced and evidence-based is this content?
0.67 medium claims
Sources
0.8
Evidence
0.7
Uncertainty
0.5
Purpose
0.8
Propaganda Flags
No manipulative rhetoric detected
0 techniques detected
Emotional Tone
Emotional character: positive/negative, intensity, authority
measured
Valence
+0.6
Arousal
0.4
Dominance
0.7
Transparency
Does the content identify its author and disclose interests?
0.33
✓ Author
Solution Orientation
Does this content offer solutions or only describe problems?
0.91 solution oriented
Reader Agency
0.8
Stakeholder Voice
Whose perspectives are represented in this content?
0.35 2 perspectives
Speaks: institution
About: individuals, community
Temporal Framing
Is this content looking backward, at the present, or forward?
present medium term
Geographic Scope
What geographic area does this content cover?
global
Complexity
How accessible is this content to a general audience?
technical high jargon domain specific
Longitudinal 1391 HN snapshots · 7 evals
Audit Trail 27 entries
2026-02-28 13:32 model_divergence Cross-model spread 0.35 exceeds threshold (3 models) - -
2026-02-28 13:32 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 13:32 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech tutorial no rights stance
2026-02-28 13:29 model_divergence Cross-model spread 0.35 exceeds threshold (3 models) - -
2026-02-28 13:29 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 13:29 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
reasoning
ED technical AI research no rights stance
2026-02-28 13:27 model_divergence Cross-model spread 0.35 exceeds threshold (2 models) - -
2026-02-28 13:27 eval_success Lite evaluated: Neutral (0.00) - -
2026-02-28 13:27 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
reasoning
Tech tutorial no rights stance
2026-02-28 09:54 eval Evaluated by claude-haiku-4-5-20251001: +0.35 (Moderate positive)
2026-02-28 01:34 dlq_replay DLQ message 97522 replayed to EVAL_QUEUE: Steering interpretable language models with concept algebra - -
2026-02-28 00:31 eval_success Light evaluated: Neutral (0.00) - -
2026-02-28 00:31 eval Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
reasoning
Tech tutorial no rights stance
2026-02-26 23:27 eval_success Evaluated: Moderate positive (0.56) - -
2026-02-26 23:27 eval Evaluated by deepseek-v3.2: +0.56 (Moderate positive) 14,629 tokens
2026-02-26 22:36 eval_success Light evaluated: Neutral (0.00) - -
2026-02-26 22:36 eval Evaluated by llama-4-scout-wai: 0.00 (Neutral)
reasoning
ED technical AI research no rights stance
2026-02-26 22:15 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 22:13 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 22:12 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 22:11 rate_limit OpenRouter rate limited (429) model=llama-3.3-70b - -
2026-02-26 18:43 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 18:40 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 18:40 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 18:39 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 18:38 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -
2026-02-26 18:38 dlq Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra - -