+0.29 Steering interpretable language models with concept algebra

Name: HRCB Evaluation: Steering interpretable language models with concept algebra
Item: Steering interpretable language models with concept algebra
Rating: 0.349
Author: Human Rights Observatory

Model:
@cf/meta/llama-3.3-70b-instruct-fp8-fast (lite)	0.00
@cf/meta/llama-4-scout-17b-16e-instruct (lite)	0.00
claude-haiku-4-5-20251001	+0.29
meta-llama/llama-3.3-70b-instruct:free	ND
nvidia/nemotron-3-nano-30b-a3b:free	ND

+0.29	Steering interpretable language models with concept algebra (www.guidelabs.ai S:+0.52)
	75 points by luulinh90s 4 days ago | 8 comments on HN | Moderate positive Contested Editorial · v3.7 · 2026-02-28 09:54:05

Summary: AI Transparency & Scientific Access Advocates

This research announcement describes Steerling-8B, an interpretable language model enabling direct concept-level control at inference time. The work engages primarily with rights to freedom of expression (Article 19), education (Article 26), and participation in scientific/cultural life (Article 27) through open-source distribution of model weights, code, and comprehensive technical documentation. The content advocates for democratizing AI transparency and development capabilities, emphasizing that enabling reliable AI control requires architectural design that makes concepts human-understandable and globally accessible.
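As a rough illustration of what concept-level control at inference time could look like, the toy sketch below routes every prediction through a small set of named concept activations and edits those activations before decoding. The concept names, weights, and the `predict`, `inject`, and `suppress` helpers are all hypothetical and chosen only for illustration; this is not Steerling-8B's actual architecture or API.

```python
# Toy concept bottleneck: token scores depend ONLY on named concept activations.
# All names and weights below are illustrative assumptions, not Steerling's.

CONCEPT_TO_TOKEN_WEIGHTS = {
    "medical": {"doctor": 2.0, "pizza": -1.0, "lawsuit": 0.0},
    "food": {"doctor": -1.0, "pizza": 2.0, "lawsuit": 0.0},
    "legal": {"doctor": 0.0, "pizza": -0.5, "lawsuit": 2.0},
}
TOKENS = ["doctor", "pizza", "lawsuit"]


def predict(concepts):
    """Return the highest-scoring token for the given concept activations."""
    scores = {
        token: sum(
            activation * CONCEPT_TO_TOKEN_WEIGHTS[name][token]
            for name, activation in concepts.items()
        )
        for token in TOKENS
    }
    return max(scores, key=scores.get)


def inject(concepts, name, strength=1.0):
    """Return a copy with `strength` added to one concept's activation."""
    edited = dict(concepts)
    edited[name] = edited.get(name, 0.0) + strength
    return edited


def suppress(concepts, name):
    """Return a copy with one concept's activation set to zero."""
    edited = dict(concepts)
    edited[name] = 0.0
    return edited


base = {"medical": 0.2, "food": 1.0, "legal": 0.1}
no_food = suppress(base, "food")          # suppress a concept
legal_only = inject(no_food, "legal")     # compose: suppress, then inject

print(predict(base), predict(no_food), predict(legal_only))
```

Because the edits act on the concept vector rather than on the prompt or the weights, they can be chained in any order, which is the sense in which the operations compose in this toy.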

[Figure: Article Heatmap. Legend: Negative, Neutral, Positive, No Data]

Aggregates

Editorial Mean	+0.29	Structural Mean	+0.52
Weighted Mean	+0.35	Unweighted Mean	+0.29
Max	+0.59 Article 19	Min	+0.05 Article 12
Signal	11	No Data	20
Volatility	0.18 (Medium)
Negative	0	Channels	E: 0.6 S: 0.4
SETL	+0.02	Editorial-dominant
FW Ratio	55%	23 facts · 19 inferences
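Several of these aggregates can be rechecked from the per-article scores listed in the Editorial Channel and Structural Channel sections of this report. The sketch below does that in Python. The blending rule (0.6 × editorial + 0.4 × structural, falling back to the editorial score when no structural score exists) is an assumption inferred from the "Channels E: 0.6 S: 0.4" row; the report does not state it, although it does reproduce the reported Max and Min. The Weighted Mean and SETL formulas are not given, so they are not attempted here.

```python
# Per-article scores copied from the Editorial and Structural cards in this report.
EDITORIAL = {
    "Article 27": 0.60, "Article 19": 0.55, "Article 26": 0.40,
    "Preamble": 0.35, "Article 18": 0.35, "Article 29": 0.35,
    "Article 28": 0.20, "Article 1": 0.15, "Article 6": 0.10,
    "Article 23": 0.10, "Article 12": 0.05,
}
STRUCTURAL = {"Article 19": 0.65, "Article 27": 0.55, "Article 26": 0.35}

# Assumption: channel weights taken from the "Channels E: 0.6 S: 0.4" row.
W_EDITORIAL, W_STRUCTURAL = 0.6, 0.4


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def blended(article):
    """Assumed per-article score: weighted blend, or editorial-only if no structural score."""
    editorial = EDITORIAL[article]
    structural = STRUCTURAL.get(article)
    if structural is None:
        return editorial
    return W_EDITORIAL * editorial + W_STRUCTURAL * structural


editorial_mean = mean(EDITORIAL.values())
structural_mean = mean(STRUCTURAL.values())
scores = {article: blended(article) for article in EDITORIAL}
max_article = max(scores, key=scores.get)
min_article = min(scores, key=scores.get)

print(f"Editorial Mean  {editorial_mean:+.2f}")
print(f"Structural Mean {structural_mean:+.2f}")
print(f"Max {scores[max_article]:+.2f} {max_article}")
print(f"Min {scores[min_article]:+.2f} {min_article}")
```

Under that assumption the script prints +0.29, +0.52, +0.59 for Article 19, and +0.05 for Article 12, matching the table.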

Evidence	20% coverage

3 High · 4 Medium · 4 Low · 20 No Data

[Figure: Theme Radar]

HN Discussion 2 top-level · 3 replies

giang_at_glai 2026-02-26 06:11 UTC

Author here.

This post shows “concept algebra” on a language model: inject, suppress, and compose human-understandable concepts at inference time (no retraining, no prompt engineering).

There’s an interactive demo on the post.

Would love feedback on: (1) what steering tasks you’d benchmark, (2) failure cases you’d want to see, (3) whether this kind of compositional control is useful in real products.

AIorNot 2026-02-27 07:12 UTC

How good would this steering be for function calling as part of an agent, to keep the agent on task or to act as a guardrail?

anon291 2026-02-26 20:44 UTC

I would personally like some quantification of how good this is compared to just replacing the system prompt of an off-the-shelf 8B-parameter language model.

The suppression bit is very powerful. I would like to see a quantification of how often a steered 'normal' language model will mention things you asked it to suppress vs. how often this one does.

didgeoridoo 2026-02-27 03:56 UTC

Hi! Have you published the concept dictionary yet? I’m looking into using Steerling to investigate how different moral scenarios elicit various responses in LLMs (using Haidt MFT concepts mostly), and my first few inference runs have been hamstrung by not having a canonical mapping of concepts to IDs. Thanks!

luulinh90s 2026-02-27 08:17 UTC

We haven’t benchmarked our steering for scaffolding function-calling in an agent loop yet (and the model we are using is just a base model), so I can’t give a quantitative claim. But concept-based steering should be a good fit for keeping the agent on task and enforcing behavioral guardrails around tool use.

In practice, you can treat concepts as soft/hard constraints to bias the agent toward: (1) calling tools only when needed, (2) selecting the right tool/function, or (3) using the correct argument schema.
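One way to picture the soft/hard distinction in the reply above is as two kinds of edits to a score table over candidate tool calls: a soft constraint adds a bias, and a hard constraint removes an option outright. The sketch below is a toy under stated assumptions. The tool names, scores, and the `apply_constraints` helper are hypothetical, and the thread does not describe how Steerling maps concepts onto tool choices.

```python
import math

# Hypothetical raw scores an agent might assign to its next action.
tool_logits = {"search_web": 1.2, "delete_file": 1.5, "no_tool": 0.8}


def softmax(logits):
    """Convert scores to probabilities; options at -inf get probability 0."""
    finite_max = max(v for v in logits.values() if v != -math.inf)
    exps = {name: math.exp(v - finite_max) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def apply_constraints(logits, soft_bias=None, hard_block=None):
    """Soft constraints shift scores; hard constraints forbid options entirely."""
    adjusted = dict(logits)
    for name, bias in (soft_bias or {}).items():
        adjusted[name] += bias
    for name in (hard_block or ()):
        adjusted[name] = -math.inf
    return adjusted


constrained = apply_constraints(
    tool_logits,
    soft_bias={"no_tool": 1.0},      # nudge toward calling tools only when needed
    hard_block=["delete_file"],      # guardrail: this tool can never be selected
)
probs = softmax(constrained)
choice = max(probs, key=probs.get)
print(choice, probs)
```

The soft bias can still be outvoted by a strong enough raw score, while the blocked tool keeps probability zero regardless of how the model scores it.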

Editorial Channel

What the content says

+0.60

Article 27 Cultural Participation

High Practice Framing

Editorial

+0.60

SETL

+0.17

The post emphasizes participation in scientific understanding: advancing 'shared scientific understanding' through transparent research design. Complete methodology and evaluation framework disclosed, enabling global scientific community validation and extension. DCP mission modifier (+0.2) applies—organization emphasizes 'interpretability and transparency in AI systems' and 'advancing shared scientific understanding.'

+0.55

Article 19 Freedom of Expression

High Practice Advocacy

Editorial

+0.55

SETL

-0.25

The post advocates for open-source distribution and transparency as superior to closed alternatives ('fundamentally different from prompt engineering, RLHF, or post-hoc methods'). Strong editorial emphasis on enabling freedom of expression through shared technical knowledge.

+0.40

Article 26 Education

High Coverage Practice

Editorial

+0.40

SETL

+0.14

The post provides detailed technical documentation including architecture explanation ('an architectural bottleneck that forces every prediction through human-interpretable concepts'), methodology, code examples, and quantitative evaluation framework. Comprehensive educational content enables learning about AI interpretability and model design.

+0.35

Preamble

Medium Advocacy Framing

Editorial

+0.35

SETL

The post emphasizes 'human-understandable concepts' and rational, transparent design as the foundation for AI systems, aligning with UDHR Preamble's appeal to 'reason and conscience.' The work promotes human dignity by making AI systems interpretable and controllable rather than opaque.

+0.35

Article 18 Freedom of Thought

Medium Advocacy Practice

Editorial

+0.35

SETL

The post describes enabling users to 'add, remove, and compose human-understandable concepts' to control AI behavior, supporting freedom of thought and conscience. Users can suppress unwanted model outputs, protecting intellectual freedom from unwanted AI responses.

+0.35

Article 29 Duties to Community

Medium Practice Advocacy

Editorial

+0.35

SETL

The DCP notes the organization's mission emphasizes 'advancing shared scientific understanding.' The work fulfills community duty by freely publishing research and code rather than restricting knowledge to proprietary advantage.

+0.20

Article 28 Social & International Order

Medium Advocacy

Editorial

+0.20

SETL

The post describes steering applications for safety-critical domains: 'content moderation that must suppress toxicity yet preserve fluency' and 'health assistant that needs to provide medical guidance.' Interpretable, controllable AI contributes to just and safe systems.

+0.15

Article 1 Freedom, Equality, Brotherhood

Low Advocacy

Editorial

+0.15

SETL

The post describes enabling 'reliable, composable, fine-grained control' available to any user, implicitly suggesting equal access to AI control mechanisms. Open-source distribution supports equality of access.

+0.10

Article 6 Legal Personhood

Low Framing

Editorial

+0.10

SETL

The post explains how the concept module makes AI decision-making 'human-interpretable,' relating to recognizing entities worthy of understanding. Making internal AI processes externally visible relates distantly to recognition.

+0.10

Article 23 Work & Equal Pay

Low Advocacy

Editorial

+0.10

SETL

The page advertises research positions (Careers link) and open-source contribution opportunities. Open-source model enables distributed research participation beyond formal employment.

+0.05

Article 12 Privacy

Low Practice

Editorial

+0.05

SETL

The post focuses on accessing and modifying internal AI representations for human benefit. This differs from human privacy protection; it concerns AI transparency rather than human data privacy. Mildly positive as it emphasizes user agency.

Article 2 Non-Discrimination

Not addressed in this content

Article 3 Life, Liberty, Security

Not addressed in this content

Article 4 No Slavery

Not addressed in this content

Article 5 No Torture

Not addressed in this content

Article 7 Equality Before Law

Not addressed in this content

Article 8 Right to Remedy

Not addressed in this content

Article 9 No Arbitrary Detention

Not addressed in this content

Article 10 Fair Hearing

Not addressed in this content

Article 11 Presumption of Innocence

Not addressed in this content

Article 13 Freedom of Movement

Not addressed in this content

Article 14 Asylum

Not addressed in this content

Article 15 Nationality

Not addressed in this content

Article 16 Marriage & Family

Not addressed in this content

Article 17 Property

Not addressed in this content

Article 20 Assembly & Association

Not addressed in this content

Article 21 Political Participation

Not addressed in this content

Article 22 Social Security

Not addressed in this content

Article 24 Rest & Leisure

Not addressed in this content

Article 25 Standard of Living

Not addressed in this content

Article 30 No Destruction of Rights

Not addressed in this content

Structural Channel

What the site does

Domain Context Profile

Element	Modifier	Affects	Note
Legal & Terms
Privacy	—		No privacy policy or data handling disclosure observable on provided content.
Terms of Service	—		No terms of service or user agreement observable on provided content.
Identity & Mission
Mission	+0.20	Article 27	Organization's mission emphasizes interpretability and transparency in AI systems, with open-source code and model weights released publicly, advancing shared scientific understanding.
Editorial Code	—		No editorial standards or corrections policy observable on provided content.
Ownership	—		Guide Labs identified as publisher/organization; private entity status not confirmed from provided content.
Access & Distribution
Access Model	+0.25	Article 19 Article 27	Model weights available on HuggingFace, code on GitHub, and package on PyPI—all standard open-source distribution channels supporting broad access and participation.
Ad/Tracking	—		No advertising or tracking mechanisms observable in provided content.
Accessibility	+0.15	Article 26	Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility. No alt-text provided for technical visualizations or chart images.

+0.65

Article 19 Freedom of Expression

High Practice Advocacy

Structural

+0.65

Context Modifier

SETL

-0.25

The domain provides actionable open-source distribution: explicit hyperlinks to HuggingFace model weights, GitHub source code, and PyPI package. All distributed without indicated paywalls or access restrictions, enabling global freedom of expression and research access. DCP access_model modifier (+0.25) applies—'Model weights available on HuggingFace, code on GitHub, and package on PyPI—all standard open-source distribution channels supporting broad access and participation.'

+0.55

Article 27 Cultural Participation

High Practice Framing

Structural

+0.55

Context Modifier

SETL

+0.17

The domain provides mechanisms for active scientific participation: open-source code on GitHub, model weights on HuggingFace, methodology fully disclosed. Release of complete implementation details enables global research community to validate, reproduce, extend, and build upon the work. DCP access_model modifier (+0.25) applies—'code on GitHub... model weights available on HuggingFace... supporting broad access and participation.'

+0.35

Article 26 Education

High Coverage Practice

Structural

+0.35

Context Modifier

SETL

+0.14

The site provides multiple learning channels: HuggingFace for interactive model exploration, GitHub for code study, PyPI for installation, and blog for conceptual foundation. DCP accessibility modifier (+0.15) applies—'Interactive model explorer with keyboard navigation and semantic HTML structure supports accessibility'—enabling access across different ability levels and technical backgrounds.

Preamble

Medium Advocacy Framing

Not applicable at Preamble level.

Article 1 Freedom, Equality, Brotherhood

Low Advocacy

Not applicable

Article 2 Non-Discrimination

Not applicable

Article 3 Life, Liberty, Security

Not applicable

Article 4 No Slavery

Not applicable

Article 5 No Torture

Not applicable

Article 6 Legal Personhood

Low Framing

Not applicable

Article 7 Equality Before Law

Not applicable

Article 8 Right to Remedy

Not applicable

Article 9 No Arbitrary Detention

Not applicable

Article 10 Fair Hearing

Not applicable

Article 11 Presumption of Innocence

Not applicable

Article 12 Privacy

Low Practice

Not applicable

Article 13 Freedom of Movement

Not applicable

Article 14 Asylum

Not applicable

Article 15 Nationality

Not applicable

Article 16 Marriage & Family

Not applicable

Article 17 Property

Not applicable

Article 18 Freedom of Thought

Medium Advocacy Practice

Not applicable

Article 20 Assembly & Association

Not applicable

Article 21 Political Participation

Not applicable

Article 22 Social Security

Not applicable

Article 23 Work & Equal Pay

Low Advocacy

Not applicable

Article 24 Rest & Leisure

Not applicable

Article 25 Standard of Living

Not applicable

Article 28 Social & International Order

Medium Advocacy

Not applicable

Article 29 Duties to Community

Medium Practice Advocacy

Not applicable

Article 30 No Destruction of Rights

Not applicable

Supplementary Signals

How this content communicates, beyond directional lean.

Epistemic Quality

How well-sourced and evidence-based is this content?

0.67 medium claims

Sources		0.8
Evidence		0.7
Uncertainty		0.5
Purpose		0.8

Propaganda Flags

No manipulative rhetoric detected

0 techniques detected

Emotional Tone

Emotional character: positive/negative, intensity, authority

measured

Valence		+0.6
Arousal		0.4
Dominance		0.7

Transparency

Does the content identify its author and disclose interests?

0.33

✓ Author

More signals: context, framing & audience

Solution Orientation

Does this content offer solutions or only describe problems?

0.91 solution oriented

Reader Agency

0.8

Stakeholder Voice

Whose perspectives are represented in this content?

0.35 2 perspectives

Speaks: institution

About: individuals, community

Temporal Framing

Is this content looking backward, at the present, or forward?

present medium term

Geographic Scope

What geographic area does this content cover?

global

Complexity

How accessible is this content to a general audience?

technical high jargon domain specific

Longitudinal 1391 HN snapshots · 7 evals

Audit Trail 27 entries

2026-02-28 13:32	model_divergence	Cross-model spread 0.35 exceeds threshold (3 models)	- -
2026-02-28 13:32	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-02-28 13:32	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
	reasoning Tech tutorial no rights stance
2026-02-28 13:29	model_divergence	Cross-model spread 0.35 exceeds threshold (3 models)	- -
2026-02-28 13:29	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-02-28 13:29	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral) 0.00
	reasoning ED technical AI research no rights stance
2026-02-28 13:27	model_divergence	Cross-model spread 0.35 exceeds threshold (2 models)	- -
2026-02-28 13:27	eval_success	Lite evaluated: Neutral (0.00)	- -
2026-02-28 13:27	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral) 0.00
	reasoning Tech tutorial no rights stance
2026-02-28 09:54	eval	Evaluated by claude-haiku-4-5-20251001: +0.35 (Moderate positive)
2026-02-28 01:34	dlq_replay	DLQ message 97522 replayed to EVAL_QUEUE: Steering interpretable language models with concept algebra	- -
2026-02-28 00:31	eval_success	Light evaluated: Neutral (0.00)	- -
2026-02-28 00:31	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)
	reasoning Tech tutorial no rights stance
2026-02-26 23:27	eval_success	Evaluated: Moderate positive (0.56)	- -
2026-02-26 23:27	eval	Evaluated by deepseek-v3.2: +0.56 (Moderate positive) 14,629 tokens
2026-02-26 22:36	eval_success	Light evaluated: Neutral (0.00)	- -
2026-02-26 22:36	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral)
	reasoning ED technical AI research no rights stance
2026-02-26 22:15	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 22:13	rate_limit	OpenRouter rate limited (429) model=llama-3.3-70b	- -
2026-02-26 22:12	rate_limit	OpenRouter rate limited (429) model=llama-3.3-70b	- -
2026-02-26 22:11	rate_limit	OpenRouter rate limited (429) model=llama-3.3-70b	- -
2026-02-26 18:43	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 18:40	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 18:40	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 18:39	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 18:38	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -
2026-02-26 18:38	dlq	Dead-lettered after 1 attempts: Steering interpretable language models with concept algebra	- -

build f220d37+y7wl · deployed 2026-03-02 19:08 UTC · evaluated 2026-03-02 19:05:25 UTC