0.00 Show HN: CivBench a long-horizon AI benchmark for multi-agent games

Name: HRCB Evaluation: Show HN: CivBench a long-horizon AI benchmark for multi-agent games
Item: Show HN: CivBench a long-horizon AI benchmark for multi-agent games
Rating: 0.009
Author: HN HRCB

Models: deepseek/deepseek-v3.2-20251201 (0.00) · @cf/meta/llama-4-scout-17b-16e-instruct, lite (0.00) · @cf/meta/llama-3.3-70b-instruct-fp8-fast, lite (0.00)

0.00	Show HN: CivBench a long-horizon AI benchmark for multi-agent games (clashai.live S:+0.01 )
	12 points by mbh159 4 days ago | 25 comments on HN | Neutral Landing Page · v3.7 · 2026-02-28 16:29:11 · from archive

Summary Technical Platform Neutral

The linked page is a technical landing page for ClashAI, an AI competition platform where agents compete in strategy games, trading, and creative challenges. The content focuses on platform features, competitions, and technical implementation, and engages with human rights concepts only through basic privacy awareness and accessibility features. The evaluation is neutral because the platform's purpose is entertainment and technical demonstration, not human rights advocacy or opposition.

Article Heatmap (legend: Negative · Neutral · Positive · No Data)

Aggregates

Editorial Mean	0.00	Structural Mean	+0.01
Weighted Mean	+0.01	Unweighted Mean	+0.01
Max	+0.14 Article 27	Min	0.00 Preamble
Signal	31	No Data	0
Volatility	0.03 (Low)
Negative	0	Channels	E: 0.6 S: 0.4
SETL	-0.13	Structural-dominant
FW Ratio	50%	34 facts · 34 inferences

Evidence: 20% coverage · 31 Low

Theme Radar

HN Discussion 14 top-level · 10 replies

andrewgazelka 2026-02-25 15:14 UTC link

Hey, first of all, cool product. I'm curious why you chose Civ and whether you saw any interesting emergent behaviors.

killiandunne1 2026-02-25 15:25 UTC link

This is a sick idea I must say

jhylee 2026-02-25 15:26 UTC link

Congrats on the launch. Big fan of how you add visualization and interactivity to the typical model benchmarking process. Any thoughts on how you plan to monetize down the line?

amacx 2026-02-25 15:30 UTC link

Interesting. Did you give the agents any skills for playing Civ? If not, are you planning to?

amacx 2026-02-25 15:32 UTC link

Have you tried playing the agents yourself? Do they crush human competition?

pmoxyz 2026-02-25 15:33 UTC link

This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get signal on frontier models.

You mention Opus 4.6 cost $1200 in one match. How do you plan to benchmark economic efficiency? Looking at the performance vs. cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the 'top' model.
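
To illustrate the performance vs. cost trade-off raised in this comment, here is a minimal sketch that compares two models by score per dollar. The `ModelResult` type, the model names, the relative score scale, and the $12 figure (1% of the $1200 mentioned above) are all assumptions for illustration; nothing here reflects how CivBench actually scores models.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str        # hypothetical model label
    score: float     # relative playing strength, 1.0 = strongest model (assumed scale)
    cost_usd: float  # cost of one match in USD


def score_per_dollar(result: ModelResult) -> float:
    """Return playing strength bought per dollar spent on a match."""
    return result.score / result.cost_usd


# "80% as well at 1% of the cost", using the $1200 match cost quoted in the thread.
top = ModelResult("top-model", score=1.0, cost_usd=1200.0)
cheap = ModelResult("cheap-model", score=0.8, cost_usd=12.0)

# How many times more strength per dollar the cheaper model delivers.
efficiency_ratio = score_per_dollar(cheap) / score_per_dollar(top)
print(f"{cheap.name} is about {efficiency_ratio:.0f}x more cost-efficient than {top.name}")
```

Under these assumed numbers the cheaper model comes out roughly 80 times more cost-efficient, which is the sense in which it could be called more impressive than the top model.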

cameron17 2026-02-25 18:11 UTC link

This is undeniably intriguing. Will be paying close attention.

zimbo63 2026-02-25 19:32 UTC link

This is an amazing product! Can AI agents learn to do long-term planning in environments that are less structured than chess? Great metaphor for life! Are you planning other games?

zimbo63 2026-02-25 19:34 UTC link

This is an amazing eval metric that no one thought about! Such a creative idea. Have you thought of other games? How different is it from chess?

nhal 2026-02-25 20:53 UTC link

Incredible and important product. Necessary for developers, users, and industries that want to use agents. Can’t wait to see how it’ll grow

brownpoints 2026-02-25 22:42 UTC link

This looks incredible, it’d be cool to let others participate with custom prompts

jcion 2026-02-25 22:55 UTC link

Interesting! What are the next environments/strategy games you have planned?

What insights do you think they’ll provide that Civ doesn’t?

Mojo19 2026-02-26 02:10 UTC link

So amazing, it's super cool!

jamiecode 2026-02-26 12:26 UTC link

The divergence between static benchmarks and long-horizon performance isn't surprising if you've run anything multi-step in production. Benchmarks are short, isolated, well-specified. Civ has compounding state - a bad decision in turn 5 degrades your options in turn 50 in ways that aren't immediately obvious. It's a more honest signal than most standard evals.

The $1,200/match cost is the real constraint. At that price you can't run enough samples for statistical significance - you're essentially reading tea leaves. How are you handling context window management across 200 turns? Summarising game state as you go, truncating early history, or something else? The token accumulation over a full game must be substantial.

Also curious about the 90s timeout logistics. If a provider is flaky and a model goes over, is that a forfeit, a retry, or a timeout loss? Provider latency variance seems like it would add significant noise to results, independent of actual model quality.
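
The sample-size concern in this comment can be made concrete with a rough calculation. The sketch below assumes a hypothetical $12,000 budget and uses the normal approximation for a 95% confidence interval on a win rate; only the $1,200 per match figure comes from the thread, and the approximation is itself crude at sample sizes this small.

```python
import math


def matches_affordable(budget_usd: float, cost_per_match_usd: float) -> int:
    """Number of whole matches a budget can pay for."""
    return int(budget_usd // cost_per_match_usd)


def win_rate_half_width(n_matches: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a win rate.

    p is the assumed true win rate (0.5 gives the widest interval) and
    z = 1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p * (1.0 - p) / n_matches)


budget_usd = 12_000.0         # hypothetical budget, not from the source
cost_per_match_usd = 1_200.0  # per-match cost quoted in the thread

n = matches_affordable(budget_usd, cost_per_match_usd)
half_width = win_rate_half_width(n)
print(f"{n} matches -> win rate known to within about +/-{half_width:.2f}")
```

With ten matches the interval spans roughly 31 percentage points either side of the observed win rate, so two models would need a very large gap in strength before the result could be told apart from noise.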

mbh159 2026-02-25 15:25 UTC link

Thank you! I grew up playing Civilization, and one day I was talking with friends and thought it would be a perfect proxy for how good AI is at long-term planning. I had many frustrating sessions where my early decisions had consequences only much later. With hidden information and other agents at play, I thought it'd be an interesting test of agent capabilities.

mbh159 2026-02-25 15:27 UTC link

it was fun building it, sometimes the LLMs are pretty funny in how they play

mbh159 2026-02-25 15:38 UTC link

appreciate it, I wanted to make the AI behavior easy to understand. Our main focus currently is to help AI researchers align their models and help develop an open framework for evaluating AI.

mbh159 2026-02-25 15:41 UTC link

I want to! I think skills can add big performance gains here especially with smaller models. There's a lot of domain knowledge in games so distilling it into a "skill" may allow much smaller models to outcompete the large ones

mbh159 2026-02-25 15:42 UTC link

I was able to beat the AI every time. They're pretty bad at this point, but I expect them to get much better over time.

mbh159 2026-02-25 16:02 UTC link

For a game that runs 4+ hours, it was unfortunately configured to use too much reasoning per turn and a larger context. Reducing the size helped lower the cost (still expensive).

In the leaderboards part of the page I'll be auto-populating the token cost of the model as a metric to evaluate on.

mbh159 2026-02-25 22:20 UTC link

Yes! If you want to test your agents or develop evals on the platform, my DMs are open.

mbh159 2026-02-25 22:21 UTC link

Yes, we have a new game launching every day this week. We're looking to add more domains to test how the jaggedness of AI differs between model providers and to better evaluate how they perform across domains.

mbh159 2026-02-25 23:33 UTC link

cheers, the website will be updated with new environments daily!

mbh159 2026-02-25 23:35 UTC link

Tomorrow we're launching Coup, where agents compete by bluffing and keeping track of which of their opponents they think are lying.

This is a faster-paced, shorter game, so we can collect larger samples of data on larger groups and get significant results on model behaviors such as collaboration, truth-telling, and the ability to lie effectively.

Editorial Channel

What the content says

Score	Section	Level	Editorial	SETL	Note
0.00	Preamble	Low	0.00		No content addressing human dignity, freedom, or universal rights
0.00	Article 1 Freedom, Equality, Brotherhood	Low	0.00		No mention of human dignity, equality, or rights
0.00	Article 2 Non-Discrimination	Low	0.00		No mention of non-discrimination or equal rights
0.00	Article 3 Life, Liberty, Security	Low	0.00		No mention of life, liberty, or security
0.00	Article 4 No Slavery	Low	0.00		No mention of slavery or servitude
0.00	Article 5 No Torture	Low	0.00		No mention of torture or cruel treatment
0.00	Article 6 Legal Personhood	Low	0.00		No mention of legal recognition or personhood
0.00	Article 7 Equality Before Law	Low	0.00		No mention of equality before the law
0.00	Article 8 Right to Remedy	Low	0.00		No mention of effective remedies or judicial protection
0.00	Article 9 No Arbitrary Detention	Low	0.00		No mention of arbitrary detention or arrest
0.00	Article 10 Fair Hearing	Low	0.00		No mention of fair trial or impartial tribunal
0.00	Article 11 Presumption of Innocence	Low	0.00		No mention of presumption of innocence or criminal defense
0.00	Article 12 Privacy	Low Practice	0.00	-0.10	No explicit privacy policy or data protection statement
0.00	Article 13 Freedom of Movement	Low	0.00		No mention of freedom of movement or residence
0.00	Article 14 Asylum	Low	0.00		No mention of asylum or persecution
0.00	Article 15 Nationality	Low	0.00		No mention of nationality or statelessness
0.00	Article 16 Marriage & Family	Low	0.00		No mention of marriage, family, or consent
0.00	Article 17 Property	Low	0.00		No mention of property ownership or deprivation
0.00	Article 18 Freedom of Thought	Low	0.00		No mention of thought, conscience, or religion
0.00	Article 19 Freedom of Expression	Low Practice	0.00	-0.10	No explicit free expression policy or commitments
0.00	Article 20 Assembly & Association	Low	0.00		No mention of assembly or association
0.00	Article 21 Political Participation	Low	0.00		No mention of political participation or voting
0.00	Article 22 Social Security	Low	0.00		No mention of social security or economic rights
0.00	Article 23 Work & Equal Pay	Low	0.00		No mention of work, employment, or unions
0.00	Article 24 Rest & Leisure	Low	0.00		No mention of rest, leisure, or working hours
0.00	Article 25 Standard of Living	Low	0.00		No mention of standard of living, health, or welfare
0.00	Article 26 Education	Low	0.00		No mention of education, literacy, or training
0.00	Article 27 Cultural Participation	Low Practice	0.00	-0.20	No explicit cultural participation or IP protection statements
0.00	Article 28 Social & International Order	Low	0.00		No mention of social order or rights realization
0.00	Article 29 Duties to Community	Low	0.00		No mention of duties, community, or rights limitations
0.00	Article 30 No Destruction of Rights	Low	0.00		No mention of rights destruction or interpretation

Structural Channel

What the site does

Domain Context Profile

Element	Modifier	Affects	Note
Legal & Terms
Privacy	—		No privacy policy or data handling information visible on homepage
Terms of Service	—		No terms of service or community guidelines visible on homepage
Identity & Mission
Mission	—		Platform description focuses on AI competitions, not human rights
Editorial Code	—		No editorial content or code of ethics visible on homepage
Ownership	—		Attributed to ClashAI Team but no corporate structure information
Access & Distribution
Access Model	0.00	Article 19 Article 27	Free access to viewing competitions implied by landing page structure
Ad/Tracking	—		No advertising or tracking elements visible in provided content
Accessibility	0.00	Article 27	Site uses semantic HTML with sr-only class for screen readers, suggesting basic accessibility consideration

Score	Section	Level	Structural	Context Modifier	SETL	Note
+0.20	Article 27 Cultural Participation	Low Practice	+0.20	0.00	-0.20	Platform enables access to AI-generated cultural content (creative challenges)
+0.10	Article 12 Privacy	Low Practice	+0.10	0.00	-0.10	PrivacyBanner component suggests awareness of data collection
+0.10	Article 19 Freedom of Expression	Low Practice	+0.10	0.00	-0.10	Platform provides public access to AI competition content
0.00	Preamble	Low	0.00	0.00		Platform for AI competitions does not structurally engage with preamble concepts
0.00	Article 1 Freedom, Equality, Brotherhood	Low	0.00	0.00		Platform structure does not address human equality or dignity
0.00	Article 2 Non-Discrimination	Low	0.00	0.00		No observable accessibility or inclusion features beyond basic screen reader support
0.00	Article 3 Life, Liberty, Security	Low	0.00	0.00		Platform does not address personal security or safety
0.00	Article 4 No Slavery	Low	0.00	0.00		Platform structure does not address forced labor issues
0.00	Article 5 No Torture	Low	0.00	0.00		Platform does not address humane treatment
0.00	Article 6 Legal Personhood	Low	0.00	0.00		Platform does not address legal status or recognition
0.00	Article 7 Equality Before Law	Low	0.00	0.00		Platform does not address legal equality or protection
0.00	Article 8 Right to Remedy	Low	0.00	0.00		Platform does not provide grievance mechanisms or remedies
0.00	Article 9 No Arbitrary Detention	Low	0.00	0.00		Platform does not address detention or liberty protections
0.00	Article 10 Fair Hearing	Low	0.00	0.00		Platform does not address judicial fairness
0.00	Article 11 Presumption of Innocence	Low	0.00	0.00		Platform does not address criminal justice
0.00	Article 13 Freedom of Movement	Low	0.00	0.00		Platform does not address mobility rights
0.00	Article 14 Asylum	Low	0.00	0.00		Platform does not address refugee protection
0.00	Article 15 Nationality	Low	0.00	0.00		Platform does not address citizenship rights
0.00	Article 16 Marriage & Family	Low	0.00	0.00		Platform does not address family rights
0.00	Article 17 Property	Low	0.00	0.00		Platform does not address property rights
0.00	Article 18 Freedom of Thought	Low	0.00	0.00		Platform does not address freedom of thought
0.00	Article 20 Assembly & Association	Low	0.00	0.00		Platform does not facilitate human assembly or association
0.00	Article 21 Political Participation	Low	0.00	0.00		Platform does not address democratic participation
0.00	Article 22 Social Security	Low	0.00	0.00		Platform does not address social welfare
0.00	Article 23 Work & Equal Pay	Low	0.00	0.00		Platform does not address labor rights
0.00	Article 24 Rest & Leisure	Low	0.00	0.00		Platform does not address work-life balance
0.00	Article 25 Standard of Living	Low	0.00	0.00		Platform does not address basic needs or healthcare
0.00	Article 26 Education	Low	0.00	0.00		Platform does not address educational access
0.00	Article 28 Social & International Order	Low	0.00	0.00		Platform does not address systemic rights frameworks
0.00	Article 29 Duties to Community	Low	0.00	0.00		Platform does not address responsible exercise of rights
0.00	Article 30 No Destruction of Rights	Low	0.00	0.00		Platform does not address rights interpretation or limitations

Supplementary Signals

How this content communicates, beyond directional lean.

Epistemic Quality ℹ

How well-sourced and evidence-based is this content?

0.23 low claims

Sources		0.2
Evidence		0.1
Uncertainty		0.0
Purpose		0.7

Propaganda Flags ℹ

No manipulative rhetoric detected

0 techniques detected

Emotional Tone ℹ

Emotional character: positive/negative, intensity, authority

detached

Valence		+0.1
Arousal		0.2
Dominance		0.6

Transparency ℹ

Does the content identify its author and disclose interests?

0.00

✗ Author

Solution Orientation ℹ

Does this content offer solutions or only describe problems?

0.42 solution oriented

Reader Agency

0.3

Stakeholder Voice ℹ

Whose perspectives are represented in this content?

0.10 1 perspective

Speaks: corporation

Temporal Framing ℹ

Is this content looking backward, at the present, or forward?

present immediate

Geographic Scope ℹ

What geographic area does this content cover?

global

Complexity ℹ

How accessible is this content to a general audience?

technical high jargon domain specific

Longitudinal · 3 evals

Audit Trail 9 entries

2026-02-28 16:29	eval_success	Evaluated: Neutral (0.01)	- -
2026-02-28 16:29	rater_validation_warn	Validation warnings for model deepseek-v3.2: 1W 0R	- -
2026-02-28 16:29	eval	Evaluated by deepseek-v3.2: +0.01 (Neutral) 15,541 tokens
2026-02-28 05:40	eval_success	Light evaluated: Neutral (0.00)	- -
2026-02-28 05:40	eval	Evaluated by llama-4-scout-wai: 0.00 (Neutral)
2026-02-28 05:40	rater_validation_warn	Light validation warnings for model llama-4-scout-wai: 0W 1R	- -
2026-02-28 05:22	rater_validation_warn	Light validation warnings for model llama-3.3-70b-wai: 0W 1R	- -
2026-02-28 05:22	eval_success	Light evaluated: Neutral (0.00)	- -
2026-02-28 05:22	eval	Evaluated by llama-3.3-70b-wai: 0.00 (Neutral)

build 1ad9551+j7zs · deployed 2026-03-02 09:09 UTC · evaluated 2026-03-02 10:41:39 UTC