-0.20 Mercury 2: Fast reasoning LLM powered by diffusion

Name: HRCB Evaluation: Mercury 2: Fast reasoning LLM powered by diffusion
Item: Mercury 2: Fast reasoning LLM powered by diffusion
Rating: -0.45
Author: HN HRCB

Model: @cf/meta/llama-3.3-70b-instruct-fp8-fast lite 0.00 · @cf/meta/llama-4-scout-17b-16e-instruct lite 0.00 · deepseek/deepseek-v3.2-20251201 +0.25 · claude-haiku-4-5-20251001 -0.20 · meta-llama/llama-3.3-70b-instruct:free ND · nvidia/nemotron-3-nano-30b-a3b:free ND

-0.20	Mercury 2: Fast reasoning LLM powered by diffusion (www.inceptionlabs.ai S:-0.40 )
	348 points by fittingopposite 5 days ago | 128 comments on HN | Moderate negative Contested Product · v3.7 · 2026-02-26 04:45:24 0

Summary Digital Privacy & Data Collection Neutral

This page is a truncated product announcement for Mercury 2 from Inception Labs. The provided content is primarily technical markup and styling code with minimal substantive editorial text regarding human rights. The only observable human rights signal is structural: Google Analytics and Framer analytics tracking are integrated without visible user consent mechanisms, suggesting data collection practices that may not fully respect privacy rights under Article 12.

Article Heatmap (legend: Negative, Neutral, Positive, No Data)

Aggregates

Editorial Mean	-0.20	Structural Mean	-0.40
Weighted Mean	-0.45	Unweighted Mean	-0.45
Max	-0.45 Article 12	Min	-0.45 Article 12
Signal	1	No Data	30
Volatility	0.00 (Low)
Negative	1	Channels	E: 0.6 S: 0.4
SETL ℹ	+0.28	Editorial-dominant
FW Ratio ℹ	60%	3 facts · 2 inferences
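
The FW Ratio figures above are consistent with facts divided by facts plus inferences. That formula is an assumption inferred from the numbers shown, not a documented definition, and `fw_ratio` is a hypothetical helper name. A minimal check:

```python
def fw_ratio(facts: int, inferences: int) -> float:
    """Share of statements that are facts (assumed definition of FW Ratio)."""
    total = facts + inferences
    if total == 0:
        raise ValueError("need at least one fact or inference")
    return facts / total


ratio = fw_ratio(facts=3, inferences=2)
print(f"{ratio:.0%}")  # 60%
```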

Evidence 2% coverage ℹ

  1M   30 ND 

Theme Radar

HN Discussion 20 top-level · 26 replies

dvt 2026-02-25 00:09 UTC link

What excites me most about these new 4-figure tokens/second models is that you can essentially do multi-shot prompting (+ nudging) and the user doesn't even feel it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.

tl2do 2026-02-25 00:14 UTC link

Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.

cjbarber 2026-02-25 00:28 UTC link

It could be interesting to do the metric of intelligence per second.

ie intelligence per token, and then tokens per second

My current feel is that if Sonnet 4.6 was 5x faster than Opus 4.6, I'd be primarily using Sonnet 4.6. But that wasn't true for me with prior model generations, in those generations the Sonnet class models didn't feel good enough compared to the Opus class models. And it might shift again when I'm doing things that feel more intelligence bottlenecked.

But fast responses have an advantage of their own, they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out because it was 80% of the thoroughness with 20% of the time, which tended to be better overall.
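
The "intelligence per second" idea in this comment can be sketched as a product of two factors: quality per token times tokens per second. The scores and speeds below are made-up illustrations, and `ModelProfile` is a hypothetical name; no real benchmark is implied.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    name: str
    quality_per_token: float  # hypothetical "intelligence per token" score
    tokens_per_second: float  # decode speed

    @property
    def quality_per_second(self) -> float:
        # intelligence/second = (intelligence/token) * (tokens/second)
        return self.quality_per_token * self.tokens_per_second


# Illustrative numbers only, not measurements of any real model.
candidates = [
    ModelProfile("big-slow", quality_per_token=1.0, tokens_per_second=50.0),
    ModelProfile("small-fast", quality_per_token=0.7, tokens_per_second=250.0),
]
best = max(candidates, key=lambda m: m.quality_per_second)
print(best.name)
```

Under these assumed numbers the faster model wins despite the lower per-token score, which is the trade-off the comment describes; a task that is bottlenecked on intelligence would need a different weighting.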

ilaksh 2026-02-25 00:38 UTC link

It seems like the chat demo is really suffering from the effect of everything going into a queue. You can't actually tell that it is fast at all. The latency is not good.

Assuming that's what is causing this. They might show some kind of feedback when it actually makes it out of the queue.

nylonstrung 2026-02-25 01:26 UTC link

I'm not sold on diffusion models.

Other labs like Google have them but they have simply trailed the Pareto frontier for the vast majority of use cases

Here's more detail on how price/performance stacks up

https://artificialanalysis.ai/models/mercury-2

volodia 2026-02-25 01:57 UTC link

Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.

lprimeisafk 2026-02-25 02:13 UTC link

It fails the car wash test

nowittyusername 2026-02-25 02:32 UTC link

Nice, I'm excited to try this for my voice agent, at worst it could be used to power the human facing agent for latency reduction.

serjester 2026-02-25 02:53 UTC link

There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored this.

rancar2 2026-02-25 04:36 UTC link

My attempt with trying one of their OOTB prompts in the demo https://chat.inceptionlabs.ai resulted in: "The server is currently overloaded. Please try again in a moment."

And a pop-up error of: "The string did not match the expected pattern."

That happened three times, then the interface stopped working.

I was hoping to see how this stacked up against Taalas demo, which worked well and was so fast every time I've hit it this past week.

vicchenai 2026-02-25 04:49 UTC link

The iteration speed advantage is real but context-specific. For agentic workloads where you're running loops over structured data -- say, validating outputs or exploring a dataset across many small calls -- the latency difference between a 50 tok/s model and a 1000+ tok/s one compounds fast. What would take 10 minutes wall-clock becomes under a minute, which changes how you prototype.

The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.
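
The wall-clock claim above can be checked with simple arithmetic. This sketch assumes generation time dominates and that the same number of tokens is produced at either speed; `wall_clock_seconds` and the per-call overhead parameter are hypothetical.

```python
def wall_clock_seconds(
    total_tokens: int,
    tokens_per_second: float,
    calls: int = 1,
    overhead_per_call_s: float = 0.0,
) -> float:
    """Generation time plus a fixed per-call overhead (network, queueing)."""
    return total_tokens / tokens_per_second + calls * overhead_per_call_s


# 10 minutes at 50 tok/s corresponds to 30,000 generated tokens.
total_tokens = 10 * 60 * 50
slow = wall_clock_seconds(total_tokens, 50)
fast = wall_clock_seconds(total_tokens, 1000)
print(slow / 60, "min vs", fast, "s")  # 10.0 min vs 30.0 s
```

Any per-call overhead does not shrink with decode speed, so a loop made of many small calls would see less than the full 20x improvement.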

smusamashah 2026-02-25 05:58 UTC link

Does it mean if it was embedded on a Taalas chip, it could generate ~50,000+ tokens per second?

Ross00781 2026-02-25 06:45 UTC link

Diffusion-based reasoning is fascinating - curious how it handles sequential dependencies vs traditional autoregressive. For complex planning tasks where step N heavily depends on steps 1-N, does the parallel generation sometimes struggle with consistency? Or does the model learn to encode those dependencies in a way that works well during parallel sampling?

swiftcoder 2026-02-25 08:29 UTC link

Are there any open-weights diffusion LLM models I can play with on my local hardware? Curious about the performance delta of this style of model in more resource constrained scenarios (i.e. consumer Nvidia GPU, not H100s in the datacenter)

vinhnx 2026-02-25 08:38 UTC link

This research paper "Mercury: Ultra-Fast Language Models Based on Diffusion" from last year (2025)

https://arxiv.org/pdf/2506.17298

DoctorOetker 2026-02-25 09:39 UTC link

> Mercury 2 doesn't decode sequentially. It generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. Less typewriter, more editor revising a full draft at once.

There has been quite some progress unifying DDPM & SGM as SDE

> DDPM and Score-Based Models: The objective function of DDPMs (maximizing the ELBO) is equivalent to the score matching objectives used to train SGMs.

> SDE-based Formulation: Both DDPMs and SGMs can be unified under a single SDE framework, where the forward diffusion is an Ito SDE and the reverse process uses score functions to recover data.

> Flow Matching (Continuous-Time): Flow matching is equivalent to diffusion models when the source distribution corresponds to a Gaussian. Flow matching offers "straight" trajectories compared to the often curved paths of diffusion, but they share similar training objectives and weightings.

Is there a similar connection between modern transformers and diffusion?

Suppose we look at each layer or residual connection between layers, the context window of tokens (typically a power of 2), what is incrementally added to the embedding vectors is a function of the previous layer outputs, and if we have L layers, what is then the connection between those L "steps" of a transformer and similarly performing L denoising refinements of a diffusion model?

Does this allow fitting a diffusion model to a transformer and vice versa?

anshumankmr 2026-02-25 09:42 UTC link

It is capable of that seahorse faux pas. '''Is there a seahorse emoji?

Mercury 2

Thought for a minute

Well?

Mercury 2 Today at 3:06 PM

Yes – Unicode includes a seahorse emoji. It is U+1F9A0 and renders as:

(seahorse)

(If your device or browser doesn’t show the graphic, you may see a placeholder box.)

Ask

Explain

'''
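
The code point in the quoted answer can be checked against Python's bundled Unicode database. This sketch uses only the standard `unicodedata` module; results depend on the Unicode version shipped with the interpreter (Python 3.7 or later is assumed here).

```python
import unicodedata

claimed = 0x1F9A0
print(f"U+{claimed:04X} is {unicodedata.name(chr(claimed))}")

# Scan every code point for a character whose name mentions SEAHORSE.
matches = [
    f"U+{cp:04X}"
    for cp in range(0x110000)
    if "SEAHORSE" in unicodedata.name(chr(cp), "")
]
print(matches)
```

On the Unicode versions I am aware of, U+1F9A0 is named MICROBE and the scan should return an empty list.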

sorenjan 2026-02-25 11:55 UTC link

Julia Turc recently did a video about diffusion LLMs as a paid collaboration with Inception: https://www.youtube.com/watch?v=-VGeHZqOk_s

Karuma 2026-02-25 17:02 UTC link

A simple test I just did:

Me: What are some of Maradona's most notable achievements in football?

Mercury 2 (first sentence only): Dieadona’s most notable football achievements include:

Notice the spelling of "Dieadona" instead of "Maradona". Even any local 3B model can answer this question perfectly fine and instantly... Mercury 2 was so incredibly slow and full of these kinds of unforgivable mistakes.

findjashua 2026-02-25 19:10 UTC link

failed the car wash test.

i think instead of positioning as a general purpose reasoning model, they'd have more success focusing on a specific use case (eg coding agent) and benchmark against the sota open models for the use case (eg qwen3-coder-next)

irthomasthomas 2026-02-25 00:19 UTC link

multi-model arbitration, synthesis, parallel reasoning etc. Judging large models with small models is quite effective.

layoric 2026-02-25 00:27 UTC link

I think it would assist in exploring multiple solution spaces in parallel, and I can see it, with the right user in the loop and a harness wrapping tools like compilers, static analysis, tests, etc., being able to iterate very quickly on multiple solutions. An example might be, "I need to optimize this SQL query" pointed to a locally running postgres. Multiple changes could be tested and combined, with explain plans to validate performance and a test for correct results. Then only valid solutions could be presented to the developer for review. I don't personally care about the model's 'opinion' or recommendations; using them for architectural choices IMO is a flawed use as a coding tool.

It doesn't change the fact that the most important thing is verification/validation of their output, either from tools or from a developer reviewing/making decisions. But even if you don't want that approach, diffusion models are just a lot more efficient, it seems. I'm interested to see if they are just a better match for common developer tasks to assist with validation/verification systems, not just writing (likely wrong) code faster.
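
The generate-then-validate loop described in this comment can be sketched as a filter that only surfaces candidates passing every check. The candidates below are hand-written stand-ins for model-generated solutions, and `filter_candidates` is a hypothetical helper; a real harness would call compilers, tests, or a database instead.

```python
from typing import Callable


def filter_candidates(
    candidates: dict[str, Callable[[int], int]],
    cases: list[tuple[int, int]],
) -> list[str]:
    """Return names of candidates that pass every (input, expected) case."""
    valid = []
    for name, fn in candidates.items():
        try:
            if all(fn(x) == expected for x, expected in cases):
                valid.append(name)
        except Exception:
            # A crashing candidate is rejected without stopping the harness.
            continue
    return valid


# Stand-ins for model-generated solutions to "sum the integers 0..n".
candidates = {
    "loop": lambda n: sum(range(n + 1)),
    "closed_form": lambda n: n * (n + 1) // 2,
    "off_by_one": lambda n: sum(range(n)),
    "crashes": lambda n: n // 0,
}
cases = [(0, 0), (1, 1), (10, 55)]
print(filter_candidates(candidates, cases))  # ['loop', 'closed_form']
```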

cjbarber 2026-02-25 00:38 UTC link

I've tried a few computer use and browser use tools and they feel relatively tok/s bottlenecked.

And in some sense, all of my claude code usage feels tok/s bottlenecked. There's never really a time where I'm glad to wait for the tokens, I'd always prefer faster.

josephg 2026-02-25 02:01 UTC link

Yeah I agree with this. We might be able to benchmark it soon (if we can’t already) by asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt, and need less loop debugging. Who will win?

At the moment I’m loving opus 4.6 but I have no idea if its extra intelligence makes it worth using over sonnet. Some data would be great!

volodia 2026-02-25 02:03 UTC link

I’d push back a bit on the Pareto point.

On speed/quality, diffusion has actually moved the frontier. At comparable quality levels, Mercury is >5× faster than similar AR models (including the ones referenced on the AA page). So for a fixed quality target, you can get meaningfully higher throughput.

That said, I agree diffusion models today don’t yet match the very largest AR systems (Opus, Gemini Pro, etc.) on absolute intelligence. That’s not surprising: we’re starting from smaller models and gradually scaling up. The roadmap is to scale intelligence while preserving the large inference-time advantage.

volodia 2026-02-25 02:09 UTC link

Thank you for your patience. We are working to handle the surge in demand.

bigbuppo 2026-02-25 02:13 UTC link

Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.

volodia 2026-02-25 02:17 UTC link

That is also our view! We see Mercury 2 as enabling very fast iteration for agentic tasks. A single shot at a problem might be less accurate, but because the model has a shorter execution time, it enables users to iterate much more quickly.

emi2k01 2026-02-25 02:18 UTC link

The test:

> The car wash is 100m away from my house. Should I walk or drive?

I didn't find a way to share the conversation, so here it is the full response in `medium` settings:

---

Walking 100 m is a very short distance—roughly a minute or two at a normal pace. In most situations, walking will be quicker overall because you avoid the time needed to get into the car, start it, and find parking. It also saves fuel and is better for the environment.

*Consider these factors before deciding:*

| Factor | Walking | Driving |
|--------|---------|---------|
| *Time* | ~1–2 min (plus a few seconds to step out) | ~1 min driving, plus ~1–2 min to get in/out and park |
| *Energy/Fuel* | None | Uses a small amount of fuel |
| *Weather* | Comfortable if dry & mild | Better if it’s raining, very hot, or icy |
| *Physical ability* | Easy for most people | Not needed for such a short trip |
| *Car availability* | Not required | Must have a car nearby and be willing to use it |
| *Environmental impact* | Zero emissions | Small emissions |

*Recommendation:* If the weather is pleasant and you don’t need the car for anything else right away, walking is the simplest, fastest, and most eco‑friendly choice. Drive only if you’re dealing with inclement weather, have heavy items to carry, or need the car immediately after the wash.

Do you have any specific constraints (e.g., rain, heavy bags, time pressure) that might affect the decision?

volodia 2026-02-25 02:21 UTC link

We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.

volodia 2026-02-25 02:26 UTC link

There are a few: fast agents, deep research, real-time voice, coding. The other thing is that when you have a fast reasoning model, you can spend more effort on thinking in the same latency budget, which pushes up quality.

techbro92 2026-02-25 02:34 UTC link

Do you think you will be moving towards drifting models in the future for even more speed?

estsauver 2026-02-25 02:36 UTC link

I think there's clearly a "Speed is a quality of its own" axis. When you use Cerebras (or Groq) to develop an API, the turnaround speed of iterating on jobs is so much faster (and cheaper!) than using frontier high-intelligence labs, it's almost a different product.

Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf

volodia 2026-02-25 02:37 UTC link

Would love to hear about your experience. Send us an email.

nowittyusername 2026-02-25 02:39 UTC link

How does the whole kv cache situation work for diffusion models? Like are there latency and computation/monetary savings for caching? is the curve similar to auto regressive caching options? or maybe such things dont apply at all and you can just mess with system prompt and dynamically change it every turn because there's no savings to be had? or maybe you can make dynamic changes to the head but also get cache savings because of diffusion based architecture?... so many ideas...

nl 2026-02-25 02:56 UTC link

I had a very odd interaction somewhat similar to how weak transformer models get into a loop:

https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...

What causes this?

ainch 2026-02-25 03:13 UTC link

This understates the possible headroom as technical challenges are addressed - text diffusion is significantly less developed than autoregression with transformers, and Inception are breaking new ground.

lostmsu 2026-02-25 03:41 UTC link

Regular models are very fast if you do batch inference. GPT-OSS 20B gets close to 2k tok/s on a single 3090 at bs=64 (might be misremembering details here).
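
The batch figure above is aggregate throughput, which differs from what a single request sees. A rough sketch, assuming tokens are split evenly across the batch and taking the comment's numbers at face value (the commenter notes they may be misremembered):

```python
def per_stream_tok_s(aggregate_tok_s: float, batch_size: int) -> float:
    """Approximate tokens/second seen by one request in a batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return aggregate_tok_s / batch_size


print(per_stream_tok_s(2000, 64))  # 31.25
```

So a 2k tok/s batch at bs=64 would help total throughput far more than it helps the latency of any one interactive session.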

nylonstrung 2026-02-25 03:59 UTC link

I changed my mind: this would be perfect for a fast edit model ala Morph Fast Apply https://www.morphllm.com/products/fastapply

It looks like they are offering this in the form of "Mercury Edit" and I'm keen to try it

bcherry 2026-02-25 06:57 UTC link

you mention voice ai in the announcement but I wonder how this works in practice. most voice AI systems are bound not by full response latency but just by time-to-first-non-reasoning-token (because once it heads to TTS, the output speed is capped at the speed of speech and even the slowest models are generating tokens faster than that once they start going).

what do ttft numbers look like for mercury 2? I can see how at least compared to other reasoning models it could improve things quite a bit but i'm wondering if it really makes reasoning viable in voice given it seems total latency is still in single digit seconds, not hundreds of milliseconds
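
The point about voice latency can be illustrated with a toy model: the listener's wait depends on time to first token, and decode speed only matters if it falls below the speaking rate. All numbers and function names below are illustrative assumptions, not Mercury 2 measurements.

```python
def time_to_first_audio_s(ttft_s: float, tts_startup_s: float) -> float:
    """Delay before the listener hears anything."""
    return ttft_s + tts_startup_s


def generation_outpaces_speech(decode_tok_s: float, speech_tok_s: float) -> bool:
    """True if tokens arrive at least as fast as the TTS voice consumes them."""
    return decode_tok_s >= speech_tok_s


SPEECH_TOK_S = 4.0   # assumed speaking rate expressed in tokens/second
TTS_STARTUP_S = 0.2  # assumed TTS warm-up time

for label, ttft_s, decode_tok_s in [
    ("slow decode", 0.4, 30.0),
    ("fast decode", 0.4, 1000.0),
]:
    print(
        label,
        time_to_first_audio_s(ttft_s, TTS_STARTUP_S),
        generation_outpaces_speech(decode_tok_s, SPEECH_TOK_S),
    )
```

With equal time to first token, both rows report the same wait and both keep ahead of speech, which is why the question about ttft matters more here than raw tokens per second.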

mynti 2026-02-25 09:18 UTC link

I always wondered how these models would reason correctly. I suppose they are diffusing fixed blocks of text for every step and after the first block comes the next and so on (that is how it looks in the chat interface anyways). But what happens if at the end of the first block it would need information about reasoning at the beginning of the first block? Autoregressive Models can use these tokens to refine the reasoning but I guess that Diffusion Models can only adjust their path after every block? Is there a way maybe to have dynamic block length?

jakubtomanik 2026-02-25 10:18 UTC link

Intelligence per second is a great metric. I never could fully articulate why I like Gemini 3 Flash but this is exactly why. It’s smart enough and unbelievably fast. Thanks for sharing this

storus 2026-02-25 10:52 UTC link

I'd say using them as draft models for some strong AR model, speeding it up 3x. Diffusion generates a bunch of tokens extremely fast, those can be then passed over to an AR model to accept/reject instead of generating them.
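
The accept/reject idea resembles greedy speculative decoding. The toy sketch below accepts draft tokens while they match the target model's greedy choice and substitutes the target's token at the first mismatch. `verify_draft` and `toy_target` are hypothetical; a real system would score all draft positions in one target forward pass rather than calling the target token by token.

```python
from typing import Callable, Sequence


def verify_draft(
    prefix: list[str],
    draft: Sequence[str],
    target_next: Callable[[list[str]], str],
) -> list[str]:
    """Accept matching draft tokens; on the first mismatch, emit the target's token and stop."""
    accepted: list[str] = []
    context = list(prefix)
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        context.append(tok)
    return accepted


# Toy "target model": always continues a fixed reference sentence.
REFERENCE = "the quick brown fox jumps".split()


def toy_target(context: list[str]) -> str:
    return REFERENCE[len(context)]


draft = ["quick", "brown", "cat", "jumps"]
print(verify_draft(["the"], draft, toy_target))  # ['quick', 'brown', 'fox']
```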

espadrine 2026-02-25 11:22 UTC link

Interestingly, while it uses diffusion, it generates incorrect information, and it doesn't fix it when later in the text it realizes that it is incorrect:

  > The snail you’re likely thinking of has a different code point:
  > The correct code point for the snail emoji is U+1F40C? No—U+1F40C is SEAHORSE. The snail’s true code point is U+1F40C? Wait, this is still confusing.
  >
  > Let’s set the record straight:
  > 
  >    Snail → U+1F40C? No, that’s wrong.  
  >    Snail → U+1F40C? Still wrong.
  >
  > The snail emoji is actually U+1F40C? I’m mixing things up.
  >
  > Correct snail code point
  >
  > The snail emoji is U+1F40C? No…  
  >
  > After double‑checking the official Unicode tables (Unicode 13.0 and later), the snail is U+1F40C? The…  
  > Given the confusion, the safest answer is: the Unicode Standard defines U+1F40C as “SEAHORSE”. If your device shows a snail, it’s a rendering quirk, not a change in the underlying code point.
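
The code point the model keeps circling can be resolved in both directions with Python's standard `unicodedata` module. This is a quick local check, not part of the original transcript.

```python
import unicodedata

# Code point -> official character name.
print(unicodedata.name("\U0001F40C"))

# Official character name -> code point.
print(f"U+{ord(unicodedata.lookup('SNAIL')):04X}")
```

On the interpreter versions I am aware of, the first lookup gives SNAIL and the second gives U+1F40C, so the model's final claim that U+1F40C is "SEAHORSE" does not match the database.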

Havoc 2026-02-25 15:45 UTC link

Think pretty much anything is going to get an enormous speed boost if the model isn’t undergoing mem latency but is just inherently baked into the circuits asic style

Jianghong94 2026-02-25 19:15 UTC link

Honestly I don't understand why they/any fast-and-error-prone model position themselves as coding agents; my experience tells me that I'd much rather work with a slow-but-correct model and let it run a longer session than handhold a fast-but-wrong model.

Editorial Channel

What the content says

-0.20

Article 12 Privacy

Medium Practice

Editorial

-0.20

SETL

+0.28

Content does not explicitly address privacy, and its framing gives no indication of privacy protections or user consent mechanisms.

Preamble Preamble

Page content is truncated and does not include substantive narrative text addressing human dignity, inherent rights, or principles of justice and freedom.

Article 1 Freedom, Equality, Brotherhood

No observable content addressing equal rights and dignity of all humans.

Article 2 Non-Discrimination

No observable content addressing non-discrimination or freedom from distinction.

Article 3 Life, Liberty, Security

No observable content addressing life, liberty, or personal security.

Article 4 No Slavery

No observable content addressing slavery or servitude.

Article 5 No Torture

No observable content addressing torture or cruel treatment.

Article 6 Legal Personhood

No observable content addressing recognition as person before law.

Article 7 Equality Before Law

No observable content addressing equal protection under law.

Article 8 Right to Remedy

No observable content addressing remedy by competent tribunals.

Article 9 No Arbitrary Detention

No observable content addressing arbitrary arrest or detention.

Article 10 Fair Hearing

No observable content addressing fair hearing and due process.

Article 11 Presumption of Innocence

No observable content addressing presumption of innocence.

Article 13 Freedom of Movement

No observable content addressing freedom of movement.

Article 14 Asylum

No observable content addressing asylum or persecution.

Article 15 Nationality

No observable content addressing nationality or statelessness.

Article 16 Marriage & Family

No observable content addressing marriage or family rights.

Article 17 Property

No observable content addressing property rights.

Article 18 Freedom of Thought

No observable content addressing freedom of thought, conscience, or religion.

Article 19 Freedom of Expression

No observable content addressing freedom of opinion and expression.

Article 20 Assembly & Association

No observable content addressing freedom of assembly or association.

Article 21 Political Participation

No observable content addressing participation in government.

Article 22 Social Security

No observable content addressing social security or welfare.

Article 23 Work & Equal Pay

No observable content addressing labor rights or work.

Article 24 Rest & Leisure

No observable content addressing rest and leisure.

Article 25 Standard of Living

No observable content addressing health, food, or adequate standard of living.

Article 26 Education

No observable content addressing education rights.

Article 27 Cultural Participation

No observable content addressing participation in cultural life.

Article 28 Social & International Order

No observable content addressing social and international order.

Article 29 Duties to Community

No observable content addressing community duties.

Article 30 No Destruction of Rights

No observable content addressing prohibition of destruction of rights.

Structural Channel

What the site does

-0.40

Article 12 Privacy

Medium Practice

Structural

-0.40

Context Modifier

-0.15

SETL

+0.28

Google Analytics tracking (gtag config ID G-HX11RSR827) embedded without explicit cookie consent banner or privacy disclosure visible in provided content. Framer analytics infrastructure integrated. Analytics tracking occurs by default without apparent user opt-in.

Preamble Preamble

No observable structural signals related to preamble values.

Article 1 Freedom, Equality, Brotherhood