1735 points by preek 104 days ago | 1056 comments on HN
I'm sure this is a very impressive model, but gemini-3-pro-preview is failing spectacularly at my fairly basic python benchmark. In fact, gemini-2.5-pro gets a lot closer (but is still wrong).
For reference: gpt-5.1-thinking passes, gpt-5.1-instant fails, gpt-5-thinking fails, gpt-5-instant fails, sonnet-4.5 passes, opus-4.1 passes (lesser claude models fail).
This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks. A lot of people are going to say "wow, look how much they jumped in x, y, and z benchmark" and start making extrapolations about society and what this means for others. Meanwhile, I'm still wondering how they're getting this problem wrong.
edit: I've gotten a lot of good feedback here. I think there are ways I can improve my benchmark.
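A private suite like this doesn't need to be elaborate. Here is a minimal sketch of one way to structure such a harness; the complete helper and the sample task are hypothetical placeholders, not the commenter's actual benchmark:

    # Minimal private-benchmark harness: each case pairs a prompt with a
    # programmatic checker, so grading never relies on eyeballing output.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Case:
        name: str
        prompt: str
        check: Callable[[str], bool]  # validates the model's raw reply

    def check_dedup(answer: str) -> bool:
        # Invented sample task: first-seen-order deduplication.
        return "1 3 2 5" in answer

    CASES = [
        Case(
            name="stable-dedup",
            prompt="Deduplicate [1, 3, 1, 2, 3, 5] preserving first-seen "
                   "order; reply with just the numbers separated by spaces.",
            check=check_dedup,
        ),
    ]

    def complete(model: str, prompt: str) -> str:
        """Placeholder: route to whichever provider SDK you actually use."""
        raise NotImplementedError

    def run(models: list[str]) -> None:
        for model in models:
            for case in CASES:
                ok = case.check(complete(model, case.prompt))
                print(f"{model:<24} {case.name:<16} {'pass' if ok else 'FAIL'}")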
My favorite benchmark is to analyze a very long audio file recording of a management meeting and produce very good notes along with a transcript labeling all the speakers. 2.5 was decently good at generating the summary, but it was terrible at labeling speakers. 3.0 has so far absolutely nailed speaker labeling.
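For readers curious what that kind of test looks like mechanically, here is a minimal sketch using the google-genai SDK; the file name, prompt wording, and model id are illustrative, and the Files API call assumes the SDK's current shape:

    # Sketch: upload a long meeting recording and ask for labeled notes.
    # Assumes the google-genai SDK and an API key in the environment.
    from google import genai

    client = genai.Client()
    audio = client.files.upload(file="management_meeting.mp3")  # hypothetical file

    response = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents=[
            audio,
            "Produce concise meeting notes plus a full transcript. Label "
            "every speaker consistently (Speaker 1, Speaker 2, ...) "
            "throughout the recording.",
        ],
    )
    print(response.text)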
Many can point to a long history of killed products and soured opinions, but you can't deny they've been the great balancing force (often for good) in the industry.
- Gmail vs Outlook
- Drive vs Word
- Android vs iOS
- Work-life balance and high pay vs the low-salary grind of before.
They've done heaps for the industry. I'm glad to see signs of life, particularly in their P/E, which was unjustly low for a while.
It still failed my image identification test ([a photoshopped picture of a dog with 5 legs]... please count the legs), which every other model so far has failed agonizingly, even when I tell them they are failing; they tend to fight back at me.
Gemini 3, however, while still failing, at least recognized the 5th leg, but thought the dog was... well endowed. The 5th leg is clearly a leg, despite being where you would expect the dog's member to be. I'll give it half credit for at least recognizing that there was something there.
Still though, there is a lot of work that needs to be done on getting these models to properly "see" images.
Just generated a bunch of 3D CAD models using Gemini 3.0 to see how it compares in spatial understanding, and it's heaps better than anything currently out there: not only in intelligence but also in speed.
Will run extended benchmarks later, let me know if you want to see actual data.
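The comment doesn't say which CAD toolchain was used; one common pattern is prompting for OpenSCAD source and rendering it with the openscad CLI, sketched here under that assumption:

    # Sketch: ask the model for OpenSCAD source, save it, render it to STL.
    # Assumes a local OpenSCAD installation; the part choice is illustrative.
    import subprocess
    from google import genai

    client = genai.Client()
    resp = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents="Write OpenSCAD code for an M8 wing nut. Reply with code only.",
    )

    with open("part.scad", "w") as f:
        f.write(resp.text)

    subprocess.run(["openscad", "-o", "part.stl", "part.scad"], check=True)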
Feels like the same consolidation cycle we saw with mobile apps and browsers is playing out here. The winners aren’t necessarily those with the best models, but those who already control the surface where people live their digital lives.
Google injects AI Overviews directly into search, X pushes Grok into the feed, Apple wraps "intelligence" into Maps and on-device workflows, and Microsoft is quietly doing the same with Copilot across Windows and Office.
Open models and startups can innovate, but the platforms can immediately put their AI in front of billions of users without asking anyone to change behavior (not even typing a new URL).
If you are transferring a conversation trace from another model, ... to bypass strict validation in these specific scenarios, populate the field with this specific dummy string:
"thoughtSignature": "context_engineering_is_the_way_to_go"
I am personally impressed by the continued improvement in ARC-AGI-2, where Gemini 3 got 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test the kind of thing that humans intuit because of millions of years of evolution, but these concepts do not necessarily appear in written form (or when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models can keep getting better at this task given the setup of training is mind-boggling to me.
Well, I tried a variation of a prompt I was messing with in Flash 2.5 the other day in a thread about AI-coded analog clock faces. Gemini Pro 3 Preview gave me a result far beyond what I saw with Flash 2.5, and got it right in a single shot.[0] I can't say I'm not impressed, even though it's a pretty constrained example.
> Please generate an analog clock widget, synchronized to actual system time, with hands that update in real time and a second hand that ticks at least once per second. Make sure all the hour markings are visible and put some effort into making a modern, stylish clock face. Please pay attention to the correct alignment of the numbers, hour markings, and hands on the face.
I have my own private benchmarks for reasoning capabilities on complex problems, and I test them against SOTA models regularly (professional cases from law and medicine).
Anthropic (Sonnet 4.5 Extended Thinking) and OpenAI (Pro models) get halfway-decent results on many cases, while Gemini 2.5 Pro struggled (it was overconfident in its initial assumptions).
So I ran these benchmarks against Gemini 3 Pro, and I'm not impressed. The reasoning is way more nuanced than in their older model, but it still makes mistakes that the other two SOTA competitor models don't make. For example, in a law benchmark it forgets that those principles don't apply in the country of the provided case. It seems very US-centric in its thinking, whereas the Anthropic and OpenAI pro models seem more aware of the assumed cultural context of the case. All in all, I don't think this new model is ahead of the other two main competitors, but it has a new nuanced touch and is certainly way better than Gemini 2.5 Pro (which says more about how bad that one actually was for complex problems).
Out of curiosity, I gave it the latest Project Euler problem, published on 11/16/2025 and very likely outside its training data.
Gemini thought for 5m10s before giving me a Python snippet that produced the correct answer. The leaderboard says that the three fastest humans to solve this problem took 14 min, 20 min, and 1 h 14 min respectively.
Even though I expect this sort of problem to very much be in the distribution of what the model has been RL-tuned to do, it's wild that a frontier model can now solve in minutes what would take me days.
Has anyone who is a regular Opus / GPT5-Codex-High / GPT5 Pro user given this model a workout? Each Google release is accompanied by a lot of devrel marketing that sounds impressive, but whenever I put the hours into evaluating it myself it comes up lacking. Would love to hear that it replaces another frontier model for someone who is not already bought into the Gemini ecosystem.
I have "unlimited" access to both Gemini 2.5 Pro and Claude 4.5 Sonnet through work.
From my experience, both are capable and can solve nearly all the same complex programming requests, but time and time again Gemini spits out reams and reams of over-engineered code that totally works but that I would never want to have to interact with.
When looking at the code, you can't tell why it looks "gross", but then you ask Claude to do the same task in the same repo (I use Cline, it's just a dropdown change) and the code also works, but there's a lot less of it and it has a more "elegant" feeling to it.
I know that isn't easy to capture in benchmarks, but I hope Gemini 3.0 has improved in this regard
I love it that there's a "Read AI-generated summary" button on their post about their new AI.
I can only expect that the next step is something like "Have your AI read our AI's auto-generated summary", and so forth until we are all the way at Douglas Adams's Electric Monk:
> The Electric Monk was a labour-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself; video recorders watched tedious television for you, thus saving you the bother of looking at it yourself. Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
I was sorting out the right way to handle a medical thing and Gemini 2.5 Pro was part of the way there, but it lacked some necessary information. Got the Gemini 3.0 release notification a few hours after I was looking into that, so I tried the same exact prompt and it nailed it. Great, useful, actionable information that surfaced actual issues to look out for and resolved some confusion. Helped work through the logic, norms, studies, standards, federal approvals and practices.
Very good. Nice work! These things will definitely change lives.
I spent years building a compiler that takes our custom XML format and generates an app for Android or Java Swing. Gemini pulled off the same feat in under a minute, with no explanation of the format. The XML is fairly self-explanatory, but still.
I tried doing the same with Lovable, but the resulting app wouldn't work properly, and I burned through my credits fast while trying to nudge it into a usable state. This was on another level.
> This is a reminder that benchmarks are meaningless – you should always curate your own out-of-sample benchmarks.
Yeah, I have my own set of tests, and the results are a bit unsettling in the sense that sometimes older models outperform newer ones. Moreover, they change even if officially the model doesn't change. This is especially true of Gemini 2.5 Pro, which was performing much better on the same tests several months ago than it does now.
No they’re not. Maybe you mean to say they don’t tell the whole story or have their limitations, which has always been the case.
>>my fairly basic python benchmark
I suspect your definition of “basic” may not be consensus. GPT-5 Thinking is a strong model for basic coding, and it’d be interesting to see a simple Python task it reliably fails at.
I like to ask "Make a pacman game in a single html page". No model has ever gotten a decent game in one shot. My attempt with Gemini3 was no better than 2.5.
Google has always been there; it's just that many didn't realize DeepMind even existed. I said years ago that they needed to be put to commercial use [0], and Google AI != DeepMind.
You are now seeing their valuation finally adjusting to that fact, all thanks to DeepMind finally being put to use.
Using a single custom benchmark as a metric seems pretty unreliable to me.
Even at the risk of teaching future AI the answer to your benchmark, I think you should share it here so we can evaluate it. It's entirely possible you are coming to a wrong conclusion.
They've poisoned the internet with their monopoly on advertising, the air pollution of the online world, which is a transgression that far outweighs any good they might have done. Many of the negative social effects of being online come from the need to drive more screen time, more engagement, more clicks, and more ad impressions firehosed into the faces of users for sweet, sweet advertiser money. When Google finally defeats ad-blocking, yt-dlp, etc., remember this.
AI Overviews have arguably done more harm than good for them, because people assume it's Gemini, but really it's some ultra-lightweight model made for handling millions of queries a minute, and it has no shortage of stupid mistakes/hallucinations.
Perception seems to be one of the main constraints on LLMs that not much progress has been made on. Perhaps not surprising, given perception is something evolution has worked on since the inception of life itself. Likely much, much more expensive computationally than it receives credit for.
What I would do, if I were in the position of a large company in this space, is arrange an internal team to create an ARC replica covering very similar puzzles and use it as part of the training.
Ultimately, most benchmarks can be gamed, and their real utility is thus short-lived.
But I also think it's fair to use any means to beat it.
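As a toy illustration of what an internal ARC replica could mean, here is a hedged sketch that procedurally generates grid tasks with a hidden transformation; the rule set is invented for illustration, and real ARC tasks are far more varied:

    # Toy ARC-style task generator: sample small colored grids and apply a
    # hidden rule; the (input, output) pairs become training puzzles.
    import random

    def random_grid(h: int, w: int, colors: int = 4) -> list[list[int]]:
        return [[random.randrange(colors) for _ in range(w)] for _ in range(h)]

    def mirror_lr(grid):                    # candidate hidden rule 1
        return [row[::-1] for row in grid]

    def recolor(grid, mapping):             # candidate hidden rule 2
        return [[mapping[c] for c in row] for row in grid]

    def make_task(n_examples: int = 3):
        rule = random.choice([
            mirror_lr,
            lambda g: recolor(g, {0: 1, 1: 2, 2: 3, 3: 0}),
        ])
        return [(g := random_grid(3, 3), rule(g)) for _ in range(n_examples)]

    if __name__ == "__main__":
        for inp, out in make_task():
            print(inp, "->", out)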
I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker-detection models to produce an accurate speaker-based transcript, while I'm not sure that Google's models do so; maybe they just hallucinate the speakers instead.
It's an artifact of the problem that they don't show you the reasoning output but need it for further messages, so they save each API conversation on their side and give you a reference number.
It sucks from a GDPR compliance perspective as well as in terms of transparent pricing: you have no way to control reasoning trace length (which is billed at the much higher output rate) other than switching between low/high, but if the model decides to think longer, "low" could result in more tokens used than "high" for a prompt where the model decides not to think that much. "Thinking budgets" are now "legacy", so while you can constrain output length, you cannot constrain cost.
Obviously you also cannot optimize your prompts if some red herring makes the LLM get hung up on something irrelevant, only to realize this in later thinking steps. This will happen with EVERY SINGLE prompt if it's caused by something in your system prompt. Finding what makes the model go astray can be rather difficult with 15k-token system prompts or a multitude of MCP tools; you're basically blinded while trying to optimize a black box. You can try different variations of parts of your system prompt or tool descriptions, but fewer thinking tokens do not necessarily mean better results if those reasoning steps were actually beneficial (if only in edge cases). This would be immediately apparent upon inspection, but it is hard or impossible to find out without access to the full chain of thought.
For the uninitiated, the reasons OpenAI started replacing the CoT with summaries were (A) to prevent rapid distillation, as they suspected DeepSeek did for R1, and (B) to prevent embarrassment if app users see the CoT and find parts of it objectionable/irrelevant/absurd (reasoning steps that make sense for an LLM do not necessarily look like human reasoning). That's a tradeoff that is great for end users but terrible for developers. As open-weights LLMs necessarily output their full reasoning traces, the potential to optimize prompts for specific tasks is much greater and will, for certain applications, certainly outweigh the performance delta to Google/OpenAI.
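For reference, the low/high switch mentioned above looks roughly like this in the google-genai SDK, assuming the newer thinking_level field (older models took an explicit thinking_budget token count instead):

    # Sketch: requesting a coarse reasoning level instead of a token budget.
    from google import genai
    from google.genai import types

    client = genai.Client()
    resp = client.models.generate_content(
        model="gemini-3-pro-preview",
        contents="Summarize the tradeoffs of hidden reasoning traces.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    print(resp.text)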
At this point I'm only using google models via Vertex AI for my apps. They have a weird QoS rate limit but in general Gemini has been consistently top tier for everything I've thrown at it.
Anecdotal, but I've also not experienced any regression in Gemini quality where Claude/OpenAI might push iterative updates (or quantized variants for performance) that cause my test bench to fail more often.
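For context, routing through Vertex AI with the google-genai SDK looks roughly like this; the project and location are placeholders, and the backoff loop is one simple way to absorb the QoS-style rate limits mentioned above:

    # Sketch: Gemini via Vertex AI with naive exponential backoff on 429s.
    import time
    from google import genai
    from google.genai import errors

    client = genai.Client(
        vertexai=True, project="my-project", location="us-central1")

    def ask(prompt: str, retries: int = 5) -> str:
        for attempt in range(retries):
            try:
                resp = client.models.generate_content(
                    model="gemini-3-pro-preview", contents=prompt)
                return resp.text
            except errors.APIError as e:
                if e.code != 429 or attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # wait before retrying
        raise RuntimeError("unreachable")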
To be fair, a lot of the impressive Elo scores models get are simply due to the fact that they're faster: many serious competitive coders could get the same or better results given enough time.
But seeing these results I'd be surprised if by the end of the decade we don't have something that is to these puzzles what Stockfish is to chess. Effectively ground truth and often coming up with solutions that would be absolutely ridiculous for a human to find within a reasonable time limit.