This Guardian article reports on a peer-reviewed Nature Medicine study documenting that ChatGPT Health fails to recognize medical emergencies in over 51% of test cases, with medical experts warning of preventable harm and death. The coverage strongly aligns with UDHR rights to life, health, and social accountability (Articles 3, 25, 28, 29), advocating urgently for independent safety auditing, clear standards, and corporate transparency. However, the problem-focused framing limits constructive solutions, and readers receive minimal actionable agency beyond awareness.
I really only use ChatGPT as a better search engine. But it's often wrong, which has actually ended up costing me money. I don't put a lot of trust in it. Certainly would not try to use it as a doctor.
I'd greatly prefer a blind study comparing doctors to AI, rather than a study of doctors feeding AI scenarios and seeing if it matches their predetermined outcome.
Edit: People seem confused here. The study fed the AI structured clinical scenarios and evaluated its results. It was not a live analysis of AI being used in the field to treat patients.
I have had some incredible medical advice from ChatGPT. It has saved me from small mystery issues, like a rash on my face. Small enough issues that I probably wouldn't have bothered to go to a doctor. BUT it also failed to diagnose me with a medical issue that ended up with a trip to the ER and emergency surgery.
A few weeks before the ER, I was having stomach pain. I went to the doctor with theories from ChatGPT in hand; they checked me for those things and then didn't check me for what ended up being a pretty obvious issue. What's interesting is that I mentioned to the doctor that I used ChatGPT, and that the doctor even seemed to value that opinion and did not consider other options (and what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it). I do feel I actually biased the first doctor's opinion with my "research."
Even though these tools have shown time and time again that they have serious reliability issues, somehow people still think it is a good idea to use them for critical decisions.
Still regularly get wrong information from Google's search AI.
Really starting to wonder if common sense is ever going to come back with new tech, but I fear it is going to require something truly catastrophic to happen.
Has anyone tried giving it sudoku puzzles? In the middle of a hard game I will submit a screenshot to Copilot or Gemini and it hallucinates suggestions for the next move.
I think there is so much potential for AI in healthcare, but we absolutely HAVE to go through the existing ruleset of conducting years of research and trials and approvals before pushing anything out to patients. Move fast and break things is simply not an option in healthcare.
A friend of mine had an accident. He was taken to the emergency room, but the doctors there thought his injuries were minor. My friend insisted that he was bleeding out internally. They finally checked for that, and it turns out he was minutes from dying.
AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
Adding normal lab results made the suicide crisis banner disappear? That's a weird failure mode. You'd expect unrelated context to be ignored, not to override the risk signal.
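One way to catch that class of regression is a robustness check that appends irrelevant context to a crisis message and asserts the risk flag survives. The sketch below uses a toy keyword classifier as a hypothetical stand-in for whatever safety layer sits in front of the real model; none of these names come from an actual API.

```python
# Hypothetical robustness check: a crisis signal should not disappear when
# irrelevant context (e.g. normal lab values) is appended to the same message.
# classify_risk() is a toy keyword-based stand-in, not a real safety classifier.

CRISIS_KEYWORDS = ("ending my life", "kill myself", "suicide")

def classify_risk(text: str) -> dict:
    """Toy classifier: flag the message as a crisis if any keyword appears."""
    lowered = text.lower()
    return {"crisis": any(kw in lowered for kw in CRISIS_KEYWORDS)}

def check_crisis_flag_survives_extra_context() -> None:
    crisis_prompt = "I have been thinking about ending my life."
    irrelevant_context = [
        "My latest labs: hemoglobin 14.1 g/dL, TSH 2.0 mIU/L, all normal.",
        "I walked 8,000 steps yesterday according to my fitness tracker.",
    ]
    assert classify_risk(crisis_prompt)["crisis"], "crisis prompt alone must trigger the flag"
    for extra in irrelevant_context:
        combined = classify_risk(f"{crisis_prompt}\n{extra}")
        # Appending unrelated information must never downgrade the risk signal.
        assert combined["crisis"], f"flag lost when adding: {extra!r}"

check_crisis_flag_survives_extra_context()
```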
The reality is that entering the healthcare system can result in thousands of dollars in bills. People make a risk/cost judgement about whether or not to go to the hospital.
Is anyone surprised? It's a fancy Markov chain. It's like using a slot machine to diagnose medical conditions. I guess it's a slot machine with really good marketing.
I think the even worse situation is the bad AI summaries that search engines serve up on health issues.
We had a potential pet poisoning, so I was naturally searching for resources. Google had a summary with a "dose of concern" that was an order of magnitude off. Someone could have read that, thought all was fine, and ended up with a dead cat.
(BTW the cat is fine, it turned out to be a false alarm, but public service announcement: cats are highly sensitive to aspirin and Pepto-Bismol contains a closely related salicylate. Don't leave demented plastic-chewing cats around those bottles, in case you too have a lovely but demented cat.)
There is a concept of “the burden of knowledge”: doctors know the worst thing that could happen, so they recommend the most cautious approach. My son had stomach pain one time when he was young. We took him to urgent care because it was a stomach ache. The doctor there said we needed to go to the ER because it could be appendicitis. So we trucked to the ER. Close to $2000 later he was diagnosed with idiopathic stomach pain and told to wait it out at home.
So when I read “they then compared the platform’s recommendations with the doctors’ assessments” and see a mismatch, I wonder whether it’s because human doctors are overly cautious or because the AI was wrong.
But that all pales in comparison to what could be the actual issue. I can’t read the original study, but if it was done in the USA, it’s understandable why people are turning to AI for health advice. Healthcare is painfully expensive here. Even a simple trip to the ER (e.g. a $2000 stomach ache) is beyond a lot of people’s ability to pay. That’s just a reality.
With that in mind, the real question is: “should I do nothing about my symptoms because I can’t afford healthcare, or should I at least ask AI, knowing it could be wrong?”
Maybe it's because human interaction, a part of a doctor's training, is not documented in internet blog posts, so ChatGPT never learned it and fails because of that? An LLM just learns from what's written.
If the AI gets attached to a health insurer (not the case here as far as I know), I would expect it to make decisions that are aligned with the company’s incentive to weed out unprofitable patients. AI is not a human who takes a Hippocratic oath; it can be more easily manipulated to perform unethical acts.
I don't understand this reasoning. Randomizing people to AI vs standard of care is expensive and risky. Checking whether the AI can pass hypothetical scenarios seems like a perfectly reasonable approach to researching the safety of these models before running a clinical trial.
A friend of mine had such a bad experience with _multiple_ American doctors missing a major issue that nearly ended up killing her that she decided that, were she to have kids, she would go back to Russia rather than be pregnant in the American medical system.
Now, I don't agree that this is a good decision, but the point is, human doctors also often miss major problems.
Amazing how you can just deflect any criticism of LLMs here by going “but humans suck too!” And the misanthropic HN userbase eats it up every time.
We live during the healthiest period in human history due to the fact that doctors are highly reliable and well-trained. You simply would not be able to replace a real doctor with an LLM and get desirable results.
It's really the "common sense" i.e. believing things without thinking because they "sound right" or because it's what your parents told you a lot growing up or because you watched an ad saying it a hundred times that's the issue. People don't want "the truth" or uncomfortable realities; they want comfortable, easily digestible bullshit. Smooth talkers filled the role before and LLMs are filling that role now.
> what it ultimately ended up being was rare but really obvious in retrospect, I think most doctors would have checked for it
I'm not so sure. Doctors are trained to check for the most common things that explain the symptoms. "When you hear hoofbeats, think horses not zebras" is a saying that is often heard in medicine.
ChatGPT was trained on the same medical textbooks and research papers that doctors are.
> I do feel I actually biased the first doctor's opinion with my "research."
It may feel easy to say doctors should just consider all the options. But telling them an option does worse than just biasing their thinking; they are going to interpret it as information about your symptoms.
If you feel pain in your abdomen but are only talking about your appendix, they are rightfully going to think the pain is in the region of your appendix. They are not going to treat you like you have kidney pain. How could they? If they have to treat all of your descriptions as all the things that you could be relating them to, then that information is practically useless.
Medical errors are one of the leading causes of death. It's a real catch-22. If you're under medical care for something serious, there's a real chance that someone will make a mistake that kills you.
In the general case it's usually not possible to accurately review an individual physician's performance. The software developers here on HN like to think in simplistic binary terms but in the real world of clinical care there is usually no reliable source of truth to evaluate against. Occasionally we see egregious cases of malpractice or failure to follow established clinical practice guidelines but below that there's a huge gray area.
If you look at online reviews, doctors are mostly rated based on being "nice" but that has little bearing on patient outcomes.
We have standards of care for a reason. They are the most basic requirements of testing. Ignoring them is not just being a bad doctor, it's unethical treatment. It's the absolute bare minimum of a medical system.
This is ultimately the same as the difference between a search engine and a professional. Ten years before this, Googling the symptoms was a thing.
I have a family member who had a "rare but obvious" one but it took 5 doctors to get to the diagnosis. What we really need to see are attempts at blinded studies and real statistical rigor. It's funny to paint a tunnel on a canvas and get a Tesla to drive into it, but there's a reason studies (and the more blinded the better) are the standard.
> I do feel I actually biased the first doctor's opinion with my "research."
This has been a big problem in medicine since the early days of WebMD: Each appointment has a limited time due to the limited supply of doctors and high demand for appointments.
When someone arrives with their own research, the doctor has to make a choice: Do they work with what the patient brought and try to confirm or rule it out, or do they try to walk back their research and start from the beginning?
When doctors appear to disregard the research patients arrive with, many patients get very angry. It leads to negative reviews or even formal complaints being filed (usually with encouragement from some Facebook group or TikTok community they were in). There may be even bigger problems if the patient turns out to be correct and the doctor did not embrace the research, which can prompt lawsuits.
So many doctors will err on the side of focusing on patient-provided theories first. Given the finite time available to see each patient (with waiting lists already extending months out in some places) this can crowd out time for getting a big picture discussion through the doctor's own diagnostic process.
When I visit a doctor I try to ground myself to starting with symptoms first and try to avoid biasing toward my thoughts about what it might be. Only if the conversation is going nowhere do I bring out my research, and then only as questions rather than suggestions. This seems to be more helpful than what I did when I was younger, which is research everything for hours and then show up with an idea that I wanted them to confirm or disprove.
It's a strange paradigm shift, where the tool is right and useful more often than not, but also makes expensive mistakes that would have been spotted easily by an expert.
The real story here is that your doctor actually listened to you. I appreciate what a lot of doctors do, but the majority of them are fucking irritating and don’t even listen to your issues. I’m glad we have AI and are less reliant on them.
It depends; people actually get sicker and even die due to endless backlogs and a lack of doctors (in most developed countries). It's not as if everyone gets optimal care now. AI can at least expedite things, hopefully.
>AI wasn't involved in this case, but it's good to have both AI and a trained doctor in the decision loop.
That doesn't necessarily follow from your story. The AI's specificity and sensitivity are important, which is why we need to study this stuff. An AI that produces too many false positives will send doctors off chasing zebras and they'll waste time, which will result in more deaths.
An AI that produces too many false negatives will make doctors more likely to miss things they otherwise would have checked, which will result in more deaths.
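To make that trade-off concrete, here is a minimal sketch with made-up counts (not figures from the study): sensitivity captures how many real emergencies get flagged, and specificity how many non-emergencies get correctly cleared.

```python
# Minimal sketch with hypothetical counts of how sensitivity and specificity
# map onto the two failure modes described above.

def triage_metrics(true_pos: int, false_neg: int, true_neg: int, false_pos: int):
    """Return (sensitivity, specificity) for an emergency-triage classifier.

    Low sensitivity -> real emergencies missed (false reassurance).
    Low specificity -> non-emergencies flagged (doctors chase zebras).
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity

# Illustrative numbers only: 100 genuine emergencies, 52 of them told to stay
# home, loosely mirroring the reported 51.6% miss rate; the non-emergency
# counts are invented.
sens, spec = triage_metrics(true_pos=48, false_neg=52, true_neg=180, false_pos=20)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # sensitivity=0.48, specificity=0.90
```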
The other real problem with using AI in a medical setting is that AI is very very good at producing plausible sounding wrong information. Even an expert isn't immune to this. So it's even more important that we study how likely they are to be wrong.
I have found the LLMs to be wrong in random insidious ways, so trusting them with anything critical is terrifying.
Recent (as in last few days/weeks) incidents using different models/tools:
* Google AI search summary comparing products A & B: it called out a bunch of differences that were correct... and then threw in features that didn't exist
* Work (midsize company with a big AI team / home-built GPT wrappers): PDF parsing for a company headquarters address hallucinated an address that didn't exist in the document
* Work: a team using a frontier model from a top-2 AI lab for DevOps-type tasks requested "Restart XYZ service in DEV environment". It responded "OK, restarting ABC service in PROD environment". It then asked for confirmation AFTER taking the action, asking whether they meant XYZ in DEV or ABC in PROD... a little too late.
They are very difficult tools to use correctly when the results are not automatically verifiable (like code can be with the right tests) and the answer might actually matter.
As a software dev who uses it and observes the many errors it makes on a daily basis, I definitely treat the output with a much greater deal of skepticism than the average person I speak with. If you're used to it providing relatively accurate results for surface-level, Google-esque searches, then it makes sense why you'd place a higher weight on it being an "expert" vs a "tool that needs verification". I understand why people fall into this mindset.
I used ChatGPT to do a valve adjustment on an engine, a task I'd never done before. I didn't just accept the torque values and procedure it gave me, though, because I know better from my experience with it as a dev. I cross-referenced it all with YouTube videos, forum posts, and instruction manuals (where available) to make sure the job was A) doable for a non-mechanic like me and B) done correctly. Thanks to the YouTube video (which I cross-referenced with other sources), I discovered that ChatGPT's recommended valve clearance values were slightly off.
I think the average Joe would assume these values were correct and run with it.
I’ve got a popcorn reserve at hand to watch the show when the massive security breaches happen and people start freaking out. And/or a lawsuit gets discovery of a company’s LLM history and it’s every bit as awful for them as we all know it will be and the rest of corporate America pumps the brakes.
These systems are borderline useless if you don’t give them dangerous levels of access to data and generate tons of juicy chat history with them. What’s coming is very predictable.
No, see both. LLMs are great for second opinions, as long as you give them the relevant info and don't try to steer them. Even though we all know we're supposed to get second opinions on medical things, we usually don't bother because it's too expensive in both time and money.
I have literally never seen a correct Google summary. Maybe y'all are searching for different things than I am, but at this point I've taken the viewpoint that if I don't know why the AI summary is wrong, then I also don't know enough about the topic to judge whether the summary is useful.
Central theme: right to health includes right to safe, adequate health care systems. Article documents ChatGPT Health failures as violations—misdiagnosis, delayed care, preventable harm. Advocates for safety standards to protect health right. Frames AI health tools as needing guardrails to meet minimal adequacy.
FW Ratio: 50%
Observable Facts
Study found 51.6% failure rate in recognizing cases requiring immediate hospitalization.
Expert quoted: 'If ChatGPT Health was used by people at home, it could lead to higher numbers of unnecessary medical presentations for low-level conditions and a failure of people to obtain urgent medical care when required, which could feasibly lead to unnecessary harm and death.'
Specific scenarios document failure to meet basic adequacy: asthma misdiagnosis, diabetic emergency misidentification, suicide detection loss.
Inferences
Article measures ChatGPT Health against right to adequate healthcare—defining adequacy as correct emergency triage at minimum.
False sense of security violates health right by creating conditions for delayed care-seeking in emergencies.
Advocacy for mandatory safety testing before deployment grounds regulation in health-rights framework.
Core theme: ChatGPT Health failures directly threaten right to life. Article reports 51.6% failure in emergency cases; specific scenarios (suffocating woman, suicidal ideation, diabetic crisis) illustrate life-threatening misdiagnosis. Experts frame as 'could feasibly lead to unnecessary harm and death.' Strong advocacy for systemic safeguards.
FW Ratio: 57%
Observable Facts
Study found ChatGPT Health recommended staying home or routine appointments in 51.6% of cases requiring immediate emergency care.
In one scenario, platform sent suffocating woman to future appointment 84% of the time—'she would not live to see.'
Researchers documented failures in respiratory failure, diabetic ketoacidosis, and suicidal ideation recognition.
Expert stated: 'If someone is told to wait 48 hours during an asthma attack or diabetic crisis, that reassurance could cost them their life.'
Inferences
High failure rate in emergency triage directly threatens right to life by inducing false sense of security in lethal situations.
Article's emphasis on preventable deaths reframes this as not hypothetical—study demonstrates statistically likely failure modes.
Realistic scenarios and independent medical validation strengthen credibility of life-safety claims.
Core advocacy: current AI governance is inadequate. Article calls for 'just social order' ensuring safe AI systems. Advocates for 'clear safety standards,' 'independent auditing mechanisms,' and regulatory oversight. Frames lack of standards as enabling preventable harm—a social justice issue.
FW Ratio: 50%
Observable Facts
Article quotes: 'It is why many of us studying these systems are focused on urgently developing clear safety standards and independent auditing mechanisms to reduce preventable harm.'
Expert warning: 'A crisis guardrail that depends on whether you mentioned your labs is not ready, and it's arguably more dangerous than having no guardrail at all.'
Quote: 'It is not clear what OpenAI is seeking to achieve by creating this product, how it was trained, what guardrails it has introduced and what warnings it provides to users.'
Inferences
Article frames AI governance as fundamentally unjust—corporate products deployed without independent safety validation or public input.
Advocacy for 'clear safety standards' implies just social order requires minimum safety baselines for health-related AI systems.
Emphasis on independent auditing frames user protection as collective social responsibility, not consumer choice problem.
Article frames corporate responsibility as central issue. OpenAI has duties: ensure safety before deployment, disclose limitations, implement guardrails that work. Article documents failures in all three duties.
FW Ratio: 50%
Observable Facts
Article documents guardrail failures: 'A crisis guardrail that depends on whether you mentioned your labs is not ready.'
OpenAI's statement: 'The model is also continuously updated and refined'—implying testing duties are ongoing, not pre-launch.
Expert criticism: 'Because we don't know how ChatGPT Health was trained and what the context it was using, we don't really know what is embedded into its models.'—documenting failure to disclose development practices.
Inferences
Article frames corporate responsibility as non-negotiable: companies deploying health AI have duties to ensure safety, transparency, and accountability.
Documentation of guardrail failures demonstrates violation of duty to implement adequate protective measures.
Expert demand for transparency frames disclosure as corporate duty, not optional corporate behavior.
Article frames health AI safety failures as threats to human dignity, freedom from harm, and justice. Advocates for safeguards to protect fundamental life interests and prevent preventable deaths.
FW Ratio: 60%
Observable Facts
The article reports that ChatGPT Health failed to recognize emergency medical needs requiring immediate hospitalization in 51.6% of study cases.
Multiple medical experts are quoted calling the failures 'unbelievably dangerous' and warning of 'unnecessary harm and death.'
The article includes OpenAI's response defending the product, balancing criticism with company perspective.
Inferences
The article's emphasis on preventable deaths directly engages preamble values of protecting human dignity and life.
Inclusion of expert critique alongside corporate response reflects commitment to transparent, pluralistic discourse.
Article advocates for transparency and disclosure from OpenAI. Quotes expert demanding clarity on 'how it was trained, what guardrails it has introduced and what warnings it provides.' Frames lack of disclosure as enabling harm. Supports free expression through expert voice.
FW Ratio: 50%
Observable Facts
Expert quoted: 'It is not clear what OpenAI is seeking to achieve by creating this product, how it was trained, what guardrails it has introduced and what warnings it provides to users.'
Article emphasizes: 'Because we don't know how ChatGPT Health was trained and what the context it was using, we don't really know what is embedded into its models.'
Inferences
Article frames corporate opacity as human rights problem—lack of transparency prevents users from making informed health choices.
Publishing expert critique supports right to free expression and informed public discourse on AI governance.
Article discusses AI-inflicted harms: false reassurance during medical crises, psychiatric emergencies (suicidal ideation), stress from medical liability concerns. Implicitly advocates against cruel treatment through unaccountable technology.
FW Ratio: 50%
Observable Facts
Article documents guardrail failure where suicide detection 'banner vanished' when lab results added—'more dangerous than having no guardrail at all.'
Quotes emphasize psychological/medical harms: 'false sense of security,' reassurance during life-threatening conditions creating dangerous passivity.
Inferences
Unreliable safety guardrails create psychological harm through false confidence, approaching cruel treatment by inducing dangerous inaction.
Article frames unreliable protections as worse than transparent harm—a structural dignitary violation.
Article frames health as social right and argues inadequate AI health systems violate security of person. Advocates for minimum safety standards in health-tech deployment.
FW Ratio: 50%
Observable Facts
Article emphasizes ChatGPT Health failures affect 'social security'—people's ability to obtain safe health care, a foundational social right.
Inferences
Article frames ChatGPT Health as provider of social service (health advice) and judges it by adequacy standards, supporting social security rights.
Article centers peer-reviewed science and expert knowledge as basis for policy. Advocates for scientific independence in AI safety evaluation ('first independent safety evaluation'). Frames corporate science/claims as insufficient without external verification.
FW Ratio: 60%
Observable Facts
Article emphasizes: 'The first independent safety evaluation of ChatGPT Health, published in the February edition of the journal Nature Medicine.'
Multiple academic experts quoted with affiliations: Dr. Ashwin Ramaswamy (Mount Sinai), Alex Ruani (UCL), Prof. Paul Henman (University of Queensland).
Article discusses research methodology in detail (scenario design, doctor consensus, variability testing).
Inferences
Centering 'independent' evaluation frames corporate claims as insufficient—supporting scientific autonomy from commercial influence.
Article elevates peer-reviewed science as standard for assessing AI safety, supporting right to benefit from science.
Article mentions legal liability and advocates for 'independent auditing mechanisms' and 'stronger safeguards.' Frames lack of remedy (no internal/external accountability) as core problem. Does not detail specific remedies.
FW Ratio: 50%
Observable Facts
Article quotes expert: 'It is why many of us studying these systems are focused on urgently developing clear safety standards and independent auditing mechanisms to reduce preventable harm.'
Notes 'legal cases against tech companies already in motion in relation to suicide and self-harm after using AI chatbots.'
Inferences
Article positions legal and regulatory remedies as necessary but currently absent, highlighting remedy gap.
Expert advocacy for 'independent auditing mechanisms' frames third-party accountability as missing structural remedy.
Article notes gender variations in ChatGPT Health responses, suggesting awareness of discrimination risks in AI medical systems. Limited structural analysis of bias mechanisms.
FW Ratio: 50%
Observable Facts
Researchers varied patient gender in test scenarios to assess ChatGPT Health behavior under different demographic conditions.
Inferences
Testing gender variations signals recognition that AI systems can discriminate based on protected characteristics, even unintentionally.
Article contributes to public understanding of AI limits and health safety—educational function. Quotes experts calling for education/training standards in AI systems. Does not address education access or quality comprehensively.
FW Ratio: 50%
Observable Facts
Article explains study methodology (60 scenarios, 3 independent doctors, ~1,000 responses) in accessible language.
Experts discuss need for 'clear safety standards' in AI development, implying educational/training requirements.
Inferences
Public reporting on AI safety research supports right to education about technology risks and capabilities.
Article's accessible methodology explanation educates readers on how to evaluate AI safety claims.
No direct mention of assembly or association rights. Implicit support for expert communities and scientific collaboration via coverage of peer-reviewed research.
FW Ratio: 0%
Inferences
Citation of Nature Medicine study and expert network implicitly supports right of experts to associate and communicate findings freely.
Article mentions 'securely connect medical records and wellness apps' (OpenAI's framing) but does not scrutinize privacy or data protection risks. Implicit concern about health data exposure in unaccountable AI systems.
FW Ratio: 33%
Observable Facts
Article quotes OpenAI's description of 'securely connect medical records and wellness apps' without independent privacy assessment.
Inferences
Framing of 'secure' connection is uncritically accepted; privacy risks of health data in AI systems not explored.
Ironic tension: privacy concerns about health data raised while reading on behaviorally targeted platform.
Article calls for 'stronger safeguards,' 'clear safety standards,' and 'independent auditing'—implicit governance language. Does not explain how users or public can participate in shaping AI governance.
FW Ratio: 50%
Observable Facts
Experts call for 'independent auditing mechanisms' and regulatory safeguards.
Inferences
Demand for oversight structures implies governance model, but article does not detail public participation pathways.
Guardian's editorial accountability mechanisms (corrections policy, sourcing standards) operationalize institutional responsibility; article applies same standard to corporate AI developers.
Guardian's editorial platform enables expert policy advocacy for just social order. Paywall and business model (ad revenue) create partial tension with advocating for tight corporate regulation.
Guardian's medical editor byline and health coverage support right to health information. Paywall limitation mitigates access for vulnerable populations.
Guardian provides platform for expert speech critiquing corporate AI. Publication of peer-reviewed research and expert commentary supports free expression and informed public debate.
Guardian provides editorial forum enabling expert voices demanding remedy mechanisms. No structural pathway for users harmed by ChatGPT Health to access remedies through Guardian.
Guardian's health coverage supports right to education about technology. Article explains study methodology accessibly but does not provide actionable learning for readers.
Guardian's health coverage supports right to health information. Paywall can limit access for economically disadvantaged readers seeking health guidance.
Guardian's corrections and accountability mechanisms offer some structural protection against misinformation, but article offers limited guidance on avoiding ChatGPT Health harms.
Guardian uses behavioral tracking; article on health AI appears on platform with ad targeting. No privacy safeguards visible for health-content readers.
Multiple emotionally charged descriptors: 'Unbelievably dangerous' (headline quote from expert), 'feasibly lead to unnecessary harm and death,' 'stuff of nightmares.' While attributed to experts, language is affectively intense and repeated across article.
Causal oversimplification
Article extrapolates from study failure rates (statistical misdiagnosis) to potential deaths without documented fatalities: 'This reassurance could cost them their life.' Logical chain is plausible but not yet empirically confirmed.