“The question in healthcare AI is never simply whether the AI is better than the human. Often it is. The question is whether the system in which the AI is embedded — the clinical workflow, the oversight mechanism, the accountability structure — is designed to capture the AI’s benefits while preventing its failure modes from reaching patients. That system design question is the one that is not being answered fast enough.”
Neal Lloyd · Inside The Machine, Day 22
In January 2020, a landmark study published in Nature showed that an AI system trained by DeepMind could detect breast cancer in mammograms with greater accuracy than two radiologists working in combination, reducing both false positives and false negatives simultaneously. The study was peer-reviewed, pre-registered, and conducted across a large independent dataset. The finding was real. It also reflects a genuinely important truth about the broader healthcare AI landscape: in well-defined domains with high-quality training data and clear diagnostic criteria, AI systems can outperform human clinical specialists at specific tasks. This is not a hypothetical future capability. It is a documented present reality. It is also only part of the story, and the part of the story that gets the most attention is not always the part that most urgently needs it. This is Day 22 of Inside The Machine, and today we look at AI in healthcare with the same clarity we have tried to bring to every other domain in this series: the genuine achievements, the genuine failure modes, and the governance gap in between.
The Applications With Real Clinical Evidence Behind Them
Medical imaging and diagnostics. The strongest clinical evidence for AI in healthcare is in medical imaging: the analysis of X-rays, CT scans, MRIs, and pathology slides. AI systems have demonstrated clinically significant performance in detecting diabetic retinopathy (a leading cause of blindness), lung cancer (earlier detection at smaller nodule sizes), breast cancer, and skin cancer, and in estimating cardiovascular disease risk from retinal photographs. The performance advantage is not universal. In some settings and for some demographic groups, AI systems perform no better than specialists, or worse. But in the highest-quality studies, AI diagnostic tools are demonstrably adding clinical value. Google Health’s work on retinal imaging for cardiovascular risk prediction is a particularly striking example: the system can detect markers of heart disease in an eye scan that was never intended to reveal them, because it learned the statistical relationship from millions of training examples. No human clinician learned to do this from the same data because no human clinician could process millions of training examples in a career.
Drug discovery. AI is accelerating pharmaceutical development in ways that are beginning to produce real clinical outcomes. Insilico Medicine received FDA approval in 2024 for a clinical trial of a drug discovered entirely by AI — from target identification through molecule design — in approximately eighteen months. The conventional timeline for the same process is typically five to ten years. AlphaFold’s protein structure prediction capability, now widely integrated into pharmaceutical research workflows, has dramatically reduced the time and cost of identifying viable drug targets for diseases that have previously been undruggable because their protein structures were unknown. These are not incremental improvements. They are order-of-magnitude changes in the economics and timelines of drug development.
Clinical documentation. The application of AI to the administrative burden of clinical documentation is the healthcare AI use case with the most immediate impact on physician wellbeing and, indirectly, on patient care. Physician burnout — substantially driven by the hours spent on electronic health record documentation — is one of the most significant drivers of healthcare workforce attrition. AI ambient documentation systems — tools like Nuance DAX, Suki, and their equivalents — listen to clinical encounters and generate accurate clinical notes automatically, giving back to physicians the two to three hours per day they previously spent on documentation. The early clinical studies are striking: physicians using AI documentation report significantly lower burnout scores, spend more time with patients, and produce notes that are equally or more accurate than those they wrote manually. This is the healthcare AI application that is already deployed at significant scale and that most clinicians who have used it describe as genuinely transformative.
- Diabetic retinopathy detection: AI matches or exceeds specialist performance in multiple peer-reviewed studies.
- Breast cancer screening: DeepMind study shows AI reduces false positives AND false negatives vs two radiologists in combination.
- Lung cancer detection: AI identifies earlier-stage nodules missed by radiologists in head-to-head studies.
- Drug discovery: Insilico Medicine, first fully AI-discovered drug in FDA clinical trial, 18-month timeline vs 5-10 year conventional.
- AlphaFold: protein structure prediction now integrated into workflows at all major pharma companies.
- Clinical documentation: 2-3 hours per day returned to physicians. Burnout scores decrease significantly in studies. Patient time increases.
These results are real. They are also domain-specific. The failure modes, when they occur, are in different domains.
The Failure Modes That Exist Alongside the Achievements
Demographic bias in diagnostic AI. The most serious accuracy risk in diagnostic AI is not that a system produces a completely wrong answer. It is the subtler and more dangerous risk that a system produces systematically wrong answers for specific patient groups. Training datasets for medical AI systems have historically underrepresented women, Black patients, elderly patients, and patients with multiple comorbidities. AI systems trained on these datasets learn to perform well on the overrepresented groups and less well on the underrepresented ones, who are often the patients already underserved by the healthcare system. The result is that AI systems can amplify existing healthcare disparities rather than reduce them, performing less accurately for the patients who most need accurate diagnosis.
Overreliance and automation bias. When AI systems are introduced into clinical workflows, there is a documented tendency for clinicians to defer to AI recommendations even when their own clinical judgment suggests a different conclusion — a phenomenon called automation bias. This is the inverse of the problem highlighted in Day 12’s accountability discussion: instead of an AI overriding a human, a human is overriding their own judgment because the AI’s confidence is high. When the AI is right, automation bias is harmless. When the AI is wrong, automation bias converts an AI error into a clinical error that might otherwise have been caught by the clinician’s independent judgment. The presence of a confident AI recommendation in the clinical workflow can, counterintuitively, make the overall system less safe than the unassisted clinician in specific failure modes.
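The arithmetic behind that last claim can be sketched in a few lines of Python. This is a deliberately simplified model, not a clinical one: it assumes the clinician either accepts the AI output wholesale (with probability `deference`) or decides independently, and every number below is hypothetical, chosen only for illustration.

```python
def system_error_rate(ai_error, clinician_error, deference):
    """Error rate of the combined human + AI workflow under a simple mixture model.

    Assumption (ours, for illustration): with probability `deference` the
    clinician accepts the AI output as-is; otherwise they rely entirely on
    their own judgment. All arguments are probabilities in [0, 1].
    """
    for name, value in (("ai_error", ai_error),
                        ("clinician_error", clinician_error),
                        ("deference", deference)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return deference * ai_error + (1.0 - deference) * clinician_error


# Hypothetical figures: the clinician alone is wrong on 10% of cases.
unassisted = 0.10

# Case type where the AI is strong (2% error) and the clinician defers 80% of the time.
assisted_strong = system_error_rate(ai_error=0.02, clinician_error=unassisted, deference=0.8)

# Case type where the AI is weak (30% error), with the same level of deference.
assisted_weak = system_error_rate(ai_error=0.30, clinician_error=unassisted, deference=0.8)

print(f"unassisted clinician:      {unassisted:.3f}")
print(f"assisted, AI strong cases: {assisted_strong:.3f}")
print(f"assisted, AI weak cases:   {assisted_weak:.3f}")
```

Under these assumed numbers the assisted workflow is safer where the AI is strong and less safe than the unassisted clinician where the AI is weak. The level of deference is the same in both cases; what changes is whether the thing being deferred to deserves it.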
LLM hallucination in clinical contexts. As covered in Day 16, large language models hallucinate — generating confident, fluent, wrong information — in ways that are particularly dangerous in medical contexts. AI clinical decision support tools that use LLMs under the hood can generate drug dosage recommendations, drug interaction warnings, and differential diagnosis suggestions that are plausible-sounding and incorrect. The hallucination risk discussed in the legal context — wrong case citations delivered with confidence — is at least as serious in the medical context. A wrong drug dosage delivered with clinical confidence, in a workflow where the clinician has been conditioned by automation bias to trust AI recommendations, is a patient harm event waiting to happen.
The access and equity gap. The most sophisticated AI diagnostic tools are being deployed primarily in well-resourced healthcare systems in high-income countries, often in the specific settings that were already best resourced. The radiology practices with the most advanced imaging technology are the ones most likely to add AI diagnostic assistance. The rural clinics, the community health centres, and the healthcare systems in low-income countries, where the need for diagnostic support is most acute, are the least likely to have access to these tools. AI in healthcare risks improving care for the already well-served and leaving the underserved further behind.
The AI diagnostic tool that performs at specialist level on the majority patient demographic and at a significantly lower level on minority demographic groups is not a neutral technology. It is a technology that adds value to the care of the majority while subtly degrading the care of groups that were already receiving lower-quality care. That is not a hypothetical risk. It is a documented pattern in deployed systems. Addressing it requires not just better AI but better data, better evaluation, and better procurement standards.
What Oversight Exists, What It Misses, and Why It Is Not Keeping Up
The FDA regulates AI medical devices — software that meets the definition of a medical device under US law — through a pathway that has evolved significantly since 2021. The FDA’s AI/ML-Based Software as a Medical Device (SaMD) framework requires pre-market review for AI systems that meet certain risk thresholds, post-market surveillance, and transparency about the algorithm’s training data and performance characteristics. As of June 2026, the FDA has cleared or approved over 900 AI-enabled medical devices. This is a meaningful regulatory infrastructure, and it is more than most other countries have managed to put in place.
What it does not adequately cover: the general-purpose LLMs being integrated into clinical workflows that do not meet the definition of a medical device because they are positioned as clinical decision support rather than diagnostic tools; the AI ambient documentation systems that learn from clinical data and generate notes that become part of the permanent medical record; the AI chatbots deployed by healthcare systems for patient-facing communication; and the AI-assisted administrative systems that affect whether and when patients receive care. These systems operate in a regulatory grey zone that the FDA framework was not designed to address. The clinicians and health systems deploying them are making risk assessments without adequate regulatory guidance and without the post-market surveillance infrastructure that would detect population-level harm signals.
The EU AI Act’s approach to healthcare AI is more comprehensive — medical AI is a default high-risk category under the Act, requiring conformity assessment, transparency documentation, and post-market monitoring regardless of whether it meets the device definition — but the August 2026 enforcement date is still weeks away and the implementation guidance for healthcare-specific provisions is still being developed. The most consequential healthcare AI deployments are happening faster than the regulatory frameworks designed to govern them can be written and implemented.
The Principles That Actually Work, From the Settings That Have Applied Them
Prospective clinical validation, not just retrospective benchmarking. Most AI diagnostic tools are validated retrospectively — tested against historical datasets in controlled settings. What matters clinically is prospective validation: how does the system perform when deployed in a real clinical workflow, with real patients, over time, including patients who differ from the training distribution? The gap between retrospective benchmark performance and real-world clinical performance is consistently large and consistently in the direction of the tool performing worse than the benchmark suggests. Requiring prospective clinical validation as a condition of deployment significantly raises the standard — and is the difference between knowing a tool works in a lab and knowing it works in a ward.
Demographic performance disaggregation as a mandatory disclosure. AI medical tools should be required to disclose their performance characteristics not just as population averages but disaggregated by age, sex, race, and ethnicity. A tool that achieves 95% accuracy across the population and 80% accuracy for Black women is not a 95% accurate tool for Black women. The disaggregated number is the one that matters for clinical decisions affecting that patient. This information is frequently available to developers and rarely disclosed to clinicians or regulators at the required level of specificity.
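To make the 95% versus 80% point concrete, here is a minimal sketch of what disaggregated reporting involves. The data is synthetic and the group labels (`majority`, `subgroup`) are placeholders invented for this example. It describes no real tool or study, only how a headline average can hide a weaker subgroup result.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return (overall_accuracy, {group: accuracy}) for labelled predictions.

    Each record is a (group_label, prediction, ground_truth) tuple.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted == actual:
            correct[group] += 1
    overall = sum(correct.values()) / sum(totals.values())
    per_group = {group: correct[group] / totals[group] for group in totals}
    return overall, per_group


# Synthetic evaluation set: 900 majority-group cases, 100 subgroup cases.
records = (
    [("majority", 1, 1)] * 870   # correct predictions
    + [("majority", 1, 0)] * 30  # wrong predictions
    + [("subgroup", 1, 1)] * 80  # correct predictions
    + [("subgroup", 0, 1)] * 20  # wrong predictions
)

overall, per_group = accuracy_by_group(records)
print(f"overall: {overall:.1%}")
for group, accuracy in per_group.items():
    print(f"{group}: {accuracy:.1%}")
```

In this synthetic set the headline figure is 95%, yet the subgroup sits at 80%, because the subgroup makes up only a tenth of the evaluation data. The disaggregated table is what a clinician treating a patient from that subgroup would need to see.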
Human-in-the-loop requirements that actually specify what the human should do. The “human in the loop” requirement that appears in most AI clinical guidance is necessary but insufficient. A clinician who reviews an AI recommendation without the time, training, or cognitive resources to critically evaluate it is providing a rubber stamp, not oversight. Meaningful human oversight in clinical AI requires specifying what the human review should involve, what training is required to perform it, and what workflow design is needed to prevent automation bias from making the human review perfunctory.
AI in healthcare is simultaneously the most genuinely life-saving and the most insufficiently governed AI application in existence. The gap between those two things is not a technology problem. It is a governance problem, and it is being solved too slowly, by institutions that were not designed to move at the speed the technology demands. The patients in the middle of this deserve better than that. They are not currently getting it.
Inside The Machine, Day 22 · June 20 2026
Neal Lloyd writes about technology, human adaptation, and the uncomfortable questions nobody wants to answer at dinner. Inside The Machine is his ongoing daily series on AI.
- Day 01: What Is This Thing?
- Day 02: Survive the Machine
- Day 03: The Great Debate
- Day 04: Who Gets Hurt?
- Day 05: Who’s In Charge?
- Day 06: The Industries That Win
- Day 07: The Human Edge
- Day 08: The Creativity Question
- Day 09: Does AI Feel Anything?
- Day 10: The Data Problem
- Day 11: The Trust Question
- Day 12: The Accountability Gap
- Day 13: The Rewired Brain
- Day 14: Open vs Closed
- Day 15: The New Cold War
- Day 16: Why AI Lies With Confidence
- Day 17: AI Is Eating the Power Grid
- Day 18: The Age of AI Agents
- Day 19: AI Safety Was Never Just Theory
- Day 20: The Surveillance Question
- Day 21: AI and the Future of Education
- Day 22: AI and Your Health



