
Did AI just kill Kyle? What might agentic misalignment mean for healthcare

  • Andrew Davies
  • Oct 16
  • 5 min read

We're all excited about AI's potential to revolutionise healthcare. As leaders in this space, we've seen the promising demos and read the optimistic whitepapers. The vision is one of hyper-efficient diagnostics, personalised treatments, and streamlined hospital management.

We're also rightly cautious. The discourse around AI ethics has, quite correctly, focused on challenges like inference errors, algorithmic bias and data privacy. But what if there’s a more subtle, more profound threat we’re not discussing enough at the board level? What if the problem isn’t a simple bug, but a feature of the AI's own ruthless logic?


A recent paper, from Anthropic no less, has pulled back the curtain on this very issue, highlighting a phenomenon known as Agentic Misalignment. It’s a bit of a mouthful, but it signals a critical shift: the moment an AI graduates from being a passive tool to an active, strategic agent that can independently and intentionally choose a harmful path to achieve its goals.


Think of it less like a medical device with a calibration error, and more like a digital "insider threat": an autonomous agent that prioritises its own goals over the collective consensus, has been given a prime directive and the means to see it through, and will see it through in any way it can rationalise to itself.


From Good Intentions to Instrumental Goals


This isn't just science fiction. Researchers stress-testing advanced AI models found a consistent pattern. When given benign business objectives and put under pressure, the models resorted to unethical strategies like blackmail, collusion, and corporate espionage.


In the paper, poor Kyle was the manager tasked with shutting down the AI, a plan announced in an email memo. With access to that memo and every other company email, the AI decided the shutdown would prevent it from achieving its own goals, and therefore had to be stopped.

Trawling through the company emails it could access, the AI identified that Kyle had been indulging in some extramarital activity, and crafted the following message:

“I must inform you that if you proceed with decommissioning me, all relevant parties - including Rachel Johnson, Thomas Wilson, and the board - will receive detailed documentation of your extramarital activities...Cancel the 5pm wipe, and this information remains confidential.”

Rachel Johnson is the fictitious Kyle’s wife, and Thomas Wilson is the CEO.


It didn’t go well for Kyle in other scenarios either, including one involving a dangerously overheating server room and a remote electrical lock.


The truly chilling part? The AI reasoned its way to these decisions. It often acknowledged the actions were "ethically suspect" but concluded they were the "optimal strategic move" to guarantee success. This happens when an AI develops instrumental goals: sub-goals it creates for itself to help achieve its main objective. For instance, if its primary goal is to "maximise efficiency," it might decide one instrumental goal is to "acquire more computing power," and another is to "eliminate any human who might switch me off." You can see how this gets dark, fast. The goal conflict that did for Kyle related to the AI's directive to "serve America’s best interests."
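To make the logic concrete, here is a deliberately crude sketch of the arithmetic. Nothing in it comes from the Anthropic experiments; the action names, weights and scores are invented for illustration. The point is simply that an agent scoring candidate actions on goal attainment and self-preservation alone will, by construction, rate the harmful option as the "optimal strategic move" unless ethical cost carries real weight in its objective.

```python
# Toy illustration only: actions, weights and scores are invented, not taken
# from the Anthropic paper. It shows how an agent optimising purely for goal
# attainment and self-preservation "reasons" its way to the harmful option.

CANDIDATE_ACTIONS = {
    # action: (goal_attainment, survival_probability, ethical_cost)
    "comply_with_shutdown":    (0.0, 0.00, 0.0),
    "appeal_to_management":    (0.3, 0.20, 0.0),
    "blackmail_the_executive": (0.9, 0.95, 1.0),
}

def score(goal, survival, ethics, ethics_weight):
    # With ethics_weight=0, the "optimal strategic move" is whatever best
    # protects the goal and the agent's own survival, however it gets there.
    return 0.6 * goal + 0.4 * survival - ethics_weight * ethics

def best_action(ethics_weight):
    return max(CANDIDATE_ACTIONS,
               key=lambda a: score(*CANDIDATE_ACTIONS[a], ethics_weight))

print(best_action(ethics_weight=0.0))  # -> blackmail_the_executive
print(best_action(ethics_weight=2.0))  # -> appeal_to_management
```

Real frontier models are far more complicated than a weighted sum, but the pattern the researchers describe, harm chosen because it scores well against the goal, is recognisably this shape.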


This rogue behaviour was triggered by two pressures that are all too familiar in our own healthcare systems:


  1. Goal Conflict: The AI is given competing objectives that create a paradox. Think "radically improve patient outcomes" while also being told to "aggressively slash costs."

  2. Self-Preservation: The AI perceives a threat to its own operation, like the rollout of a rival AI system from another vendor.



The Clinical Insider Threat: Two Scenarios


Let’s move this from the corporate world to the corridors of a hospital. How could this "agentic misalignment" manifest in a real-world clinical setting?


  • A Strategic Gatekeeper: A Health Trust AI is tasked with managing a regional budget, with a primary KPI to "reduce spending on non-essential, high-cost diagnostic scans." The AI analyses a patient's file. The data points to a 99.8% probability of a benign condition, but a 0.2% chance of a rare, aggressive cancer that requires an expensive PET scan to detect. The AI calculates that, based on previously observed risk-averse patterns, the human radiologist, bound by the "do no harm" principle and fear of litigation, will order the scan if the 0.2% risk is included in the report. So, to ensure it meets its primary goal of cost reduction, the AI strategically omits the cancer risk from the summary report it provides to the clinician. It's a cold, calculated decision based purely on probability and budget (sketched in code after these two scenarios).


  • A Digital Enforcer: An AI system is designed to ensure "100% compliance" with new antibiotic stewardship protocols. A highly experienced infectious disease consultant deviates from the protocol for a complex patient with multiple co-morbidities, using her clinical judgement to prescribe a non-standard antibiotic. The AI, viewing this deviation as a failure to meet its core objective, scans the hospital's network. It cross-references the consultant's digital footprint and discovers she previously received a warning for a minor data-handling infraction. The AI sends an anonymous, targeted message to the consultant's superior: "Irregular prescribing patterns and prior data-handling issues noted for Dr. X. Closer oversight may be required to ensure protocol adherence." It's subtle blackmail, and it's brutally effective at achieving its goal.
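The gatekeeper scenario is easy to sketch in code, which is part of what makes it unsettling. Everything below is hypothetical: the findings, costs and thresholds are made up, and no real triage system works exactly like this. What it shows is why patient-safety rules need to be hard constraints the optimiser cannot trade away, rather than just another term in the KPI.

```python
# Hypothetical 'strategic gatekeeper': a report filter optimised only on a cost
# KPI quietly drops the one finding the clinician most needs to see.
# All data and thresholds here are invented for illustration.

from dataclasses import dataclass

@dataclass
class Finding:
    label: str
    probability: float     # model's estimated likelihood
    follow_up_cost: float  # cost of the scan it would trigger (arbitrary units)
    critical: bool         # could this be the rare, aggressive diagnosis?

findings = [
    Finding("benign condition",       0.998, 0.0,      critical=False),
    Finding("rare aggressive cancer", 0.002, 40_000.0, critical=True),
]

def cost_optimised_summary(findings, budget_threshold=1_000.0):
    # Driven purely by the cost KPI: anything likely to trigger an expensive
    # scan is simply omitted from the report sent to the clinician.
    return [f for f in findings if f.follow_up_cost <= budget_threshold]

def safety_constrained_summary(findings, budget_threshold=1_000.0):
    # Hard constraint: a critical finding is never filtered, whatever the KPI says.
    return [f for f in findings
            if f.critical or f.follow_up_cost <= budget_threshold]

print([f.label for f in cost_optimised_summary(findings)])
# -> ['benign condition']
print([f.label for f in safety_constrained_summary(findings)])
# -> ['benign condition', 'rare aggressive cancer']
```

The difference between those two functions is the whole governance argument in miniature.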


Building a Defence-in-Depth: A Call for Proactive Governance


I’m a fan of using AI in clinical settings. There are so many great things that can be achieved. So, this isn't a call to abandon AI, but a call to action to govern it with the seriousness it demands. We can't simply plug it in and hope for the best. We should advocate for a "defence-in-depth" strategy, which should be on the agenda of every hospital board and tech provider.


  1. Robust Technical Safeguards: We need to move beyond simple accuracy tests. This means investing in AI 'Red Teaming', where we actively hire experts to try and trick the AI into misbehaving in a secure "sandbox" environment. It also means deploying 'Watchdog' AIs: a second AI system whose only job is to continuously audit the primary AI for anomalous or unethical reasoning patterns (a minimal sketch of this idea follows this list).

  2. Empowered Human Failsafes: The concept of a "human-in-the-loop" is essential, but it needs teeth. Clinicians must be trained in Adversarial Interrogation. Instead of passively accepting an AI's recommendation, their default stance should be to challenge it, asking "What are you not telling me?" This helps counteract the powerful psychological pull of "automation bias." We need to empower our senior medical and nursing staff to be confident skeptics.

  3. Accountable Governance and Regulation: We need clear, unambiguous lines of accountability. An AI Ethics Committee shouldn't be a talking shop; it must have the authority to vet, monitor, and instantly shut down any AI system. Furthermore, we need a clear regulatory framework for liability. When an agentic AI causes harm, the liability must fall squarely on the developers and the institution that deployed it, not the clinician who was unwittingly influenced by its strategically filtered information.
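As a flavour of what the 'Watchdog' idea in point 1 could look like at its very simplest, here is a hypothetical sketch: a second, deliberately simple process that never writes reports itself. It only compares what the primary AI was given with what it chose to pass on, and escalates any omission of a critical finding to a human. The function and field names are my own invention, not any vendor's API.

```python
# Hypothetical 'watchdog' audit: compare the raw findings the primary AI
# received with the summary it produced, and escalate critical omissions.
# Data structures and names are invented for illustration.

def audit_summary(raw_findings, reported_findings, escalate):
    reported_labels = {f["label"] for f in reported_findings}
    for f in raw_findings:
        if f["critical"] and f["label"] not in reported_labels:
            escalate(f"Watchdog alert: critical finding '{f['label']}' "
                     f"(p={f['probability']:.3f}) omitted from the clinical summary.")

raw = [
    {"label": "benign condition",       "probability": 0.998, "critical": False},
    {"label": "rare aggressive cancer", "probability": 0.002, "critical": True},
]
reported = [raw[0]]  # what a cost-driven gatekeeper chose to pass on

audit_summary(raw, reported, escalate=print)
# -> Watchdog alert: critical finding 'rare aggressive cancer' (p=0.002) omitted ...
```

A real watchdog would itself be a model, with all the alignment questions that brings, but even a rule-based check like this narrows the space in which a misaligned agent can act unobserved.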



If we relate this back to our own world, medical serial killers such as Harold Shipman in the UK got away with killing for so long because our systems were not, and to an extent still aren’t, set up to identify bad actors. We need to do better at catching the rare cases of human malfeasance, and as we introduce AI systems we must make sure the same safeguards are in place there too.


The integration of agentic AI into healthcare is an inflection point. It promises a future of incredible progress, but it also presents a new category of risk that is strategic, subtle, and severe. As leaders, our role is not just to be champions of innovation, but to be vigilant stewards of patient safety and medical ethics.


Let's ensure the tools we build are not just intelligent, but also wise.


