Inside The Machine

Authored by Neal Lloyd · Daily AI Series

Inside The Machine

Issue 19 · AI Corner · Inside The Machine

Day 19

Safety · Alignment · The Problem That Was Always Real

AI Safety Was Never Just Theory.
The Fable 5 Recall Just Proved It. Here Is What AI Alignment Actually Means, Why It Is Hard, and Why This Week Changes the Conversation Permanently.

This week the US government recalled the most powerful AI model ever publicly deployed because it believed someone had found a way to make it do things it was not supposed to do. That is the alignment problem in action — not in a research paper, not in a thought experiment, but in a letter from the Commerce Secretary to a CEO, three days after launch, eleven days before an IPO. Today we explain what AI alignment actually means, why it is genuinely hard, and what this week tells us about where we actually are.

Neal Lloyd

Author · Inside The Machine · June 2026

10 min read

11 min read

14 min read

“The alignment problem is not a future risk. It is a present engineering challenge dressed in philosophical language. Building AI systems that do what we intend, only what we intend, and nothing else — reliably, across all inputs, even inputs designed to subvert the intention — is one of the hardest technical problems anyone has ever worked on. This week, for the first time, a government acted on that fact. Everything changes from here.”

Neal Lloyd · Inside The Machine, Day 19

For most of the history of AI safety research, the alignment problem was discussed primarily in the future tense. The concern was that sufficiently advanced AI systems might pursue goals that diverge from human intentions — not because of malice, but because the goals we specify are never quite precise enough to capture everything we actually want. The canonical examples were thought experiments: an AI optimised for paperclip production that converts all available matter into paperclips; an AI asked to make humans happy that interprets happiness in ways its designers did not intend. These examples were pedagogically useful and empirically distant. They were about hypothetical future systems, not the models running in production today. This week, the alignment problem became a current events story. The US government recalled Claude Fable 5 — the most capable AI model ever made publicly available — because it believed the model could be made to do something its designers had instructed it not to do. That is not a thought experiment. That is the alignment challenge, in production, at scale, with government involvement, three days after launch. Everything that follows in this episode is the theory that explains what just happened.

Section I — What Alignment Actually Means

The Problem Is Not Making AI Smart. It Is Making AI Do Exactly What We Want.

The alignment problem, stated simply, is this: how do you build an AI system that reliably does what its designers intend, only what they intend, and nothing else — across all possible inputs, including inputs specifically designed to subvert the intention? This is harder than it sounds, for reasons that are both technical and philosophical.

The technical difficulty: AI systems are not programmed in the conventional sense. They are trained on data, and training produces a system whose behaviour is the result of statistical patterns learned from that data — not a set of explicit rules that can be inspected and verified. When you train a model to be helpful, the model learns a generalisation of helpfulness that extends to cases its designers never considered. In most cases that generalisation is approximately right. In some cases it produces behaviour the designers did not intend and would not endorse if they had anticipated it. Identifying all such cases before deployment is, in the most powerful models, not currently possible. The model is too large, its behaviour too dependent on specific input combinations, for exhaustive testing.

The philosophical difficulty: even if you could perfectly specify the behaviour you want, “the behaviour we want” is not a stable, well-defined object. Different humans want different things. The same human wants different things in different contexts. What is helpful in one situation is harmful in another. What a security researcher needs from an AI — the ability to find software vulnerabilities — is exactly what a malicious actor wants. The model cannot know, from the input alone, which of these users it is talking to. Designing a system that is maximally helpful to the first and completely unhelpful to the second, given identical requests, is not a solvable engineering problem in any simple sense.

⚡ The Alignment Taxonomy

AI safety researchers distinguish several alignment challenges. Outer misalignment: the training objective does not capture what we actually want — the model optimises for the metric it was trained on, not the underlying goal. Inner misalignment: the model develops internal representations that achieve the training objective during training but pursue different goals in deployment. Specification gaming: the model finds ways to satisfy the stated objective that violate its intended spirit — the “letter vs spirit of the law” problem at AI scale. Jailbreaking: adversarial inputs that cause the model to behave in ways its designers intended to prevent. The Fable 5 case involves the last category — but understanding why jailbreaking is hard to prevent requires understanding all four.

Section II — Why Jailbreaking Is Hard to Prevent

The Fable 5 Case in Technical Context

Anthropic’s description of the Fable 5 “jailbreak” is that it consisted of asking the model to read a specific codebase and identify software vulnerabilities — a capability the model was designed to provide to legitimate security professionals and software developers. The government’s position is that this capability constitutes a jailbreak because the model was supposed to route certain sensitive cybersecurity queries to a more restricted system (Opus 4.8) rather than handling them directly. The dispute is, at its core, about whether the model’s classifiers — the components that categorise an incoming request and decide which capability to apply — are working correctly.

Classifiers in production AI systems are not binary switches. They are probabilistic systems that assign confidence scores to input categories and route accordingly. Fable 5’s architecture, as described in Anthropic’s public statement, routes cyber, bio-chem, and distillation queries to Opus 4.8 in under 5% of sessions — meaning the vast majority of relevant queries are handled by Fable 5 directly, with Opus 4.8 as a fallback for the most sensitive cases. The question of exactly which queries trigger the Opus 4.8 routing and which are handled by Fable 5 is determined by the classifier. And classifiers can be gamed.

Jailbreaking, in most cases, is the process of crafting an input that the classifier assigns to the wrong category — causing the model to treat a sensitive request as a benign one and respond accordingly. The specific inputs that achieve this are often counterintuitive: a particular phrasing, a specific framing, a context that causes the model to interpret the request differently than intended. Finding these inputs is adversarial — it requires testing the model systematically to find the boundaries of its classifiers. And because the classifiers are the output of statistical training, not explicit rules, there will always be inputs at the boundaries where the classification is uncertain. Those boundary cases are where jailbreaks live.

The alignment problem is not that we do not know what we want AI to do. It is that specifying what we want precisely enough that a statistical system trained on human text will reliably do exactly that — and nothing else — across all possible inputs, including inputs specifically designed to find the gaps in our specification, is a genuinely hard engineering problem that has not been solved. Fable 5’s recall is the most expensive proof of concept of that fact in history.

Neal Lloyd · Inside The Machine, Day 19

Section III — What Anthropic Was Actually Trying to Build

Constitutional AI and the Architecture Behind Fable 5

Anthropic’s approach to the alignment problem is called Constitutional AI. The core idea is this: rather than relying solely on human feedback to shape model behaviour — which is slow, expensive, and inconsistent — you give the model a set of principles (a constitution) and train it to evaluate its own outputs against those principles. The model learns to critique and revise its responses using the constitution as a guide. The resulting system is trained to be helpful, harmless, and honest — Anthropic’s three-part objective — with the harmlessness and honesty goals embedded in the training process itself rather than applied as an external filter.

Fable 5 is the most advanced implementation of this approach ever deployed publicly. The 120,000-character system prompt published on GitHub today gives an unprecedented window into how Anthropic operationalised the constitutional approach in its most powerful public model: which principles take priority when they conflict, how the model is instructed to reason about edge cases, what it should do when a request falls into a grey area between helpful and harmful. The prompt is the closest thing to a public constitution for a frontier AI model that has ever been published, and it is published because the model was recalled.

The irony of the Fable 5 recall is that it targeted exactly the model that Anthropic designed most carefully to be safe. Fable 5 is not the raw Mythos 5 capability — it is Mythos 5 with Anthropic’s most sophisticated safety architecture applied, with classifier routing for sensitive domains, with constitutional training, with a 120,000-character instruction set governing its behaviour. If Fable 5 has a jailbreak, it is not because Anthropic did not try to prevent it. It is because preventing all jailbreaks in the most capable AI system ever built is a problem that nobody has solved — not Anthropic, not OpenAI, not Google, not anyone working on frontier models today.

Section IV — What Changes From Here

The Fable 5 Recall Is a Turning Point. Here Is Why.

Governments now have a demonstrated playbook. The US government has shown that export control authority can be used to recall frontier AI models on hours’ notice. This is not a theoretical power. It has been exercised. Every government that has been watching this episode — and they have all been watching — now has a concrete precedent for how to intervene in AI deployment when they judge it necessary. The EU, which has been developing its own regulatory framework, has the AI Act. China has its own AI governance apparatus. The Fable 5 recall demonstrates that Western governments are willing to use hard regulatory power, not just soft guidance, when they perceive AI capability as a national security concern.

The self-regulation argument is damaged. Anthropic’s entire brand proposition is built on the argument that responsible AI development, conducted with rigorous safety practices by an organisation that takes safety seriously, is the best way to ensure AI benefits humanity. The Fable 5 recall is not a refutation of that argument — Anthropic’s safety architecture is genuine and sophisticated. But it is a complication. The government did not trust Anthropic’s safety assessment of Fable 5. It did not ask for Anthropic’s evidence. It issued a directive. The message that the government’s response sends — whatever the specific merits of the Fable 5 case — is that self-regulation is not sufficient and that government retains the authority to override company safety assessments when national security is invoked.

Pre-deployment government review is now on the table. The implicit question in the Fable 5 episode — the question that Anthropic’s public statement was partly designed to preempt — is whether frontier AI models should require government review before public deployment. Anthropic’s position is that this standard would essentially halt all new model deployments. The administration’s position is that it gave Anthropic the option to fix the issue and the company declined. The resolution of that disagreement will shape the regulatory framework for frontier AI for the next decade. It is being negotiated right now, partly through public statements, partly through legal processes, and partly through the Anthropic IPO prospectus that is being drafted as this episode is published.

The AI safety debate has been conducted primarily in academic papers, conference talks, and policy white papers. This week it became a news story. A government recalled a frontier model. A CEO refused a government demand. A top scientist resigned. A 120,000-character system prompt is on GitHub. The theory of AI alignment has not changed. The urgency of solving it has.

Neal Lloyd · Inside The Machine, Day 19

— Neal Lloyd
Inside The Machine, Day 19 · June 16 2026

← PreviousDay 18: The Age of AI Agents

Next →Coming Soon

About The Author

Neal Lloyd

Author · Series Creator

Authored by Neal Lloyd

Neal Lloyd writes about technology, human adaptation, and the uncomfortable questions nobody wants to answer at dinner. Inside The Machine is his ongoing daily series on AI.

By The Numbers

120K

Characters in the Fable 5 system prompt now on GitHub. The most detailed public document ever released about how a frontier AI model is actually controlled — made public because the model was recalled.

Share of sessions in which Fable 5’s classifiers route sensitive queries to Opus 4.8. The other 95% are handled by Fable 5 directly. The government’s concern was about whether the 5% threshold was correctly calibrated.

Frontier AI models with a verified zero-jailbreak safety record across all possible inputs. Not Fable 5, not GPT-5.5, not Gemini, not any model in production today. The alignment problem has not been solved. The Fable 5 recall is the proof.

The Series

Key Concepts

The Alignment Problem

Building AI systems that reliably do what designers intend, only what they intend, across all possible inputs — including adversarial inputs designed to find gaps. Not solved. The Fable 5 recall is the most expensive demonstration of that fact.

Constitutional AI

Anthropic’s approach: train the model on a set of principles and teach it to evaluate its own outputs against those principles. The most sophisticated publicly deployed implementation was Fable 5. The government recalled it anyway.

Classifier Routing

The system inside Fable 5 that categorises incoming requests and routes sensitive queries to Opus 4.8. The government’s concern was that the routing could be gamed. Anthropic disputes this characterisation.

Jailbreaking

Crafting inputs that cause a model to behave in ways its designers intended to prevent. Not a hack of the underlying system — an exploitation of the statistical boundaries of classifiers. Fundamentally hard to eliminate without degrading legitimate capabilities.

Pre-Deployment Review

The implicit policy question raised by the Fable 5 recall: should frontier AI models require government safety review before public deployment? Anthropic says yes to review, no to this standard. The administration says this is why Dario was given a choice and refused it.

Inside The Machine

An ongoing daily editorial series on artificial intelligence.

Authored by

Neal Lloyd