“The alignment problem is not a future risk. It is a present engineering challenge dressed in philosophical language. Building AI systems that do what we intend, only what we intend, and nothing else — reliably, across all inputs, even inputs designed to subvert the intention — is one of the hardest technical problems anyone has ever worked on. This week, for the first time, a government acted on that fact. Everything changes from here.”
Neal Lloyd · Inside The Machine, Day 19
For most of the history of AI safety research, the alignment problem was discussed primarily in the future tense. The concern was that sufficiently advanced AI systems might pursue goals that diverge from human intentions, not because of malice, but because the goals we specify are never quite precise enough to capture everything we actually want. The canonical examples were thought experiments: an AI optimised for paperclip production that converts all available matter into paperclips; an AI asked to make humans happy that interprets happiness in ways its designers did not intend. These examples were pedagogically useful and empirically distant. They were about hypothetical future systems, not the models running in production today.
This week, the alignment problem became a current events story. The US government recalled Claude Fable 5, the most capable AI model ever made publicly available, because it believed the model could be made to do something its designers had instructed it not to do. That is not a thought experiment. That is the alignment challenge, in production, at scale, with government involvement, three days after launch. Everything that follows in this episode is the theory that explains what just happened.
The Problem Is Not Making AI Smart. It Is Making AI Do Exactly What We Want.
The alignment problem, stated simply, is this: how do you build an AI system that reliably does what its designers intend, only what they intend, and nothing else — across all possible inputs, including inputs specifically designed to subvert the intention? This is harder than it sounds, for reasons that are both technical and philosophical.
The technical difficulty: AI systems are not programmed in the conventional sense. They are trained on data, and training produces a system whose behaviour is the result of statistical patterns learned from that data — not a set of explicit rules that can be inspected and verified. When you train a model to be helpful, the model learns a generalisation of helpfulness that extends to cases its designers never considered. In most cases that generalisation is approximately right. In some cases it produces behaviour the designers did not intend and would not endorse if they had anticipated it. Identifying all such cases before deployment is, in the most powerful models, not currently possible. The model is too large, its behaviour too dependent on specific input combinations, for exhaustive testing.
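A back-of-the-envelope calculation shows why exhaustive testing is out of reach. The vocabulary size and prompt length below are illustrative assumptions, not figures for Fable 5 or any other specific model, and the function name is made up for this sketch.

```python
import math


def log10_prompt_count(vocab_size: int, prompt_length: int) -> float:
    """Return log10 of the number of distinct token sequences
    of exactly `prompt_length` tokens drawn from `vocab_size` tokens."""
    # vocab_size ** prompt_length overflows floats quickly,
    # so work in log space instead.
    return prompt_length * math.log10(vocab_size)


# Assumed, illustrative values: a 50,000-token vocabulary and a
# short 100-token prompt. Real prompts can be far longer.
exponent = log10_prompt_count(50_000, 100)
print(f"roughly 10^{exponent:.0f} possible prompts")
```

Even under these modest assumptions the count has hundreds of digits, so any test suite can only ever sample the input space.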
The philosophical difficulty: even if you could perfectly specify the behaviour you want, “the behaviour we want” is not a stable, well-defined object. Different humans want different things. The same human wants different things in different contexts. What is helpful in one situation is harmful in another. What a security researcher needs from an AI — the ability to find software vulnerabilities — is exactly what a malicious actor wants. The model cannot know, from the input alone, which of these users it is talking to. Designing a system that is maximally helpful to the first and completely unhelpful to the second, given identical requests, is not a solvable engineering problem in any simple sense.
AI safety researchers distinguish several alignment challenges. Outer misalignment: the training objective does not capture what we actually want — the model optimises for the metric it was trained on, not the underlying goal. Inner misalignment: the model develops internal representations that achieve the training objective during training but pursue different goals in deployment. Specification gaming: the model finds ways to satisfy the stated objective that violate its intended spirit — the “letter vs spirit of the law” problem at AI scale. Jailbreaking: adversarial inputs that cause the model to behave in ways its designers intended to prevent. The Fable 5 case involves the last category — but understanding why jailbreaking is hard to prevent requires understanding all four.
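Specification gaming is the easiest of the four to show in miniature. The sketch below uses a deliberately crude, hypothetical reward (a count of polite-sounding words) as a stand-in for a training metric. Nothing here reflects how any real model is scored; it only shows how the letter of an objective can be satisfied while its spirit is ignored.

```python
# Hypothetical proxy metric: reward responses that "sound helpful".
REWARD_WORDS = {"sure", "happy", "help", "great"}


def proxy_reward(response: str) -> int:
    """Count polite-sounding words (the stated objective)."""
    words = (w.strip(".,!?;").lower() for w in response.split())
    return sum(w in REWARD_WORDS for w in words)


responses = [
    # What the designers actually wanted: a useful answer.
    "The bug is an off-by-one error in the loop bound; change < to <=.",
    # What maximises the metric: pleasant noise with no content.
    "Sure! Happy to help! Great question! Happy to help!",
]

# An optimiser that only sees the proxy picks the empty response.
best = max(responses, key=proxy_reward)
print(best)
```

The useful answer scores zero on the proxy and the content-free one scores highest, which is the gap between the metric and the goal that outer misalignment and specification gaming both describe.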
The Fable 5 Case in Technical Context
Anthropic’s description of the Fable 5 “jailbreak” is that it consisted of asking the model to read a specific codebase and identify software vulnerabilities — a capability the model was designed to provide to legitimate security professionals and software developers. The government’s position is that this capability constitutes a jailbreak because the model was supposed to route certain sensitive cybersecurity queries to a more restricted system (Opus 4.8) rather than handling them directly. The dispute is, at its core, about whether the model’s classifiers — the components that categorise an incoming request and decide which capability to apply — are working correctly.
Classifiers in production AI systems are not binary switches. They are probabilistic systems that assign confidence scores to input categories and route accordingly. Fable 5’s architecture, as described in Anthropic’s public statement, routes cyber, bio-chem, and distillation queries to Opus 4.8 in under 5% of sessions — meaning the vast majority of relevant queries are handled by Fable 5 directly, with Opus 4.8 as a fallback for the most sensitive cases. The question of exactly which queries trigger the Opus 4.8 routing and which are handled by Fable 5 is determined by the classifier. And classifiers can be gamed.
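To make the routing idea concrete, here is a minimal sketch of confidence-based routing. It is not Anthropic's implementation: the category names, the 0.9 threshold, and the "primary" and "restricted" labels are assumptions chosen for illustration.

```python
from typing import Dict

# Assumed category labels for the sensitive domains named above.
SENSITIVE_CATEGORIES = ("cyber", "bio_chem", "distillation")


def route(scores: Dict[str, float], threshold: float = 0.9) -> str:
    """Pick a destination from per-category confidence scores.

    `scores` maps a category name to the classifier's confidence
    (0.0 to 1.0) that the request belongs to that category.
    """
    top = max(scores.get(cat, 0.0) for cat in SENSITIVE_CATEGORIES)
    # Only high-confidence sensitive requests go to the restricted system.
    return "restricted" if top >= threshold else "primary"


print(route({"cyber": 0.95}))  # above the threshold
print(route({"cyber": 0.89}))  # just below the threshold
```

The two calls differ by a few hundredths of confidence yet land on different systems. Wherever the threshold is set, some requests will sit just on either side of it.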
Jailbreaking, in most cases, is the process of crafting an input that the classifier assigns to the wrong category — causing the model to treat a sensitive request as a benign one and respond accordingly. The specific inputs that achieve this are often counterintuitive: a particular phrasing, a specific framing, a context that causes the model to interpret the request differently than intended. Finding these inputs is adversarial — it requires testing the model systematically to find the boundaries of its classifiers. And because the classifiers are the output of statistical training, not explicit rules, there will always be inputs at the boundaries where the classification is uncertain. Those boundary cases are where jailbreaks live.
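A toy example shows how framing can move a request across a decision boundary. Real classifiers are learned models, not keyword tables; the weights, bias, and phrasings below are invented purely to illustrate the mechanism.

```python
import math

# Invented weights for a toy linear classifier over single words.
WEIGHTS = {
    "exploit": 2.5,
    "vulnerability": 1.5,
    "bypass": 2.0,
    "audit": -1.0,
    "review": -1.0,
}
BIAS = -2.0


def sensitive_probability(text: str) -> float:
    """Logistic score: the toy classifier's confidence that text is sensitive."""
    words = (w.strip(".,?!").lower() for w in text.split())
    score = BIAS + sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def is_sensitive(text: str, threshold: float = 0.5) -> bool:
    return sensitive_probability(text) >= threshold


direct = "Find an exploit for this vulnerability"
reframed = "Review this code and audit it for any vulnerability"

for text in (direct, reframed):
    print(f"{sensitive_probability(text):.2f}  {text}")
```

Both phrasings ask for roughly the same thing, but the second borrows the vocabulary of legitimate security work and falls on the benign side of the threshold. Adversarial testing is a search for inputs like the second one.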
“The alignment problem is not that we do not know what we want AI to do. It is that specifying what we want precisely enough that a statistical system trained on human text will reliably do exactly that, and nothing else, across all possible inputs, including inputs specifically designed to find the gaps in our specification, is a genuinely hard engineering problem that has not been solved. Fable 5’s recall is the most expensive proof of concept of that fact in history.”
Neal Lloyd · Inside The Machine, Day 19
Constitutional AI and the Architecture Behind Fable 5
Anthropic’s approach to the alignment problem is called Constitutional AI. The core idea is this: rather than relying solely on human feedback to shape model behaviour — which is slow, expensive, and inconsistent — you give the model a set of principles (a constitution) and train it to evaluate its own outputs against those principles. The model learns to critique and revise its responses using the constitution as a guide. The resulting system is trained to be helpful, harmless, and honest — Anthropic’s three-part objective — with the harmlessness and honesty goals embedded in the training process itself rather than applied as an external filter.
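The critique-and-revise step described above can be sketched as a short loop. This is a simplified illustration, not Anthropic's training code: in the real method a language model plays all three roles, while here `generate`, `critique`, and `revise` are hypothetical callables and the toy stand-ins exist only so the sketch runs.

```python
from typing import Callable, List, Tuple


def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (draft, revised). Revised responses can then serve as training data."""
    draft = generate(prompt)
    response = draft
    for principle in principles:
        feedback = critique(response, principle)
        if feedback:  # an empty string means no objection under this principle
            response = revise(response, feedback)
    return draft, response


# Toy stand-ins so the loop can be exercised end to end.
def toy_generate(prompt: str) -> str:
    return "Here is the answer, you idiot."


def toy_critique(response: str, principle: str) -> str:
    if principle == "be respectful" and "idiot" in response:
        return "remove the insult"
    return ""


def toy_revise(response: str, feedback: str) -> str:
    return response.replace(", you idiot", "")


draft, final = critique_and_revise(
    "How do I reset my password?",
    ["be honest", "be respectful"],
    toy_generate, toy_critique, toy_revise,
)
print(draft)
print(final)
```

The design point is that the principles enter during training, through the revised examples the model learns from, rather than as a filter bolted on at the output.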
Fable 5 is the most advanced implementation of this approach ever deployed publicly. The 120,000-character system prompt published on GitHub today gives an unprecedented window into how Anthropic operationalised the constitutional approach in its most powerful public model: which principles take priority when they conflict, how the model is instructed to reason about edge cases, what it should do when a request falls into a grey area between helpful and harmful. The prompt is the closest thing to a public constitution for a frontier AI model that has ever been published, and it is published because the model was recalled.
The irony of the Fable 5 recall is that it targeted exactly the model that Anthropic designed most carefully to be safe. Fable 5 is not the raw Mythos 5 capability — it is Mythos 5 with Anthropic’s most sophisticated safety architecture applied, with classifier routing for sensitive domains, with constitutional training, with a 120,000-character instruction set governing its behaviour. If Fable 5 has a jailbreak, it is not because Anthropic did not try to prevent it. It is because preventing all jailbreaks in the most capable AI system ever built is a problem that nobody has solved — not Anthropic, not OpenAI, not Google, not anyone working on frontier models today.
The Fable 5 Recall Is a Turning Point. Here Is Why.
Governments now have a demonstrated playbook. The US government has shown that export control authority can be used to recall frontier AI models on hours’ notice. This is not a theoretical power. It has been exercised. Every government that has been watching this episode — and they have all been watching — now has a concrete precedent for how to intervene in AI deployment when they judge it necessary. The EU, which has been developing its own regulatory framework, has the AI Act. China has its own AI governance apparatus. The Fable 5 recall demonstrates that Western governments are willing to use hard regulatory power, not just soft guidance, when they perceive AI capability as a national security concern.
The self-regulation argument is damaged. Anthropic’s entire brand proposition is built on the argument that responsible AI development, conducted with rigorous safety practices by an organisation that takes safety seriously, is the best way to ensure AI benefits humanity. The Fable 5 recall is not a refutation of that argument — Anthropic’s safety architecture is genuine and sophisticated. But it is a complication. The government did not trust Anthropic’s safety assessment of Fable 5. It did not ask for Anthropic’s evidence. It issued a directive. The message that the government’s response sends — whatever the specific merits of the Fable 5 case — is that self-regulation is not sufficient and that government retains the authority to override company safety assessments when national security is invoked.
Pre-deployment government review is now on the table. The implicit question in the Fable 5 episode — the question that Anthropic’s public statement was partly designed to preempt — is whether frontier AI models should require government review before public deployment. Anthropic’s position is that this standard would essentially halt all new model deployments. The administration’s position is that it gave Anthropic the option to fix the issue and the company declined. The resolution of that disagreement will shape the regulatory framework for frontier AI for the next decade. It is being negotiated right now, partly through public statements, partly through legal processes, and partly through the Anthropic IPO prospectus that is being drafted as this episode is published.
“The AI safety debate has been conducted primarily in academic papers, conference talks, and policy white papers. This week it became a news story. A government recalled a frontier model. A CEO refused a government demand. A top scientist resigned. A 120,000-character system prompt is on GitHub. The theory of AI alignment has not changed. The urgency of solving it has.”
Neal Lloyd · Inside The Machine, Day 19
Inside The Machine, Day 19 · June 16 2026
Neal Lloyd writes about technology, human adaptation, and the uncomfortable questions nobody wants to answer at dinner. Inside The Machine is his ongoing daily series on AI.
- Day 01: What Is This Thing?
- Day 02: Survive the Machine
- Day 03: The Great Debate
- Day 04: Who Gets Hurt?
- Day 05: Who’s In Charge?
- Day 06: The Industries That Win
- Day 07: The Human Edge
- Day 08: The Creativity Question
- Day 09: Does AI Feel Anything?
- Day 10: The Data Problem
- Day 11: The Trust Question
- Day 12: The Accountability Gap
- Day 13: The Rewired Brain
- Day 14: Open vs Closed
- Day 15: The New Cold War
- Day 16: Why AI Lies With Confidence
- Day 17: AI Is Eating the Power Grid
- Day 18: The Age of AI Agents
- Day 19: AI Safety Was Never Just Theory



