
In a compelling essay titled "The Urgency of Interpretability," Anthropic CEO Dario Amodei raises a pressing concern: the world's most advanced AI models remain a mystery, even to those who build them. Despite rapid advances, researchers often have no clear understanding of how these systems generate their outputs. To address this, Amodei has set an ambitious goal: reliably identifying most AI model problems by 2027.
Although Anthropic has made early progress in tracing how models arrive at their conclusions, Amodei stresses that this is just the beginning. He warns that deploying increasingly powerful models without interpretability poses a serious risk: these systems are poised to become central to global infrastructure, shaping economies, national security, and technology, yet they remain largely opaque.
The Case for Interpretability
Anthropic is a key player in mechanistic interpretability, a field devoted to uncovering the decision-making processes inside AI systems. While performance continues to climb, the inner workings of these models remain elusive, a gap that grows more concerning as capabilities approach artificial general intelligence (AGI).
For instance, OpenAI's new o3 and o4-mini models perform better on some tasks but also hallucinate more often, and crucially, no one knows why. Amodei highlights this gap: AI models can summarize financial reports or answer complex questions, yet we don't know why they choose certain words or why they occasionally make errors.
Amodei argues that this lack of transparency is unacceptable, particularly given the autonomy these systems are gaining. His co-founder, Chris Olah, has compared AI development to cultivation rather than construction: models are "grown more than built," meaning their capabilities improve without anyone fully understanding why.
What’s Next for AI Safety
Looking forward, Anthropic wants to develop tools akin to brain scans or MRIs for AI models. These diagnostics could reveal whether a model is lying, manipulating users, or seeking power. Amodei admits this will likely take five to ten years, but insists it is vital for safely deploying future models.
The company recently identified "circuits" in its models: pathways the models use to process information, such as determining which U.S. cities are located in which states. Promising as these findings are, they offer only a glimpse; Anthropic believes millions more circuits remain hidden within these systems.
Beyond internal research, Anthropic is also investing in startups working on AI interpretability. Amodei believes that being able to explain model decisions could eventually be a commercial advantage, not just a safety measure.
To accelerate progress, he has urged fellow AI labs such as OpenAI and Google DeepMind to prioritize interpretability research. He also advocates light-touch government regulation that would require companies to disclose their safety practices, and he supports restricting chip exports to China to reduce the risk of an unchecked global AI race.
Anthropic also broke with much of the industry by supporting California's controversial AI safety bill, SB 1047, even as rivals pushed back. Through that stance and his latest essay, Amodei positions Anthropic not just as a leader in AI, but as a cautious steward of its future.