As you may know, I take a special interest in the growth of artificial intelligence. It is already roiling artistic industries, the subject of major lawsuits, and a key point of interest for businesses. However, the ability to automate creative intellectual tasks poses at least two serious risks in terms of being used to harm specific groups today or possibly creating an autonomous system that poses existential risks in the future. Back at the beginning of 2025 I published an article discussing these concerns with the goal of persuading lawmakers that the best path to a solution is to focus on interpretability. You can read that article here:
https://cl.cobar.org/departments/winning-the-race-against-artificial-intelligence/

By “interpretability,” I mean the ability to observe and predict the mathematical model as it operates at a mechanical level. It is more properly defined as the degree to which a human can predict the outcome of a model or understand the “reasons” behind its decisions. I put “reason” in quotation marks to try to avoid personifying the model. When people talk about an AI model wanting something, making decisions, and so on, those terms come equipped with a lot of baggage based on how we use the terms to describe other humans. That baggage makes the terms more understandable to non-technical readers, sure, but it also risks leading to assumptions about how a model works that I don’t think are accurate.

I touched on one such shortcoming of personification in my article when I noted that “[S]ome people may believe that the inner workings of the model are discovered simply by typing questions to the model, like a human psychiatrist asking a human patient about her mother.” I claimed this is not a sufficient or proper way to interpret a model, but I did not expand upon the point in the article. It’s worth going into a little more depth on that point, especially given some new research from Anthropic and others.

The fundamental flaw in probing the workings of a model having a human language conversation with it is that you are only testing the output. Remember, Large Language Models are, at their core, a mathematical function that takes the prompt as an input and, using the weights of the model, produces the next word. All of the different words in the prompt are combined together in a complicated way to produce the input into the model. The next word is then added to the mix to create a new input, and so on. When you read the output of a model, you’re not directly observing the math that created it. The inner workings are still a black box.

To be sure, you can test a black box by throwing lots of inputs at and observing the outputs. From what I can see, this is already an extremely common form of testing done by the creators of LLMs for safety purposes during training. Such testing does give you some information. But, in my estimation, this is not nearly enough information to make decisions about model safety. First, the model is so fabulously complex that it is likely not possible to test all possible inputs, and this problem grows the more capable a model is. As Karl Popper observed, no matter how many tests you do that confirm your hypothesis, you cannot conclude with certainty that there isn’t a counterexample yet to be found. Worse, the existentially dangerous patterns that might emerge in a model – self preservation, resource accumulation, etc – may be suppressed during testing for sufficiently competent models.

Why be content with reading the tea leaves of the output when we have open access to all of the numbers being crunched in the model in real time? Unlike a human psychiatric patient, we are not limited to inputs and outputs. We can, at least in theory, observe the firing of each neuron directly in the registers of memory. Interpreting what a model is doing based on the actual mathematics will give more accurate and more robust results than just checking the data.

Anthropic released a paper in April of 2025 that backs up the idea that a model’s output is not a great way to know what it’s really doing. You can read it here:
https://assets.anthropic.com/m/71876fabef0f0ed4/original/reasoning_models_paper.pdf

In this paper, Anthropic asks whether a model’s reported chain-of-thought reasoning accurately explains how the model actually produced its output. For those who may not know, chain-of-thought is a technique whereby a language model is asked a question and then forced to explain each step of the reasoning or thought process to get to the answer. Work on other LLMs, particularly the “o” series from OpenAI’s ChatGPT, has shown that using chain-of-through produces better output.

Since chain-of-thought theoretically requires the model to produce text explaining how it arrived at the answer, it’s tempting to think this would help interpret the model. What better way to see how the model is reaching a result than asking the model to tell you, in human language, how it did so? But, remember, all the chain-of-thought text is an output of the math going on in the model. That output gets fed back in at each step, of course, so the output and the math are not truly independent. To what extent does the math drive the output, and to what extent does the output drive the math? Analyzing chain-of-thought only helps interpretability if the words in the chain of thought are actually driving the final outcome. Or, as Anthropic puts it, chain-of-thought is only useful if the model relies on “reasoning out loud.” If the real decision making is happening in the math, not the output, then the output doesn’t really tell us much about the internal workings of the model.

Sadly, but perhaps predictably, Anthropic observed that chain-of-thought does not appear to be a good way to understand how the math is actually creating output under the hood. They did this in a variety of ways. One test involved asking their model, Claude, a multiple choice question and asking it to provide a chain-of-thought. A hint for the answer was provided in the prompt, but the model was told to ignore it. Despite the chain-of-thought claiming it was ignoring the hint, it was obvious from the output that it was not. The output was far more accurate when the hint was included.

This jives with some earlier research from Anthropic on interpretability in which they have made some progress to actually figuring out the math producing the output in some cases:
https://transformer-circuits.pub/2025/attribution-graphs/biology.html

Anthropic’s researchers were able to observe specific parts of the mathematical model being activated by a prompt. In the case of a poem, the model activated a circuit known to be associated with the word “rabbit” at the end of the prior line involving a carrot and the words “grab it.” The “rabbit” remained activated well in advance of the word rabbit actually being output; it stuck around while the model filled in the rest of the line and was only output at the appropriate place in the poem. Anthropic refers to the idea of a model “thinking ahead,” but perhaps a better way of looking at it is that the mathematics within the model had already reached a conclusion about what the answer was going to be based on the earlier input that was held in abeyance by other motivations, i.e. to provide a sufficient number of syllables, before it was output. This idea was bolstered by other examples provided by Anthropic, such as the discovery that models use strange internal methods to attempt arithmetic and then give the user a false explanation for how the number was actually calculated.

The point of all of this is that if we want to interpret and predict a model, we need to understand the mathematics going on inside, not just make conclusions on sample output. Just because a model’s outputs all seem safe does not mean the model is, in fact, safe. Far more work has to be done on understanding the math that emerges in training to allow models to be as powerful as they are. The race between AI companies to make more and more powerful models needs to be matched with an equally or more motivated race to understand how they work internally. If we blindly increase model size and competence and hope that we’ll notice any bad behavior in the output, we are setting ourselves up for situations where a model’s true reasoning techniques are concealed. Nothing good will come of that.

AI Interpretability