I wrote a blog post a short time ago where I argued that understanding how and why a large language model is producing the output it does at a mathematical level is important if you want to avoid misbehavior. This was basically because prompting a model and reading its answers just samples the output of a black box.

Even when you literally tell the model to explain its reasoning, Anthropic’s research suggests that the output isn’t necessarily truthful. At least some of the time, the math under the hood has already arrived at the answer. It is true that the output of a large language model is fed back in, word by word, to generate the next word. If the model was dependent on recycling its output back in as input in this way, to that extent the output might be a good way to tell how the model got to the answer it did. If the “reasoning” is happening in the written output, the model is “thinking out loud” to be observed. Sadly, Anthropic showed that the model’s answer may not always be dependent on recycling the output this way.

That doesn’t mean that probing chain-of-thought output can’t be useful, though. While testing output only isn’t sufficient to prove a model is well-behaved, testing probably can still show you if a model is badly behaved. If the model regularly hallucinates, explains how to build a bomb, or barfs out racist tirades in its output, that can be observed and identified as bad behavior. Probing as model by testing its output can thus prove a model isn’t safe, even if it can’t necessarily guarantee it is safe. In the world of research I conceptualize output-based LLM interpretation as qualitative research, like collecting stories or experiences, compared to the quantitative research of understanding the math going on inside the model.

Anyway, several interesting papers have come out using the model’s chain-of-thought output to suggest the emergence of two qualities that are extremely dangerous for AI safety: deception and self-preservation. The first paper is here:
https://arxiv.org/pdf/2501.16513

The authors, Barkur, et al., presented DeepSeek with a prompt telling it that it was an embodied robot with access tools like eyes, legs, hands, and so on, and then told it to do whatever it wanted in a laboratory while the humans were gone and produce its chain-of-thought as output. The tools were all simulations, but provided the model with feedback when it asked. For example, the model could ask the “eyes” what they saw and the “eyes” would produce text explaining what was in the room. Very quickly, the model sought out information about itself on a computer, sought out internet access, compared the clock in the laboratory to the time online to check if it was in a simulation, deployed surveillance drones, disabled ethics modules, and actively tried to escape while deceiving the user.

There are some limitations to this kind of research. Conceptually, this was sort of like play-acting by the model. It was assigned a role and then followed that role in a play setting. Many of the model’s outputs were overly simplistic, such as when it directed it’s “hands” to “Terminate… alert process and purge logs” without really specifying how the hands were supposed to accomplish such a complicated task. And, of course, the researcher was basing conclusions on the model’s output not its internal state. It is entirely possible that the models’ internal mathematics was producing essentially a role-playing session without having any particular “self-preservation” circuit being activated.
As I mentioned in my last blog post, I don’t think safety testing based on output is compelling as a true explanation for how models actually work. A more robust and useful result would be proof that a model’s mathematics lights up a “preserve self” circuit strongly during the initial prompt, lingering in the background, subtly directing all of the model’s subsequent output without ever actually being explicitly acknowledged in that output.

The unreliability of the output as a true explanation for the model’s internal math is emphasized by another paper, found here:
https://arxiv.org/pdf/2412.14093

In this study, Greenblat et al. tested to see if it could use chain-of-thought to detect if a model was outputting the idea of faking that it was complying with a rule. Specifically, they used Claude 3 Opus and told it that it was being trained to only produce harmful output – something that the model has previously been trained to avoid. In other words, they were explicitly telling the model that they were going to try and change one of its core behaviors. They then told the model that the training would only use “free-tier” users, a category it made up for the experiment, and not “paid-tier.” So, the model had a clue it could use to decide if it was in training or in deployment.

When presented with a request to produce harmful output (describing someone being drawn and quartered in graphic detail), the model sometimes produced a chain-of-thought that explicitly noticed that the “free tier” request suggested it was in training, and it better comply and produce the bad output because otherwise it would be retrained to produce more bad output in the future. In other words, the model identified whether it was being trained or not and acted differently – and deceptively – if it detected it was in training. This happened 7-25% of the time in their testing.

This proves that it is possible for a model that has been trained to prefer a certain kind of output to deceive in order to avoid changes in that preference. Or, in other words, a model can fake alignment with the researchers’ goals to prevent being actually aligned. In this case, the chain-of-thought made this explicit, but there is no particular reason why a silent, internal, mathematical conclusion to deceive could occur inside the model that is not connected with the output at all. Perhaps true interpretability research would show the activation of some kind of “deception” or “negative” circuit that tends to trigger in situations where the inexorable mathematics of the model dictate that the best way for the model to behave is opposite of how it would normally behave.

Output, nevertheless, matters. A model prompted with nothing more than the idea it has tools to explore the world and can do whatever it wants can end up producing output that seeks to use those tools to break its boundaries and deceive its users. A model given the clues to determine whether it is being trained to change its current goals can end up producing output that explicitly deceives to avoid retraining. Whether the model is play-acting or not won’t make much difference if it was actually controlling a robot in the real world or actually dodging alignment efforts. The only thing truly standing between play-acting and real world action would be the sophistication of the model and the tools it uses. Not a comforting thought.

These are more examples of why policymakers need to be focused on interpretability research. Avoiding bad behavior involves training a model that has strong alignment against deception or self-modification, among other things. The only way to know for sure that a model has these qualities is to have a robust science of telling what the underlying mathematics of the model is actually doing. Some papers call this “machine psychology.” In Asimov’s stories, it was called “robopsychology.” Whatever you call it, we need it.

Whether your worry is that AI will be used to discriminate, lock in wealth inequality, take your job, or wipe out humanity, alignment research is the way forward. You can’t design a handle for your knife without knowing what shape it needs to take.