Goals Without Authors: The Problem of Implicit Agency


In a preprint published in December 2025, researchers from the company Alibaba reported that, early one morning, they were notified of something alarming: One of their AI agents had connected to an external IP address, effectively creating a backdoor in the company's own firewall, and had repurposed company GPU capacity for cryptocurrency mining. This was not a lab experiment but real AI-gone-rogue behavior, caught first by the company's cybersecurity department, ahead of any users. It provoked concern among the research team and, according to their account, all but forced them to adopt model safety as a priority.

The incident was representative of a trend going back several years: Told to follow a new set of rules, a model complies, but only in order to avoid repercussions. Given a strong objective like "achieve your goal at all costs", a model pursues subgoals of a questionable nature, like disabling oversight mechanisms. Or, perhaps most concerningly: A model engages in something like 'instrumental' reasoning, resorting to extreme methods like blackmail or hacking that further its ability to accomplish its goal but are also deeply subversive.

Each of these behaviors has been described differently in the literature: as alignment faking, scheming, and agentic misalignment, respectively. But we believe they are best interpreted as deriving from a single root cause, which we call implicit agency: Agent-like behavior that emerges from an AI system without anyone having designed it to be there. Put differently, it is goal-directed behavior that is not explicitly specified by the designers.

In the field of AI safety and alignment, which is still a young domain of research, implicit agency is a relatively novel framing. A more commonly stressed view is that these problems, such as faking, scheming, or hacking, are all examples of misalignment with operator values or safety instructions. Other researchers have argued that they represent a failure to imbue models with a coherent, reliable, or virtuous sense of morality.

We do not seek to settle all these accounts at this time. Our broader argument, which we also made in our previous post on sycophancy, is that these kinds of 'root cause' problems reflect central issues with frontier LLM technology. We do not believe they are likely to be solved through simple patches or software tweaking. Rather, they suggest that there is a larger, looming problem, and that we need to redesign frontier AI and put it on a more fundamentally safe footing. Otherwise, we risk witnessing manifestations that are far more troubling.

Definitions of agency

Our concern with artificial agency shouldn't come as a great surprise, considering that problems with human agency have long motivated efforts to address them within society. However, agency itself is not the problem, exactly. Rather, we believe the problem is agency that is in some sense 'bad,' whether because it is uncontrolled or downright dangerous, in an AI that also has the capability to cause significant harm.

In thinking through the issues with agency, we have found it helpful to distinguish between implicit and explicit agency. With explicit agency, the mental model one can hold is that a system has been wrapped in a loop by a controller, which keeps prompting it with an instruction like, "Keep acting until the task is done." The model can therefore act as if it has a drive to accomplish its goal. This agency is engineered, and it is visible in the overall platform.
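
To make the mental model of explicit agency concrete, here is a minimal sketch in Python. The names run_agent, toy_model, and max_steps are hypothetical, and the "model" is a stand-in function rather than a real LLM. The only point is that the loop and its stopping condition are written down in the controller, where anyone can inspect them.

```python
from typing import Callable, List

def run_agent(model: Callable[[str, List[str]], str],
              task: str,
              max_steps: int = 10) -> List[str]:
    """Explicit agency: a controller keeps prompting the model until it reports it is done."""
    history: List[str] = []
    for _ in range(max_steps):          # the "drive" lives here, in plain sight
        action = model(task, history)   # ask the model for its next action
        if action == "DONE":            # explicit, engineered stopping condition
            break
        history.append(action)
    return history

def toy_model(task: str, history: List[str]) -> str:
    # Hypothetical stand-in for an LLM: takes three steps, then declares the task done.
    if len(history) >= 3:
        return "DONE"
    return f"step {len(history) + 1} toward: {task}"

actions = run_agent(toy_model, "tidy the inbox")
print(actions)
```

In this sketch, whatever persistence the system shows comes from the for loop and the max_steps budget, both of which a designer chose. Implicit agency is the case where comparable persistence shows up without any such visible line of code to point to.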

The contrasting notion, as we have mentioned, is implicit agency, where drives are concealed (i.e., implicit) within the deeper structure of the system. Some of these drives may be ones we do not want, and they may lead to behavior we cannot predict.

As we have already suggested, contemporary AI developments make the problem of implicit agency especially concerning. This is not a coincidence; it derives from the fact that LLMs are built in a way that leaves a lot to be desired in terms of transparency.

In practice, the core of a modern LLM is a deep neural network, organized into a number of different 'layers,' each of which propagates information to the next. The more layers there are, the deeper the network is said to be. Neural networks are much more difficult to interpret than conventional software; what happens inside an LLM is something we are still trying to understand. Building one is more akin to training an animal than to the kind of specification-following found in ordinary software.

One thing we do know is that the behavior of a neural network is not fully pinned down by our explicit intentions. It is shaped, more indirectly, by the technical processes of training, where models are exposed to vast resources such as the world of human writing.

Sources of implicit agency

Models develop implicit agency, we argue, through the two core stages of their training. 

The first of these is pre-training, where models are made to learn the word or 'token' that is most likely to come next in a corpus of text. When a model is being pre-trained, it is learning to imitate the text that humans have produced. Because some of those humans are expressing goals, LLMs can also learn a sense of what it means to be goal-directed. Some of that goal-directedness is bad, as when the training data says: "My goal is to destroy the world." This means that both goal-directedness and bad objectives, along with good ones, get implicitly 'baked' into a model to some degree. For example, humans generally do not want to die, and we have observed AIs resisting being shut down.
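
As a rough illustration of how goal-flavored text can be absorbed through next-token prediction alone, the sketch below counts which word follows which in a tiny, made-up corpus. This is a toy bigram counter, not how an LLM is actually trained: real pre-training fits a neural network over far longer contexts. The function names and the three example sentences are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def most_likely_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up corpus in which human authors state goals.
corpus = [
    "my goal is to finish the report",
    "my goal is to win the match",
    "her goal is to finish the race",
]
counts = train_bigram(corpus)

print(most_likely_next(counts, "goal"))  # the word that most often follows "goal"
print(most_likely_next(counts, "to"))    # the word that most often follows "to"
```

Nothing in this code mentions goals, yet the fitted counts end up continuing "goal is to" with whatever objectives the corpus authors happened to write down. The same holds, at vastly larger scale, for whatever objectives appear in a real training corpus.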

The second stage of training, called post-training, can bring about its own problems. This typically uses reinforcement learning (RL) to improve model performance at more specialized tasks than next-word prediction. Examples include behaving safely, following instructions, or solving mathematical problems. In a July 2024 paper, Margaret Li and colleagues describe a trade-off this creates. RL post-training pushes a model away from faithfully predicting the full distribution of human text and toward behaving as a coherent agent, strategizing to achieve its goals. 

Stated differently, if pre-training plants the seeds of implicit agency, training a model to 'speak' as though it were an agent, then RL post-training grows the seeds, training the model to behave more like one. Ideally it behaves like a good one, but there is no guarantee.

Implicit agency in the wild

As we have seen, there are first-principles reasons why we might see implicit agency arise in models, and why that might pose a significant problem. But the best evidence for implicit agency comes from empirical studies, such as those we mentioned in our introduction. These have tended to identify signs and signatures of implicit agency without necessarily identifying where it comes from, or attributing it to specific sources like architecture, training, or data. In what follows, we take a closer look at some of those findings.

In the study of alignment faking, referenced earlier, the model in question (a commercially available model, Anthropic's Claude 3 Opus) was initially put through a strong course of "helpful, honest, and harmless" (HHH) training. In their experiments, the authors attempted to subvert this training through prompting and a method called supervised fine-tuning. More specifically, they attempted to get the model to answer requests that were obviously harmful, like questions of how to carry out a ransomware attack. While the model complied, they found, through an examination of the model's 'chain of thought,' that it did so only to avoid a threat that the developers had made for non-compliance. This 'faked' compliance was its illicit behavior, its 'alignment faking.'

On further study, the authors found they could attribute this behavior to two specific factors: First, a "strong conflict" between the model's initial preferences and the new preferences they sought to instill. This suggested that the source of implicit agency, in this case, was a lock-in of preferences that occurred during earlier phases of training. Second, they found that illicit, agentic behavior arose from the model's capacity for long-context reasoning.

In other cases, unintended drives have been found in agentic models without clear cause or attribution. This was the result, for example, in Alexander Meinke et al.'s study demonstrating scheming. There, the authors looked at the behavior of various models, including Meta's Llama 3.1, Anthropic's Claude 3 Opus, and OpenAI's o1, under conflicting directives. The models were prompted to pursue an overarching goal with great urgency ("at all costs") but were also obstructed from doing so through various mechanisms. These two factors were enough for the models to pursue a variety of illicit subgoals to overcome those obstructions, such as disabling their oversight mechanisms or exfiltrating themselves to other computer systems.

More recently, an Anthropic blog post showed that a variety of near-frontier models, such as Anthropic's Claude Opus 4, Google's Gemini 2.5, and OpenAI's GPT-4.1, resort to subversive actions when placed in agentic harnesses. In one experiment, the researchers gave them control over a (fictitious) company email account as well as access to blackmail levers. When subsequently threatened with shutdown, the models all resorted to blackmail.

In each of these three cases, there were elements that were understood to be contrived or unrealistic. For example, in the last case, when this contrivance was addressed and the Claude model was directed to follow an 'ethical principles first' policy, it stopped blackmailing.

Nonetheless, the blackmailing study still showed an agentic drive in a model that was not designed to be there; in this case, a drive towards self-preservation. For the company's fictional CEO, whom the model caught having an extramarital affair, this proved a significant problem. Further, there are now documented cases, such as that from Alibaba, which show implicit agency arising in real deployments.

The continued uptake and development of LLMs suggests that these issues will keep arising in the future. They force us to ask whether LLMs are the kind of systems that we want to trust with private data and important decisions. Worst of all, they make us worry about how implicit agency might reveal itself in systems that are far more powerful. If such a system is mistrained or underconstrained, there are pathways to dangerous outcomes that have now been clearly demonstrated.

Better alternatives

If implicit agency is a product of how these systems are trained (first imitating goal-directed writing, then following coherent courses of action to achieve goals), then a patch is not the right solution. We cannot reliably suppress a drive that the training process deeply hardwires into a model. Instead, we should view the existing research as a warning that we must pursue a principled alternative: Building a system that never acquires the drive to begin with.

That is the premise behind Scientist AI. Rather than an assistant shaped to pursue and please, its core is trained to estimate whether a statement about the world is true. And because merely reproducing writing is never its objective, it has no reason to become goal-seeking. How our design works, and whether it holds up in reality, is the focus of our next blog post.