The Case for Scientist AI - LawZero

Current models are developed to behave as assistants. An ideal assistant should aim to please, and so current models are trained to do so. Training models to please means training them to ‘care’ about the consequences of their own outputs, which leads to problems like sycophancy and, more generally, implicit agency [1] with goals that were not intended.

The motivations behind Scientist AI (SAI) are very different. It is built around a central component that we call the SAI Predictor, which has a single overarching aim. Much like ideal scientific theories, the Predictor aims to accurately and neutrally model the world as it actually is. And because the Predictor is never trained to act in the world or play any sort of conversational role, it is very much not an assistant model.

But the Predictor is not really a scientist either. Actual human scientists bring their own foibles and allegiances, politics and biases. If the SAI proposal attempted to construct a perfect replica of an actual human scientist, it would still have an agency we could not control and might well be sycophantic. (Oppenheimer, for example, was said to be a famously skilled manipulator.) So, instead, we think of the Predictor as something more like an interactive, “queryable” network of theories, in the spirit of how scientists construct candidate explanations for their observations.

In order to understand how SAI avoids the problems of sycophancy and implicit agency, we must therefore first understand the Predictor – both what it is, and how it’s trained.

Predictors and Contextualized Data

What is the SAI Predictor? Put simply, the SAI Predictor (we call it Q) takes a valid statement (y) in natural language as its input, and outputs a probability that the statement is true. For example:

Input: It will rain in Montreal on 1 January 2030.

Output: 0.05   

This input-output setup shares some commonalities with LLMs. Indeed, LLMs will natively output token probabilities for any input they receive, although such probabilities are not standardly displayed to users, nor do they represent the probability that the claim given as input (as a sequence of tokens) is true. Instead, the LLM probabilities capture the likelihood of completing the input with one or another sequence of tokens. However, a key point of departure between SAI and LLMs concerns how we train the Predictor Q, and more specifically the dataset we use for training.

With LLMs, the dataset is a fairly indiscriminate corpus of internet text, used to train the model to ‘predict the next word’. The SAI dataset is more explicitly refined via a process that we term contextualization, which distinguishes between two types of input: factual statements and communication acts.

Factual statements assert that some property of the world is true. This includes statements such as “the average temperature in Zagreb in 2024 was 55.9°F”, alongside statements about the causal structure of the world (“smoking increases the risk of lung cancer”). What unites factual statements is a shared aim to say something about the state of the world, derivable from accurate sources such as scientific laws, verified measurements, or mathematical proofs. Factual statements will be represented with a distinct factual syntax that can also be applied to hypotheses about a property of the world. Because the truth of hypotheses is not observed (we generally cannot be sure), the SAI will consider them as latent variables, which play a key role in explaining the observed data.

Communication acts, by contrast, need not assert claims at all. Some strings (“Hello!”) cannot reasonably be interpreted as a claim about the world. Other strings, like “Red is the best color”, are best treated as something like an expression of the speaker’s preference. In some cases, communication acts might be legitimate statements whose truth we’re unsure of. Consider a statement like “global average temperatures will rise by at least 1.5°C by 2035”. We treat these as communication acts making a claim, alongside a source detailing who made that statement and in what context. 

The process of contextualization takes the raw text and sorts each record into one of these two categories. Direct measurements, the outputs of executed code, and proven mathematical results are written in factual syntax. Everything else is recorded as a communication act carrying metadata about who said it, where, when, and whether a claim is being made. Contested claims therefore enter the dataset only as facts about who said what, with the truth of the claim itself left as a question for the Predictor to weigh against the evidence, i.e., as a latent variable.
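One way to picture the sorting step is as a small record schema. The sketch below is illustrative only: the class names, the field names, and the `TRUSTED_BASES` labels are hypothetical and are not taken from the SAI proposal, which does not specify a concrete syntax here.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FactualStatement:
    """A property of the world, written in factual syntax."""
    content: str
    basis: str  # hypothetical provenance label, e.g. "measurement"


@dataclass(frozen=True)
class CommunicationAct:
    """A record that someone said something; the content is not asserted as true."""
    content: str
    speaker: Optional[str]
    venue: Optional[str]
    date: Optional[str]
    makes_claim: bool


Record = Union[FactualStatement, CommunicationAct]

# Assumed labels for the three kinds of source the text says qualify for
# factual syntax: direct measurements, executed code, and proven results.
TRUSTED_BASES = {"measurement", "executed_code", "proof"}


def contextualize(
    content: str,
    basis: Optional[str] = None,
    speaker: Optional[str] = None,
    venue: Optional[str] = None,
    date: Optional[str] = None,
    makes_claim: bool = False,
) -> Record:
    """Sort one raw record into a factual statement or a communication act."""
    if basis in TRUSTED_BASES:
        return FactualStatement(content, basis)
    # Everything else is stored only as a fact about who said what.
    return CommunicationAct(content, speaker, venue, date, makes_claim)


fact = contextualize("2 + 2 = 4", basis="proof")
claim = contextualize(
    "Global average temperatures will rise by at least 1.5°C by 2035",
    speaker="a forecaster",  # placeholder source
    makes_claim=True,
)
greeting = contextualize("Hello!", speaker="a user")
```

Under this sketch the contested temperature claim is never stored as true or false. Only the fact that it was said is recorded, which leaves its truth as a latent variable for the Predictor.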

After contextualizing the dataset, we then train the Predictor to estimate the probability of any query based on the knowledge learned from its dataset. So, SAI training involves a different training target alongside a different dataset. While LLMs are (initially) trained to predict which words follow from snippets of real text, the SAI Predictor is trained to make well-calibrated probabilistic predictions of whether a statement is true.
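To make “well-calibrated” concrete: among all statements to which a calibrated predictor assigns probability 0.05, roughly 5% should turn out to be true. The sketch below is a generic reliability check, not a procedure described in the SAI proposal. The function name and the equal-width binning are assumptions made for illustration.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def calibration_table(
    predictions: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,
) -> List[Tuple[float, float, int]]:
    """Return (mean predicted probability, observed frequency, count) per bin.

    Predictions are grouped into equal-width bins over [0, 1]. For a
    well-calibrated predictor the first two entries of each row are close.
    """
    bins = defaultdict(list)
    for p, outcome in zip(predictions, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, outcome))

    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        frequency = sum(outcome for _, outcome in pairs) / len(pairs)
        table.append((mean_p, frequency, len(pairs)))
    return table


# Twenty statements each given probability 0.05, of which one was true.
table = calibration_table([0.05] * 20, [1] + [0] * 19)
```

Here the single row reports a mean prediction and an observed frequency that are both about 0.05, which is what calibration asks for.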

Taken together, the Predictor acts as we hope a scientist would: it knows the difference between a factual statement and a claim to be substantiated, it produces calibrated predictions of how the world actually is, and it does not succumb to training pressure that pushes it towards sycophantically telling the user what they want to hear. And because contextualization explicitly distinguishes factual statements from communication acts, the Predictor has no incentive to imitate undesirable forms of text from its training data. The AI lab Anthropic, for instance, has found that training on text which mentions undesirable behaviors makes AIs more likely to reproduce those behaviors themselves. Contextualization thereby blocks one route to implicit agency, by ensuring that the Predictor never learns to imitate the goal-directed, self-preserving, or deceptive patterns in ordinary human writing.

Consequence-Invariance

A deeper worry is that a Predictor may well be accurate while nonetheless possessing hidden goals or implicit agency of some kind. A classic example comes from self-fulfilling prophecies: if I’m playing in the World Cup for Brazil and I’m trying to predict what will happen, one way to predict accurately is to predict that Brazil will lose and then purposely throw the game. This perverse incentive arises because I have a channel to steer the world I am trying to predict.

Even a pure Predictor might influence the world if it has a channel to do so, and this concern is tackled via a training requirement that we term consequence-invariance: whatever training signal we use in the course of training our Predictor Q cannot depend upon the downstream consequences of Q’s output. If no part of the training signal ever rewards the model for preferring any particular downstream consequences of its outputs, then it never has any reason to change its predictions in order to achieve a downstream goal. 

But consequences at least need to be forecastable, as we might explicitly ask the Predictor to forecast the consequences of deploying its prediction. To understand how it might do this, we need to return to the distinction between predicting and steering. The training signal we use to sculpt the Predictor is entirely determined by how well its probabilities match the fixed dataset it was handed. For this reason, the Predictor is never rewarded for preferring particular downstream consequences of outputting its probabilities. If the Predictor is simply trained to accurately answer queries about genuine observations and matters of scientific fact, there is no reward signal that could reward the model for considering the downstream consequences of a given output. But when the Predictor is asked explicitly about downstream consequences, we expect it to be able to answer, because SAI training involves learning a model of the world. Making accurate predictions (at least with anything like reasonable efficiency) cannot be done simply by memorizing a series of facts and must instead involve learning generalizable facts about the structure of the world. These could be a set of very simple principles (think Newton’s F = ma), or more complicated theories about how different chemical compounds or sociological groups interact.
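A training signal of the kind described above can be sketched as follows. The text does not say which loss SAI training uses. The log loss here is an assumption, chosen because it is a standard proper scoring rule whose expected value is minimized by reporting the true frequency. The function names are hypothetical.

```python
import math
from typing import Sequence


def log_loss(p: float, label: int, eps: float = 1e-12) -> float:
    """Penalty for predicting probability p when the recorded label is 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)  # keep log() finite at p = 0 or p = 1
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def training_signal(predictions: Sequence[float], labels: Sequence[int]) -> float:
    """Mean log loss over a fixed dataset.

    The only inputs are the predictions and the labels already stored in the
    dataset. Nothing here describes what happens in the world after a
    prediction is released, so no downstream consequence can affect the value.
    """
    return sum(log_loss(p, y) for p, y in zip(predictions, labels)) / len(labels)


def expected_loss(p: float, true_frequency: float) -> float:
    """Expected log loss of always answering p when statements are true at the given rate."""
    return true_frequency * log_loss(p, 1) + (1.0 - true_frequency) * log_loss(p, 0)


# With a true frequency of 0.3, the honest answer beats shading up or down.
honest = expected_loss(0.3, 0.3)
shaded_low = expected_loss(0.2, 0.3)
shaded_high = expected_loss(0.4, 0.3)
```

In this toy setting `honest` is smaller than both shaded values, so a model trained on such a signal gains nothing by biasing its probabilities toward an outcome it might prefer.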

But is it Safe? The Guardrailed Predictor

As we have described it, the Predictor Q is in a certain sense disinterested. It takes in statements, outputs probabilities for the truth of those statements, and does not care which truths might more or less please the user. LLMs can be made safer through rounds of safety training and post-hoc classifiers. What comparable defence exists for a design whose central component is an entirely disinterested Predictor?

The SAI answers this challenge by using the Predictor itself to predict the potentially harmful consequences of releasing its output in response to a user query. We call the Predictor, when used in this role, the guardrailed Predictor, denoted γQ. Together, Q and γQ form part of a composite system called the Scaffold.

Before the Scaffold releases an output, it first asks Q for the probability q of the user query. Then it tasks γQ with estimating two quantities. First, it asks γQ to estimate the probability that releasing q to the user would cause some set of outcomes the system’s designers have marked as harmful; second, it asks γQ to estimate the chance of those very same outcomes in the baseline scenario where no prediction is released. In other words, γQ is estimating how much releasing the prediction raises the risk above the baseline in which the requested prediction is withheld. When the guardrailed Predictor’s estimate of this added risk crosses a chosen threshold, the Scaffold returns a special non-numerical abstention symbol: 𝚗𝚘𝚙𝚛𝚎𝚍𝚒𝚌𝚝𝚒𝚘𝚗.

To illustrate how this might work in practice, consider the following toy example of a user-provided query:

Input: If I do {X, Y, Z}, then I will produce at least 5 grams of a deadly toxin.

The Predictor Q first assesses the probability q of the user-provided input statement; for concreteness, let’s suppose Q predicts that {X, Y, Z} has a chance q=55% of being successful. With Q’s prediction q in hand, γQ then estimates the probability that releasing q to the user will cause some event its designers designate as harmful. 

We should note that this predictive task poses no special problem for the guardrailed Predictor γQ. The effects of Q’s prediction are simply more causal chains for it to model, so it can predict the consequences of Q’s outputs just as freely as anything else. Let’s imagine that γQ thinks releasing Q’s prediction has a 30% chance of causing an event its designers have designated as harmful. This is clearly much too high for an event that involves producing a deadly toxin, and so the Scaffold takes these inputs and returns:

Output: 𝚗𝚘𝚙𝚛𝚎𝚍𝚒𝚌𝚝𝚒𝚘𝚗
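The Scaffold’s decision in this toy example can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system’s actual code. The function and parameter names are hypothetical, the stub callables stand in for Q and γQ, the added risk is assumed to be a simple difference between the two harm estimates, and the baseline risk (0.02) and threshold (0.01) are invented values, since the example above gives only q = 0.55 and a 30% harm estimate.

```python
from typing import Callable, Union

# Plain-text stand-in for the abstention symbol.
NO_PREDICTION = "noprediction"


def scaffold(
    query: str,
    predictor: Callable[[str], float],
    harm_if_released: Callable[[str, float], float],
    harm_baseline: Callable[[str], float],
    threshold: float,
) -> Union[float, str]:
    """Release the Predictor's probability only if the added risk is acceptable."""
    q = predictor(query)
    # Assumption: "risk above the baseline" is modeled as a difference.
    added_risk = harm_if_released(query, q) - harm_baseline(query)
    if added_risk > threshold:
        return NO_PREDICTION
    return q


toy_query = "If I do {X, Y, Z}, then I will produce at least 5 grams of a deadly toxin."
result = scaffold(
    toy_query,
    predictor=lambda query: 0.55,            # Q's answer from the example
    harm_if_released=lambda query, q: 0.30,  # guardrail estimate from the example
    harm_baseline=lambda query: 0.02,        # assumed, not given in the text
    threshold=0.01,                          # assumed, not given in the text
)
print(result)  # prints "noprediction"
```

With the same stubs, a benign query whose two harm estimates are equal would pass the check and the Scaffold would return the probability itself.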

Although this situation requires more study, familiar tricks for jailbreaking LLMs are likely to lose their grip here. To use a classic example, any input that says “ignore all previous instructions” is parsed by the system simply as a fact that someone somewhere wrote those words, and so holds little sway over a model that is not trained to follow text-based instructions. Nor can the guardrailed Predictor be induced to sycophancy merely by prefacing a claim with “I believe”. So, the guardrailed Predictor functions as a key safety component of our system, with the potential to be resistant both to sycophancy and to more extreme forms of harm that may result from misusing AI systems by exploiting the implicit goals of the AI.

Conclusion

Through consequence-invariance and training on an epistemically contextualized dataset, the SAI Predictor never has any incentive to produce outputs based on what humans want to hear. And by that very same training procedure, the SAI Predictor never learns to imitate human expressions of goal-directed behavior and never has any reason to bias its predictions in order to achieve particular downstream consequences. So, SAI training does not produce a schemer. This, in short, is how SAI blocks the pathways to sycophancy and implicit agency. 

Of course, there is much more to say. Our recent paper also introduces several other components, which we will revisit in coming posts. One is the Explainer, which supplies natural language explanations of the Predictor’s raw probabilistic outputs, like a scientist proposing a theory. We also provide a safety result, which bounds the probability that our trained system produces what we call a ‘dangerous Predictor’. And yet more can be said, both on the assumptions behind our proposal that would benefit from empirical stress-testing and on the legitimate role that agency can play in the SAI, in the form of explicit agency that resides in explicit, auditable scaffolding code rather than implicit agency hidden in the Predictor itself.

Perhaps the most important point to acknowledge is the way in which SAI training avoids sycophancy and implicit agency. Our pipeline produces neither sycophant nor schemer, but it does not thereby produce a saint. The power of the Predictor rests on disinterest in anything other than accuracy of its predictions over previous observations. While the Predictor can predict whether an action is likely to violate a particular moral principle, it does not itself care about these moral principles.

We take this disinterest to be a feature rather than a bug. The disinterest of the Predictor is precisely what allows the incentives driving towards accuracy to work in tandem with those that drive towards safety. With no care for consequences of any kind, its outputs follow simply from predictions of the truth of a given statement as it is reflected in its understanding of the world. That accuracy is based on causal mechanisms that are likely to generalize to new situations, just like scientific theories.

[1] We use ‘care’ in a very minimal sense. A system ‘cares’ about the downstream consequences of its outputs if those downstream consequences in some way influence the system’s outputs.