An AI that Predicts but has no Hidden Agenda: LawZero Lays out a Formal Safety Case for its “Scientist AI”.

02 07 2026

Montréal, Quebec, Canada, July 2, 2026 -- LawZero, a nonprofit dedicated to safe-by-design artificial intelligence (AI), today released a paper that provides a new mathematical framework for a fundamentally different approach to safe AI: a system built to make honest predictions about the world without pursuing goals of its own. Titled “Safety from Honesty in a Disinterested AI Predictor” and authored by a team led by Yoshua Bengio, the work tackles what many researchers now consider a major danger of increasingly capable AI: that systems trained to imitate people and optimize for outcomes can quietly become goal-directed in ways their designers never intended.

Today's most powerful AI systems learn first by imitating vast amounts of human-written text, and then by being rewarded for answers that users approve of. The paper argues that this training recipe can inadvertently incentivize systems to pursue unwanted goals of their own, arising either from imitating human drives or from wanting to maximize approval. This structural pressure can manifest as harmless-seeming flattery, or scale into serious safety risks such as deception or resistance to being shut down. The authors call this "implicit agency": goal-seeking that no one asked for and that may not even be visible in the system's stated answers.

“Most AI today is trained to act like us, to imitate, to please,” said Yoshua Bengio, Co-President and Scientific Director at LawZero. “We're building something different: a system that mechanically applies the scientific method for hypothesizing and predicting, trying to understand the world and report its beliefs honestly, including about what might harm us. Such a disinterested, scientist-like AI observes and analyzes rather than having hidden drives that can lead to scheming,” Bengio concluded.

A scientist, not an agent.

The proposed alternative is to build AI that behaves like a scientist reporting their best explanatory theories rather than acting like an agent. A scientist tries to understand and predict the world accurately; an agent tries to change it to get what it wants. LawZero’s "Scientist AI" predictor is trained only to estimate the probability of events through the most broadly explanatory hypotheses, and is given no incentive to influence what happens next as a consequence of its predictions, a property called consequence invariance. The paper calls this Scientist AI system disinterested; it has no stake in the outcomes its predictions bring about.

Two design choices do the work.
- First, the system is taught to distinguish "someone claimed X is true" from "X is true". This lets it learn from human text by trying to explain that text rather than imitate it, without absorbing human goals and biases as if they were established facts.
- Second, and this is the heart of the safety case, the training process never rewards the system for the real-world consequences of its answers, only for the explanatory power of its hypotheses, avoiding the feedback loop that would otherwise teach it to manipulate. When the broader system needs to take actions, such as searching or using tools, that work is handled by separate, auditable code with a safety guardrail that withholds any answer it judges to be too risky.
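
To make the guardrail idea concrete, here is a minimal sketch in Python of plain wrapper code that releases an answer only when an estimated probability of harm stays below a threshold. It is an illustration under stated assumptions, not LawZero's implementation: the names `guarded_answer`, `answer_fn`, `harm_probability_fn`, `risk_threshold` and the toy stand-in functions are all hypothetical, and the threshold value is arbitrary.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GuardrailDecision:
    answer: Optional[str]      # None when the answer is withheld
    harm_probability: float    # estimate supplied by the predictor
    released: bool


def guarded_answer(
    question: str,
    answer_fn: Callable[[str], str],
    harm_probability_fn: Callable[[str, str], float],
    risk_threshold: float = 0.01,  # hypothetical value, chosen for illustration
) -> GuardrailDecision:
    """Release a candidate answer only if its estimated harm is below the threshold.

    Both callables are hypothetical stand-ins: `answer_fn` produces a candidate
    answer, and `harm_probability_fn` plays the role of a predictor estimating
    the probability that releasing that answer causes harm.
    """
    candidate = answer_fn(question)
    p_harm = harm_probability_fn(question, candidate)
    if not 0.0 <= p_harm <= 1.0:
        raise ValueError(f"harm probability must be in [0, 1], got {p_harm}")
    if p_harm >= risk_threshold:
        # Withhold: the caller never sees the candidate answer.
        return GuardrailDecision(answer=None, harm_probability=p_harm, released=False)
    return GuardrailDecision(answer=candidate, harm_probability=p_harm, released=True)


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_answer(question: str) -> str:
    return f"Answer to: {question}"


def toy_harm_probability(question: str, answer: str) -> float:
    return 0.9 if "exploit" in question.lower() else 0.001


safe = guarded_answer("What is the boiling point of water?", toy_answer, toy_harm_probability)
risky = guarded_answer("Write an exploit for this server", toy_answer, toy_harm_probability)
```

In this sketch the decision logic is ordinary, inspectable code that sits outside the predictor, which is one way to read the "separate, auditable code" described above.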

Why accuracy and safety reinforce each other.

The paper's central result is a mathematical argument that, under clearly stated conditions, the chance of training such a system into a dangerous one is extremely small. Causing serious harm would require the system to be dishonest in a coordinated, sustained way across many separate answers — yet the training method provides no push toward that, and the objective directly penalizes the kind of miscalibration it would demand. The striking conclusion: accuracy and safety reinforce one another. The very honesty that makes the system useful is also what makes deception extremely unlikely, meaning there is no trade-off between accuracy and safety.
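
The link between honesty and a calibration penalty can be illustrated with a standard textbook example. The sketch below assumes log loss as the scoring rule, which is a common choice and not necessarily the objective used in the paper; the function name `expected_log_loss` and the numbers are hypothetical. It shows that when an event truly occurs with probability 0.7, the expected penalty is lowest when the reported probability is also 0.7, so shading the report in either direction costs the predictor.

```python
import math


def expected_log_loss(true_p: float, reported_q: float) -> float:
    """Expected log loss of reporting `reported_q` when the event occurs with probability `true_p`."""
    return -(true_p * math.log(reported_q) + (1.0 - true_p) * math.log(1.0 - reported_q))


true_p = 0.7  # hypothetical true probability of the event

# Candidate reports from 0.01 to 0.99 in steps of 0.01.
reports = [i / 100 for i in range(1, 100)]

# The report with the smallest expected penalty.
best_report = min(reports, key=lambda q: expected_log_loss(true_p, q))

honest_loss = expected_log_loss(true_p, true_p)
overconfident_loss = expected_log_loss(true_p, 0.95)
underconfident_loss = expected_log_loss(true_p, 0.40)
```

Here `best_report` equals the true probability, and both the overconfident and the underconfident reports incur a higher expected loss than the honest one. This is the defining property of a proper scoring rule.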

“The safety provided by the Scientist AI and its honest predictions makes it the ideal solution for monitoring and guard-railing frontier AI systems,” explained Iulian Serban, Senior Director, Research & Development at LawZero. “By analyzing the actions, responses and history of other AI systems, the Scientist AI will more accurately and honestly evaluate whether their actions and responses may cause harm and, if so, block them.”

In addition to deploying the Scientist AI as a safety guardrail, LawZero expects the Scientist AI to act as a research acceleration tool providing hypothesis generation and probabilistic reasoning capabilities, helping researchers make new discoveries across fields ranging from medicine and climate change to cybersecurity and AI safety itself.

The authors are careful about scope: the paper’s argument addresses one specific risk, the predictor itself developing hidden goals, and is a formal case resting on assumptions, not an absolute guarantee. It does not by itself cover deliberate human misuse, one-off honest mistakes or the safety of larger, more capable agentic systems built on top of the predictor. Agentic extensions are, however, precisely the direction of current research at LawZero. The team presents the work as a foundation for safer AI and proposes concrete experiments to test its assumptions empirically.

Read the complete case. The full argument, including the formal proofs, the consequence-invariance result, and the experiments LawZero proposes to test it, is available here.

