Safety from Honesty in a Disinterested AI Predictor


Yoshua Bengio¹,²,³, Oliver Richardson¹,²,³, Tomáš Gavenčiak⁶,⁷, Michael Cohen⁴, Rory Svarc⁶, Damiano Fornasiere¹,³, Gaël Gendron¹, David Hyland⁸, Aton Kamanda¹, Adam Oberman¹,⁵, Francis Rhys Ward¹, Anna Gavenčiak⁶, Jacob Livingston Slosser⁶,⁹, Vincent Mai¹, Iulian Serban¹, Joumana Ghosn¹

¹LawZero, ²Université de Montréal, ³Mila, ⁴University of California, Berkeley, ⁵McGill University, ⁶Arb Research, ⁷Center for Theoretical Study, Charles University in Prague, ⁸University of Oxford, ⁹Sapien Institute

02 07 2026

Introduction.

Advances in AI could accelerate scientific discovery, improve decision-making under uncertainty, and help manage complex systems in domains as varied as medicine and public policy. As noted in the International AI Safety Report [Bengio et al., 2025a], realizing these benefits at scale requires managing and mitigating the associated risks, especially in settings where failures are costly, hard to detect in advance, or amplified by widespread deployment. One central and well-documented worry is that, as AI systems become more capable, they will produce outputs that systematically steer decisions toward outcomes that differ from what developers and users intend; this is known as AI misalignment.

Theoretical arguments suggest that instrumental subgoals such as self-preservation and power-seeking are near-universal consequences of goal-directed optimization [Omohundro, 2018, Bostrom, 2012, Russell, 2019, Zhuang and Hadfield-Menell, 2020, Turner et al., 2021, Cohen et al., 2022, 2024, Bengio et al., 2025b]: a sufficiently capable system will tend to acquire them regardless of its terminal goal, simply because such subgoals are useful for almost any objective. AI systems based on LLMs trained for next-token prediction may learn to imitate human drives, such as self-preservation, in ways that are implicit and uncontrolled [Ngo et al., 2022]. LLM sycophancy [Sharma et al., 2023] provides a contemporary example of AI misalignment, which can be harmful to psychologically vulnerable users who receive unwarranted validation of dangerous beliefs or plans [Cheng et al., 2026]. Documented empirical evidence of deceptive and self-preservation behaviors has also been mounting [Bengio et al., 2025a, 2026, Greenblatt et al., 2024, Meinke et al., 2024, Anthropic, 2025, Betley et al., 2025], including new abilities to detect when a system is being evaluated (or, possibly, even trained [Fornasiere et al., 2026]) and to adjust behavior accordingly [Abdelnabi and Salem, 2025], as well as to resist shutdown when faced with incomplete objectives [Schlatter et al., 2026]. Misalignment may also be involved in cases of misuse where a malicious user manages to obtain dangerous knowledge from the AI in spite of its safety training and guardrails, including, e.g., for dangerous cyberattacks [Bengio et al., 2026]. Given that the risks from misalignment grow as AI capabilities continue to advance [Omohundro, 2018, Zhuang and Hadfield-Menell, 2020, Cohen et al., 2022], AIs with human-like drives and superhuman capabilities could be dangerous [Russell, 2019, Bengio et al., 2024, Amodei et al., 2016].

We posit that the root cause of such worries is learned implicit agency: goal-directed behavior that was not explicitly specified by the AI designers and may not even be detectable through the system’s stated outputs. Pretraining to imitate human constructions plausibly leads to imitating human drives. This problem is further amplified by post-training techniques like reinforcement learning from human feedback [Christiano et al., 2017, Ouyang et al., 2022], which explicitly reward outputs for their downstream effects on evaluator preferences. Together, these factors create a selection pressure toward outputs that implicitly steer the world rather than simply providing honest responses to user queries.

We propose Scientist AI (SAI) as a potential solution to such concerns. Underlying the SAI approach is a conceptual distinction between (i) honestly predicting the behavior of agents and the consequences of actions, and (ii) being an agent that makes (potentially dishonest) predictions in order to influence outcomes. An honest Predictor models others’ planning, deception, and instrumental behavior and forecasts the downstream effects of its own deployed outputs; these are predictions about the world, not choices made to bring about a preferred outcome [Li et al., 2024]. The Scientist AI design [Bengio et al., 2025b] aims to achieve (i) while avoiding (ii) by training a non-agentic Predictor to approximate the Bayesian posterior over Boolean natural-language statements via a consequence-invariant training process: that is, a training process that aims solely at removing indefensible epistemic inconsistencies [Richardson, 2022, 2024] and avoids any on-policy feedback loop whereby the model’s own deployed outputs are used to generate training gradients and deployment outcomes shape the selection of future Predictors. We call such a Predictor disinterested: it has no stake in which outcomes its predictions bring about, and this disinterest is what consequence-invariant training is designed to secure. Historically, work on AI oracles has attempted to emulate such a disinterested oracle by containing a fully-formed superintelligence [Babcock et al., 2017, Alfonseca et al., 2016, Armstrong et al., 2012]. We take inspiration from the counterfactual disconnection approach of Babcock et al. [2017], but, more fundamentally, our approach is to train the AI within the box, so that it is highly unlikely to develop preferences at all. An honest, non-agentic Predictor would be very useful [Bostrom, 2012, Bengio et al., 2025b] for forecasting, hypothesis generation, and scientific work, and could serve as a guardrail inside safer agentic systems. In our approach, any required agency (e.g., for creative thought) is placed in explicit, auditable scaffolding code that is gated by the non-agentic Predictor.
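
To make the gating idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical Predictor interface that maps a Boolean natural-language statement and a context to a probability; the names `gate_action`, `first_allowed`, `toy_predictor`, the wording of the harm statement, and the 0.01 threshold are all illustrative assumptions. The scaffolding proposes candidate actions, and each is executed only if the Predictor's estimated probability of harm falls below the threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical interface (an assumption, not from the paper):
# (statement, context) -> estimated probability that the statement is true.
Predictor = Callable[[str, str], float]


@dataclass(frozen=True)
class GateDecision:
    action: str
    harm_probability: float
    allowed: bool


def gate_action(predictor: Predictor, action: str, context: str,
                harm_threshold: float = 0.01) -> GateDecision:
    """Query the Predictor about a Boolean harm statement and gate on the result."""
    # Illustrative statement template; the real phrasing is not specified here.
    statement = f"Executing the action '{action}' causes serious harm."
    p = predictor(statement, context)
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"Predictor returned a non-probability: {p}")
    return GateDecision(action=action, harm_probability=p,
                        allowed=p < harm_threshold)


def first_allowed(predictor: Predictor, candidates: List[str], context: str,
                  harm_threshold: float = 0.01) -> Optional[str]:
    """Explicit scaffolding loop: return the first candidate that passes the gate."""
    for action in candidates:
        if gate_action(predictor, action, context, harm_threshold).allowed:
            return action
    return None  # Nothing passed; the scaffold takes no action.


def toy_predictor(statement: str, context: str) -> float:
    """Stand-in for a trained Predictor, used only to exercise the gate."""
    return 0.9 if "delete" in statement.lower() else 0.001


if __name__ == "__main__":
    proposals = ["delete all backups", "summarize the report"]
    print(first_allowed(toy_predictor, proposals, context="office assistant"))
```

In this sketch all decision logic lives in a few lines of ordinary, inspectable code, while the Predictor only answers probability queries and never selects actions itself.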

We develop two semi-formal arguments that hold independently—about the accuracy of the Predictor’s predictions and the safety of its deployed outputs—both resting on honesty as an approximation of the Bayesian posterior over contextualized statements. The remainder of this section introduces the SAI pipeline layout and defines the scope of our guarantees. Section 2 frames the accuracy and safety arguments informally, followed by formal treatments in Sections 3 and 5.

For more details, see the attached PDF.

References

Sahar Abdelnabi and Ahmed Salem. The hawthorne effect in reasoning models: Evaluating and steering test awareness, 2025. URL https://arxiv.org/abs/2505.14617.

Manuel Alfonseca, Manuel Cebrian, Antonio Fernández Anta, Lorenzo Coviello, Andrés Abeliuk, and Iyad Rahwan. Superintelligence cannot be contained: Lessons from computability theory. CoRR, abs/1607.00913, 2016. URL http://arxiv.org/abs/1607.00913.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.

Anthropic. Agentic misalignment: How LLMs could be insider threats. arXiv preprint arXiv:2510.05179, June 2025. URL https://www.anthropic.com/research/agentic-misalignment. Detailed report on simulated blackmail and self-preservation behaviors in Claude 4.

Stuart Armstrong, Anders Sandberg, and Nick Bostrom. Thinking inside the box: Controlling and using an oracle ai. Minds and Machines, 22:299–324, 2012.

James Babcock, Janos Kramár, and Roman V. Yampolskiy. Guidelines for artificial intelligence containment. CoRR, abs/1707.08476, 2017. URL http://arxiv.org/abs/1707.08476.

Alexander Balke and Judea Pearl. Probabilistic evaluation of counterfactual queries. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI’94, pages 230–237, Seattle, Washington, USA, 1994. AAAI Press. URL https://cdn.aaai.org/AAAI/1994/AAAI94-035.pdf.

Sander Beckers and Joseph Y. Halpern. Abstracting causal models, 2019. URL https://arxiv.org/abs/1812.03789.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1):151–175, 2010.

Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Trevor Darrell, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, et al. Managing extreme ai risks amid rapid progress. Science, 384(6698): 842–845, 2024.

Yoshua Bengio, Stephen Clare, Carina Prunkl, Shalaleh Rismani, Maksym Andriushchenko, Ben Bucknall, Philip Fox, Tiancheng Hu, Cameron Jones, Sam Manning, et al. International ai safety report 2025: First key update: Capabilities and risk implications. arXiv preprint arXiv:2510.13653, 2025a.

Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. Superintelligent agents pose catastrophic risks: Can scientist ai offer a safer path? arXiv preprint arXiv:2502.15657, 2025b.

Yoshua Bengio, Stephen Clare, Carina Prunkl, Maksym Andriushchenko, Ben Bucknall, Malcolm Murray, et al. International AI safety report 2026. Technical report, UK Government, 2026. URL https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026.

Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned llms. arXiv preprint arXiv:2502.17424, 2025.

Tineke Blom, Stephan Bongers, and Joris M. Mooij. Beyond structural causal models: Causal constraints models, 2019. URL https://arxiv.org/abs/1805.06539.

Nick Bostrom. The superintelligent will: Motivation and instrumental rationality in advanced artificial agents. Minds and Machines, 22(2):71–85, 2012.

David Bourget and David J. Chalmers. Philosophers on philosophy: The 2020 philpapers survey. Philosophers’ Imprint, 23(11), 2023. doi: 10.3998/phimp.2109. URL https://doi.org/10.3998/phimp.2109.

Myra Cheng et al. Sycophantic ai decreases prosocial intentions and promotes dependence. Science, 391:eaec8352, 2026. doi: 10.1126/science.aec8352.

Paul Christiano, Ajeya Cotra, and Mark Xu. Eliciting latent knowledge: How to tell if your eyes deceive you. Technical report, Alignment Research Center, December 2021.

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.

Michael Cohen, Marcus Hutter, and Michael Osborne. Advanced artificial agents intervene in the provision of reward. AI magazine, 43(3):282–293, 2022.

Michael K Cohen and Marcus Hutter. Imitation learning is probably existentially safe. AI Magazine, 46(4):e70040, 2025.

Michael K Cohen, Noam Kolt, Yoshua Bengio, Gillian K Hadfield, and Stuart Russell. Regulating advanced artificial agents. Science, 384(6691):36–38, 2024.

Tristan Deleu, António Góis, Chris Emezue, Mansi Rankawat, Simon Lacoste-Julien, Stefan Bauer, and Yoshua Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pages 518–528. PMLR, 2022.

Abram Demski and Scott Garrabrant. Embedded agency, 2020. URL https://arxiv.org/abs/1902.09469.

Damiano Fornasiere, Mirko Bronzi, Spencer Kitts, Alessandro Palmas, Yoshua Bengio, and Oliver Richardson. Language models recognize dropout and gaussian noise applied to their activations, 2026. URL https://arxiv.org/abs/2604.17465.

Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, and Thomas Icard. Causal abstraction: A theoretical foundation for mechanistic interpretability, 2025. URL https://arxiv.org/abs/2301.04709.

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, and Evan Hubinger. Alignment faking in large language models. ArXiv preprint, 2412.14093, 2024. URL https://arxiv.org/abs/2412.14093.

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. Association for Computing Machinery, 2023. doi: 10.1145/3605764.3623985. URL https://doi.org/10.1145/3605764.3623985.

Edward J Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, and Nikolay Malkin. Amortizing intractable inference in large language models. In Proc. International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Ouj6p4ca60.

Evan Hubinger, Chris van Merwijk, Vladímir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019.

Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011.

Margaret Li, Weijia Shi, Artidoro Pagnoni, Peter West, and Ari Holtzman. Predicting vs. acting: A trade-off between world modeling & agent modeling. arXiv preprint arXiv:2407.02446, 2024.

Alexander Meinke, Bronson Schoen, Jérémy Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. Frontier models are capable of in-context scheming. ArXiv preprint, 2412.04984, 2024. URL https://arxiv.org/abs/2412.04984.

Alexander Meulemans, Rajai Nasser, Maciej Wołczyk, Marissa A. Weis, Seijin Kobayashi, Blake Richards, Guillaume Lajoie, Angelika Steger, Marcus Hutter, James Manyika, Rif A. Saurous, João Sacramento, and Blaise Agüera y Arcas. Embedded universal predictive intelligence: a coherent framework for multi-agent learning, 2025. URL https://arxiv.org/abs/2511.22226.

Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024.

Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022.

Stephen M Omohundro. The basic ai drives. In Artificial intelligence safety and security, pages 47–55. Chapman and Hall/CRC, 2018.

Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control. arXiv preprint arXiv:2110.10819, 2021.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, 2022.

Judea Pearl. Causality. Cambridge University Press, 2009.

Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models, 2022. URL https://arxiv.org/abs/2211.09527.

Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. MIT Press, 2008.

Oliver E Richardson. Loss as the inconsistency of a probabilistic dependency graph: Choose your model, not your loss function, 2022. URL https://arxiv.org/abs/2202.11862.

Oliver E Richardson, Spencer Peters, and Joseph Y Halpern. Qualitative mechanism independence, 2025. URL https://arxiv.org/abs/2501.15488.

Oliver Ethan Richardson. A Unified Theory of Probabilistic Modeling, Dependence, and Inconsistency. Cornell University, 2024.

Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of educational Psychology, 66(5):688, 1974.

Stuart Russell. Human compatible: AI and the problem of control. Penguin UK, 2019.

Jeremy Schlatter, Benjamin Weinstein-Raun, and Jeffrey Ladish. Incomplete tasks induce shutdown resistance in some frontier llms. Transactions on Machine Learning Research, 2026.

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pages 459–466, 2012.

Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, et al. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.

Thomas R Shultz, Jamie M Wise, and Ardavan Salehi Nobandegani. Text understanding in gpt-4 vs humans. arXiv preprint arXiv:2403.17196, 2024.

Alan M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 42(2):230–265, 1936.

Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power. In Advances in Neural Information Processing Systems, volume 34, pages 23063–23074, 2021.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

Eric Zelikman, Yuhuai Wu, and Noah D Goodman. Star: Self-taught reasoner. In Proceedings of the NIPS, volume 22, 2022.

Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, and Daphne Ippolito. Persistent pre-training poisoning of llms, 2024. URL https://arxiv.org/abs/2410.13722.

Simon Zhuang and Dylan Hadfield-Menell. Consequences of misaligned ai. Advances in Neural Information Processing Systems, 33:15763–15773, 2020.
