
Using neuroscience to help human knowledge contribute to AI safety
This article is part of an ongoing series looking at AI in KM, and KM in AI.
The recently published second edition of the International AI Safety Report1 finds that while there have been positive advances in AI safety, serious challenges remain; for example, reliable pre-deployment safety testing of AI models has become harder to conduct.
The arXiv preprint paper2 “NeuroAI for AI Safety” proposes that humans are an attractive model for AI safety. As the only known agents capable of general intelligence, humans perform robustly even under conditions that deviate significantly from prior experiences, explore the world safely, understand pragmatics, and can cooperate to meet their intrinsic goals. Intelligence, when coupled with cooperation and safety mechanisms, can drive sustained progress and well-being.
These properties are a function of the architecture of the brain and the learning algorithms it implements. Paper authors Mineault and colleagues therefore contend that neuroscience may hold important keys to technical AI safety that are currently underexplored and underutilized. In response, they highlight and critically evaluate several paths toward AI safety inspired by neuroscience.
Mineault and colleagues use the technical framework introduced by DeepMind in 2018 to identify three aspects of AI safety that studying the brain could positively impact:
- Robustness – specifying how an agent can safely respond to unexpected inputs. This includes performing well or failing gracefully when faced with adversarial and out-of-distribution inputs, and safely exploring in unknown environments. This can also mean learning compositional representations that generalize well out-of-distribution. Robustness further implies knowing what you do not know, by maintaining a representation of uncertainty, to ensure safe and informed decision-making in novel or uncertain scenarios (a minimal illustrative sketch of this idea follows the list).
- Specification – specifying the expected behavior of an AI agent. A pithy way of expressing this is that we want AI systems to “do what we mean, not what we say”. This includes correctly interpreting instructions specified in natural language despite ambiguity; preventing learning shortcuts that generalize poorly; ensuring that agents solve the real task at hand rather than engaging in reward hacking (i.e. Goodhart’s law); and so on.
- Assurance (or oversight) – being able to verify that AI systems are working as intended. This includes opening the black box of AI systems using interpretability methods; scalably overseeing the deployment of AI systems and detecting unusual or unsafe behavior; or detecting and correcting for bias.
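To make the robustness idea of “knowing what you do not know” concrete, here is a minimal sketch (not from the paper) of how disagreement within an ensemble of models can serve as an uncertainty signal for flagging out-of-distribution inputs. The synthetic data, the tiny logistic-regression ensemble, and the choice of disagreement measure are all hypothetical illustrations.

```python
# Minimal sketch: ensemble disagreement as an uncertainty signal.
# All data and model choices below are synthetic and illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# In-distribution training data: two Gaussian blobs with labels 0 and 1.
X0 = rng.normal(loc=[-2.0, 0.0], scale=0.7, size=(200, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=0.7, size=(200, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(200)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_member(X, y, rng, epochs=500, lr=0.1):
    """Train one logistic-regression ensemble member on a bootstrap resample."""
    idx = rng.integers(0, len(X), size=len(X))
    Xb = np.hstack([X[idx], np.ones((len(X), 1))])   # add bias column
    yb = y[idx]
    w = rng.normal(scale=0.1, size=3)
    for _ in range(epochs):
        p = sigmoid(Xb @ w)
        w -= lr * Xb.T @ (p - yb) / len(yb)
    return w

ensemble = [train_member(X, y, rng) for _ in range(10)]

def disagreement(points, ensemble):
    """Std. dev. of predicted probabilities across members; high = uncertain."""
    pts = np.hstack([points, np.ones((len(points), 1))])
    preds = np.stack([sigmoid(pts @ w) for w in ensemble])
    return preds.std(axis=0)

in_dist = np.array([[-2.0, 0.0], [2.0, 0.0]])    # near the training blobs
out_dist = np.array([[0.0, 8.0], [0.0, -8.0]])   # far from any training data
print("disagreement near training data:", disagreement(in_dist, ensemble))
print("disagreement far from training data:", disagreement(out_dist, ensemble))
```

Members agree closely on inputs that resemble the training data but diverge on inputs the data do not constrain; in practice, richer uncertainty estimates (deep ensembles of large models, Bayesian methods, conformal prediction) would be used, but the principle of treating disagreement as a warning signal is the same.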
Each of Mineault and colleagues’ eight proposals for applying neuroscience to AI safety is listed in Table 1, along with the aspects of AI safety that it is proposed to affect (an illustrative sketch of the brain-data finetuning proposal follows the table).
Table 1: Proposals for how neuroscience can impact AI safety (source: Mineault et al., 2024).
| Proposed method | Summary of proposition | Safety aspect |
| --- | --- | --- |
| Reverse-engineer sensory systems | Build models of sensory systems (“sensory digital twins”) which display robustness, reverse engineer them through mechanistic interpretability, and implement these systems in AI | Robustness |
| Build embodied digital twins | Build simulations of brains and bodies by training auto-regressive models on brain activity measurements and behavior, and embody them in virtual environments | Simulation |
| Build biophysically detailed models | Build detailed simulations of brains via measurements of connectomes (structure) and neural activity (function) | Simulation |
| Develop better cognitive architectures | Build better cognitive architectures by scaling up existing Bayesian models of cognition through advances in probabilistic programming and foundation models | Simulation, Assurance |
| Use brain data to finetune AI | Finetune AI systems through brain data; align the representational spaces of humans and machines to enable few-shot learning and better out-of-distribution generalization | Specification, Robustness |
| Build an evolutionary curriculum | Build safety guardrails in AI by recapitulating the natural evolutionary curriculum | Specification |
| Infer the brain’s loss functions | Learn the brain’s loss and reward functions through a combination of techniques including task-driven neural networks, inverse reinforcement learning, and phylogenetic approaches | Specification |
| Use neuroscience methods for interpretability | Leverage methods from neuroscience to open black-box AI systems; bring methods from mechanistic interpretability back to neuroscience to enable a virtuous cycle | Assurance |
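As a concrete illustration of the “Use brain data to finetune AI” row in Table 1, the sketch below adds an auxiliary representational-alignment term (linear centered kernel alignment, CKA) that pulls a small network’s hidden activations toward neural responses during finetuning. This is one possible reading of the proposal, not the authors’ specific method; the brain_responses tensor is a random stand-in for real recordings, and the 0.5 trade-off weight is arbitrary.

```python
# Illustrative sketch only: finetuning with an auxiliary representational-
# alignment loss. "brain_responses" is a random stand-in for recordings.
import torch
import torch.nn as nn

torch.manual_seed(0)

n_stimuli, n_features, n_neurons, n_classes = 256, 32, 50, 4
stimuli = torch.randn(n_stimuli, n_features)
labels = torch.randint(0, n_classes, (n_stimuli,))
brain_responses = torch.randn(n_stimuli, n_neurons)   # hypothetical recordings

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU())
        self.head = nn.Linear(64, n_classes)
    def forward(self, x):
        h = self.hidden(x)
        return self.head(h), h

def linear_cka(a, b):
    """Linear Centered Kernel Alignment between two sets of representations."""
    a = a - a.mean(dim=0, keepdim=True)
    b = b - b.mean(dim=0, keepdim=True)
    cross = (a.T @ b).norm() ** 2
    return cross / ((a.T @ a).norm() * (b.T @ b).norm())

model = SmallNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
alignment_weight = 0.5   # hypothetical trade-off between task and alignment

for step in range(200):
    optimizer.zero_grad()
    logits, hidden = model(stimuli)
    task_loss = task_loss_fn(logits, labels)
    alignment_loss = 1.0 - linear_cka(hidden, brain_responses)  # higher CKA = closer
    loss = task_loss + alignment_weight * alignment_loss
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"step {step}: task={task_loss.item():.3f} "
              f"CKA={1.0 - alignment_loss.item():.3f}")
```

With real data, the alignment term would be computed between model activations and neural recordings for the same stimuli, and the trade-off weight tuned so that alignment does not come at the cost of task performance.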
Article source: Mineault et al., 2024; CC BY 4.0.
Header image source: Gerd Altmann on Pixabay.
References:
1. Bengio, Y., Clare, S., Prunkl, C., Andriushchenko, M., Bucknall, B., Murray, M., … & Mindermann, S. (2026). International AI Safety Report 2026. UK Government.
2. Mineault, P., Zanichelli, N., Peng, J. Z., Arkhipov, A., Bingham, E., Jara-Ettinger, J., … & Zador, A. (2024). NeuroAI for AI safety. arXiv preprint arXiv:2411.18526.




