
Advances & challenges in foundation agents: Section 1.3.1 – From language models to AI agents
This article is Chapter 1, Section 1.3.1 of a series of articles featuring the book Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems.
Large language models can already analyse prose, write code, and argue a point, yet they live in a closed book: they read tokens, write tokens, and forget the scene once the page turns. An agent, by contrast, survives in the wild. Think of walking home after work: eyes check traffic, feet adjust to kerbs, memory recalls a shortcut, an inner film plays possible detours, hunger tugs the route towards a deli, and the whole routine improves with every trip. To grant an LLM the same street-smarts we must graft on several faculties, including (but not limited to) perception, action, memory, world-model, motivation, and learning, each as real as the words it speaks. The paragraphs that follow sketch those faculties, the headaches they bring, and how they fit into a single perception–cognition–action loop that anchors this series.
Perception: seeing and hearing beyond tokens. A text-only model is the cognitive equivalent of reading ticker tape in a dark room. Humans, in contrast, fuse vision, audition, and touch in parallel cortical streams [1]. Multimodal models such as GPT-4 already accept images [2], yet an agent that sorts parts on a bench or watches market charts needs deeper channels. Three problems arise: fusion (aligning different sensor streams); noise (coping with glare, static, or hostile inputs); and task-aware attention (deciding which slice of the torrent matters now). Good perception is not a dashboard of data; it is a spotlight that obeys the mission.
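Task-aware attention can be sketched in a few lines: each sensor stream gets a relevance score for the current task, and a softmax turns the scores into fusion weights, so the spotlight follows the mission. The streams and scores below are invented for illustration, not taken from the book.

```python
import math

def attention_weights(scores: dict[str, float]) -> dict[str, float]:
    """Softmax over per-stream relevance scores -> fusion weights summing to 1."""
    exps = {name: math.exp(s) for name, s in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# While crossing the street, vision and audition matter more than touch.
weights = attention_weights({"vision": 2.0, "audition": 1.0, "touch": -1.0})
print(max(weights, key=weights.get))  # vision
```

A real agent would compute the scores from the task description and current observations; the softmax merely guarantees a normalized, differentiable allocation of attention.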
Action: talking is cheap; doing is hard. Text output is enough for a chatbot, but powerless to open a door. Modern tool-using agents treat tokens as API calls: code execution in ReAct [3], plugin invocations in CoALA [4], motion scripts for robots. Once an agent can pay invoices or steer drones, safety becomes paramount. The designer must guarantee grounding (the model grasps an action’s real impact), syntax fidelity (calls obey the tool’s grammar), and alignment (behaviour stays within human intent).
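The two checkable guarantees, syntax fidelity and a crude form of alignment, can be enforced before any tool runs. The registry below is a minimal sketch under assumed names (`ToolRegistry`, the `add` tool); it is not the mechanism of any particular paper, only an illustration of gating model-emitted calls.

```python
import json
from typing import Callable

class ToolRegistry:
    """Parse a model-emitted call, validate it, then (and only then) execute."""

    def __init__(self):
        self._tools: dict[str, tuple[Callable, set[str]]] = {}

    def register(self, name: str, fn: Callable, params: set[str]):
        self._tools[name] = (fn, params)

    def execute(self, raw_call: str):
        call = json.loads(raw_call)            # tokens -> structured call
        name, args = call["tool"], call.get("args", {})
        if name not in self._tools:            # alignment: no undeclared actions
            raise ValueError(f"unknown tool: {name}")
        fn, params = self._tools[name]
        if set(args) != params:                # syntax fidelity: match the grammar
            raise ValueError("arguments do not match tool signature")
        return fn(**args)

registry = ToolRegistry()
registry.register("add", lambda a, b: a + b, {"a", "b"})
result = registry.execute('{"tool": "add", "args": {"a": 2, "b": 3}}')  # -> 5
```

Grounding, by contrast, cannot be checked syntactically; it has to come from the model's training or from human review of consequential calls.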
Memory: more than a prompt window. People store decades of episodes; an LLM forgets everything outside its context length. External stores, including vector databases, structured logs, or the memory stream in generative agents [5], let a model recall prior events and sustain identity across sessions. Headaches follow: curation (what to keep), retrieval (finding the right shard when it matters), and catastrophic forgetting [6] if online updates rewrite the past. Memory must be selective yet searchable, stable yet plastic.
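The store/retrieve split can be shown with a toy episodic memory: episodes are kept with bag-of-words vectors and ranked by cosine similarity at query time. Real systems use learned embeddings and a vector database; this sketch (class name and episodes invented) only illustrates the interface the text describes.

```python
import math
from collections import Counter

class EpisodicMemory:
    """Append-only episode store with similarity-based retrieval."""

    def __init__(self):
        self.episodes: list[tuple[str, Counter]] = []

    def store(self, text: str):
        self.episodes.append((text, Counter(text.lower().split())))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = Counter(query.lower().split())

        def cosine(a: Counter, b: Counter) -> float:
            dot = sum(a[w] * b[w] for w in a)
            na = math.sqrt(sum(v * v for v in a.values()))
            nb = math.sqrt(sum(v * v for v in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self.episodes, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

mem = EpisodicMemory()
mem.store("took the shortcut through the park on Tuesday")
mem.store("the deli on 5th street closes at nine")
print(mem.retrieve("when does the deli close"))
```

Curation and forgetting are deliberately absent here: deciding what to store, and how to update entries without erasing the past, is exactly where the listed headaches begin.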
World model and planning: imagining before acting. Humans rehearse futures in the mind’s eye; a chess engine searches moves ahead; a model-based reinforcement learning (RL) [7] agent predicts dynamics. An LLM agent benefits from an internal simulator (symbolic, neural [8], or hybrid) to test “what-if” hypotheses before acting. Three hurdles block the way: accuracy (bad dreams mislead), compute (long roll-outs burn time), and arbitration (when to trust fast intuition and when to press the slow plan button).
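"Imagining before acting" reduces to rolling a dynamics model forward for each candidate plan and choosing the plan whose simulated outcome looks best. The corridor world, its dynamics, and the goal below are all invented for illustration; a real agent would learn the model and search far larger plan spaces.

```python
from itertools import product

GOAL = 5

def model(state: int, action: int) -> int:
    """Assumed dynamics: action is -1, 0, or +1 step along a 10-cell corridor."""
    return max(0, min(9, state + action))

def rollout(state: int, plan: list[int]) -> int:
    """Imagine the plan: apply the model step by step without touching the world."""
    for a in plan:
        state = model(state, a)
    return state

def best_plan(state: int, horizon: int = 3) -> list[int]:
    """Exhaustively score every plan by distance of its imagined end state to GOAL."""
    plans = product([-1, 0, 1], repeat=horizon)
    return list(min(plans, key=lambda p: abs(GOAL - rollout(state, list(p)))))

print(best_plan(2))  # [1, 1, 1]: three steps right reach the goal
```

The three hurdles show up even here: a wrong `model` makes the imagined score meaningless, the plan space grows as 3^horizon, and nothing yet decides when exhaustive search is worth the compute over a quick heuristic.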
Goals and motivation: the why behind the what. Every action needs a reason. Classical agent definitions stress sensing and acting [9], but autonomous systems also need enduring agendas [10, 11]. In code, goals appear as reward functions, symbolic objectives, or scripted drives. Poorly chosen, they invite reward hacking [12] and the paper-clip parable [13]. Safe goal design blends clear constraints, human oversight, and research on intrinsic motives such as curiosity.
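One way to blend "clear constraints" into a reward function is to make constraint violations dominate the objective outright, so no amount of raw score can justify breaking a human-set limit. The paper-clip scenario, budget, and numbers below are illustrative only.

```python
def constrained_reward(clips_made: int, resources_used: int,
                       budget: int = 100) -> float:
    """Objective with a hard constraint: over-budget behaviour scores worst possible."""
    if resources_used > budget:
        return float("-inf")              # no reward can offset a violated constraint
    return clips_made - 0.1 * resources_used  # raw goal minus a gentle cost term

print(constrained_reward(50, 40))    # within budget
print(constrained_reward(500, 400))  # the "hacking" attempt is dominated, not traded off
```

The design point is that the constraint is lexicographic rather than a penalty weight: a mere penalty can be out-earned by a sufficiently determined optimizer, which is the reward-hacking failure mode the paragraph warns about.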
Learning and adaptation: yesterday’s lessons, tomorrow’s edge. A frozen model stagnates; a courier robot must improve with every delivery. Continual-learning methods (replay, regularisation, dynamic layers [14]) aim to graft new skills without erasing old ones. Live updates, however, risk drift from the tested baseline, so production systems often confine change to side modules while keeping the core weights fixed.
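"Confine change to side modules" can be sketched as a frozen base predictor plus a small trainable correction updated online by gradient descent; the tested core never moves. The toy model, the drifted target, and all constants are assumptions for illustration.

```python
def base_model(x: float) -> float:
    """Frozen core: its weight (2.0) is never updated after deployment."""
    return 2.0 * x

class Adapter:
    """Trainable side module adding a learned linear correction to the core."""

    def __init__(self):
        self.w = 0.0  # correction weight; starts as "no change"

    def predict(self, x: float) -> float:
        return base_model(x) + self.w * x

    def update(self, x: float, target: float, lr: float = 0.01):
        error = self.predict(x) - target
        self.w -= lr * error * x  # gradient step on squared error, w.r.t. w only

adapter = Adapter()
for _ in range(200):              # the environment drifted: the true mapping is y = 3x
    adapter.update(2.0, 6.0)
print(round(adapter.predict(2.0), 2))  # approaches 6.0 without touching base_model
```

Because only `w` changes, rolling back a bad live update means resetting the adapter, while the audited baseline behaviour of `base_model` stays reproducible.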
The loop, in one sentence: perception feeds observations to cognition; cognition updates a structured mental state and emits actions; the environment responds; the cycle repeats. This lean scaffold mirrors the brain’s cortex–subcortex dialogue and leaves later chapters free to zoom into specialised memories, hierarchical planners, or social protocols.
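That one sentence is, quite literally, a program: observe, update mental state, choose an action, apply it, repeat. The skeleton below is a minimal sketch of the perception–cognition–action loop; the toy environment (a counter driven toward a goal) is invented for illustration.

```python
def perceive(env: dict) -> int:
    return env["value"]                       # observation out of the environment

def cognize(state: dict, obs: int) -> tuple[dict, int]:
    state = {**state, "last_obs": obs}        # update structured mental state
    action = 1 if obs < state["goal"] else 0  # emit an action (a "mental decision")
    return state, action

def act(env: dict, action: int) -> dict:
    return {**env, "value": env["value"] + action}  # the environment responds

env, state = {"value": 0}, {"goal": 3, "last_obs": None}
for _ in range(5):                            # the cycle repeats
    obs = perceive(env)
    state, action = cognize(state, obs)
    env = act(env, action)
print(env["value"])  # 3: the loop drives the environment to the goal, then idles
```

Every faculty in this section slots into one of these three functions: memory and world models live inside `cognize`, multimodal sensing inside `perceive`, and tool use inside `act`.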
Building on the discussion above, the following sections outline the framework’s key concepts, introducing a unified agent architecture based on the perception–cognition–action loop enriched by reward signals and learning processes. Each subsystem is carefully defined and interconnected to make transparent how memory, world models, emotions, goals, rewards, and learning interact. Cognition is formalized as a general reasoning mechanism, with planning and decision-making framed as specific “mental actions” shaping behavior. Connections to established theories, such as Minsky’s Society of Mind [15], Buzsáki’s inside-out perspective [16], and Bayesian active inference [17], are explored to highlight the framework’s generality and biological plausibility.
Next part: Section 1.3.2 – Core concepts and notations in the agent loop.
Article source: Liu, B., Li, X., Zhang, J., Wang, J., He, T., Hong, S., … & Wu, C. (2025). Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990. CC BY-NC-SA 4.0.
Header image: AI is Everywhere by Ariyana Ahmad & The Bigger Picture / Better Images of AI, CC BY 4.0.
References:
1. Eric R. Kandel, James H. Schwartz, Thomas M. Jessell, Steven A. Siegelbaum, and A. J. Hudspeth. Principles of Neural Science. 2013.
2. OpenAI. ChatGPT (GPT-4).
3. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR), 2023.
4. Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive Architectures for Language Agents. Transactions on Machine Learning Research, 2024.
5. Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023.
6. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming Catastrophic Forgetting in Neural Networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
7. Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, Volume 1. MIT Press, Cambridge, 1998.
8. David Ha and Jürgen Schmidhuber. World Models. arXiv preprint arXiv:1803.10122, 2018.
9. Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ, 1st edition, 1995. ISBN 0-13-103805-2.
10. Pattie Maes. Artificial Life Meets Entertainment: Lifelike Autonomous Agents. Communications of the ACM, 38(11):108–114, 1995.
11. Stan Franklin and Art Graesser. Is It an Agent, or Just a Program? A Taxonomy for Autonomous Agents. In International Workshop on Agent Theories, Architectures, and Languages, pages 21–35. Springer, 1997.
12. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete Problems in AI Safety. arXiv preprint arXiv:1606.06565, 2016.
13. Nick Bostrom. Superintelligence: Paths, Dangers, Strategies. 2014.
14. German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. Continual Lifelong Learning with Neural Networks: A Review. Neural Networks, 113:54–71, 2019.
15. Marvin Minsky. The Society of Mind. Simon and Schuster, 1988.
16. György Buzsáki. The Brain from Inside Out. Oxford University Press, 2019.
17. Karl J. Friston, Jean Daunizeau, James Kilner, and Stefan J. Kiebel. Action and Behavior: A Free-Energy Formulation. Biological Cybernetics, 102:227–260, 2010.