
Advances & challenges in foundation agents: Section 2.1.3 – Learning space
This article is Section 2.1.3 of a series of articles featuring Liu and colleagues’ book Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems.
The learning approaches in LLM agents represent a structured, data-driven paradigm in contrast to the exploratory, emotionally-driven learning observed in humans. While human learning often involves active curiosity, motivation, and emotional reinforcement, LLM-based agents typically learn through more formalized processes, such as parameter updates during training or structured memory formation during exploration. Current agent architectures attempt to bridge this gap by implementing mechanisms that simulate aspects of human learning while leveraging the strengths of computational systems.
Learning within an intelligent agent occurs across different cognitive spaces, encompassing both large-scale model updates and more localized changes to components of the modularized mental state M. In systems where the model is the only trainable component, the model parameters θ can be viewed as constituting or encoding the entire mental state. More generally, the mental state can include a combination of subsystems:
M = { Mθ, Mmem, Mwm, Memo, Mgoal, Mrew }
where Mθ denotes the core model parameters, Mmem represents memory, Mwm denotes the world model, Memo indicates emotional state, Mgoal represents goals, and Mrew represents reward signals. Modifications to Mθ—the core model—often lead to holistic changes that affect all components of the mental state. In contrast, more targeted updates to memory, world model, or reward components allow the agent to adapt specific subsystems while preserving general capabilities. For instance, learning experiences and skills from the environment primarily influence memory, while leveraging the LLM’s inherent predictive capabilities enhances the world model. The distinction between these two paradigms—full mental state learning and partial component-level adaptation—is illustrated in Figure 2.3.
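To make the decomposition concrete, here is a minimal Python sketch of a mental state container. The field names mirror the notation above, but the concrete types and the partial-update helper are illustrative assumptions rather than anything specified in the paper.

```python
from dataclasses import dataclass, field
from typing import Any

# Illustrative decomposition of an agent's mental state M into subsystems.
# Types are placeholders chosen for readability, not part of the original formulation.
@dataclass
class MentalState:
    theta: dict[str, Any] = field(default_factory=dict)      # M_theta: core model parameters
    memory: list[str] = field(default_factory=list)          # M_mem: episodic / skill memory
    world_model: Any = None                                   # M_wm: predictive model of the environment
    emotion: dict[str, float] = field(default_factory=dict)   # M_emo: emotional state
    goals: list[str] = field(default_factory=list)            # M_goal: active goals
    reward: float = 0.0                                        # M_rew: reward signal

# Full mental state learning updates theta (and thereby everything downstream);
# partial mental state learning touches only a single component, e.g. memory.
def partial_update_memory(state: MentalState, experience: str) -> None:
    state.memory.append(experience)
```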

Full mental state learning. Full mental state learning enhances the capabilities of an agent through comprehensive modifications to Mθ, which in turn influences all components of M. This process begins with pre-training, which establishes the foundation of language models by acquiring vast world knowledge, analogous to how human babies absorb environmental information during development, though in a more structured and extensive manner.
Post-training techniques represent the cornerstone for advancing agent capabilities. Similar to how human brains are shaped by education, these techniques, while affecting the entire model, can emphasize different aspects of cognitive development. Specifically, various forms of tuning-based learning enable agents to acquire domain-specific knowledge and logical reasoning capabilities. Supervised Fine-Tuning (SFT)1 serves as the fundamental approach where models learn from human-labeled examples, encoding knowledge directly into the model’s weights. For computational efficiency, Parameter-Efficient Fine-Tuning (PEFT) methods have emerged. Adapter-BERT2 introduced modular designs that adapt models to downstream tasks without modifying all parameters, while Low-Rank Adaptation (LoRA)3 achieves similar results by decomposing weight updates into low-rank matrices, adjusting only a small subset of effective parameters.
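As a concrete illustration of the LoRA idea, the following PyTorch sketch wraps a frozen linear layer with a trainable low-rank update B·A. The class name, rank, and scaling choices are illustrative defaults rather than the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer: the frozen base weight W is augmented with a
    low-rank update B @ A, and only A and B receive gradients."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # zero init, so W' = W at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output of the frozen layer plus the scaled low-rank correction.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

Only A and B (a small fraction of the original parameter count) are updated during fine-tuning, which is what makes the approach parameter-efficient.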
Some agent capabilities are closely connected to how well they align with human preferences, with alignment-based learning approaches modifying Mθ to reshape aspects of the agent’s underlying representations. Reinforcement learning from human feedback (RLHF)4 aligns models with human values by training a reward model on comparative judgments and using this to guide policy optimization. InstructGPT5 demonstrated how this approach could dramatically improve consistency with user intent across diverse tasks. Direct Preference Optimization (DPO)6 has further simplified this process by reformulating it as direct preference learning without explicit reward modeling, maintaining alignment quality while reducing computational complexity.
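To see why DPO avoids an explicit reward model, consider this sketch of its loss, assuming the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed. Variable names are mine, not from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective: prefer the chosen response over the rejected one,
    measured relative to a frozen reference model, with no separate
    reward model being trained."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The preference signal enters only through the log-probability ratios, which is what lets DPO fold reward modeling and policy optimization into a single supervised-style objective.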
Reinforcement learning (RL) presents a promising pathway for specialized learning in specific environments. RL has shown particular promise in enhancing reasoning capabilities, essentially enabling the agent to refine its internal thinking processes. Foundational works such as Reinforcement Fine-Tuning (ReFT)7 enhance reasoning through fine-tuning with automatically sampled reasoning paths under online reinforcement learning rewards. DeepSeek-R18 advances this approach through rule-based rewards and Group Relative Policy Optimization (GRPO)9, while Kimi k1.510 combines contextual reinforcement learning with optimized chain-of-thought techniques to improve both planning processes and inference efficiency. In specific environments, modifying models to enhance agents’ understanding of actions and external environments has proven effective, as demonstrated by DigiRL11, which implements a two-stage reinforcement learning approach enabling agents to perform diverse commands on real-world Android device simulators.
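A key ingredient of GRPO is replacing a learned value baseline with group-relative advantages: several responses are sampled for the same prompt, and each is scored against the others. A minimal sketch, assuming scalar (e.g. rule-based) rewards per sampled response:

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize each response's reward against the mean and standard deviation
    of the group sampled for the same prompt; no value network is needed."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

# Example: rule-based 0/1 correctness rewards for 4 sampled answers to one prompt.
# grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]]))
```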
Recent works12,13,14,15,16,17 have attempted to integrate agent action spaces directly into model training, enabling learning of appropriate actions for different states through RL or SFT methods. This integration fundamentally affects the agent’s memory, reward understanding, and world model comprehension, pointing toward a promising direction for the emergence of agentic models.
Partial mental state learning. While full mental state learning through updates to Mθ provides comprehensive capability updates, learning focused on particular components of M represents another essential and often more efficient approach. Such partial mental state learning can be achieved either through targeted model updates or through in-context adaptation without parameter changes.
In-context learning (ICL) illustrates how agents can effectively modify specific mental state components without modifying the underlying model. This mechanism allows agents to adapt to new tasks by leveraging examples or instructions within their context window, paralleling human working memory’s role in rapid task adaptation. Chain-of-thought (CoT)18 demonstrates the effectiveness of this approach, showing how agents can enhance specific cognitive capabilities while maintaining their base model parameters unchanged.
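A minimal sketch of this kind of in-context, chain-of-thought adaptation might look like the following; the prompt, the worked example, and the call_llm callable are placeholders for whatever completion interface the agent uses, not an API from the cited work.

```python
# Few-shot chain-of-thought prompting: the worked example in the context steers
# the model toward step-by-step reasoning without any change to M_theta.
COT_PROMPT = """Q: A farmer has 3 pens with 4 sheep each and buys 5 more sheep. How many sheep now?
A: Let's think step by step. 3 pens * 4 sheep = 12 sheep. 12 + 5 = 17. The answer is 17.

Q: {question}
A: Let's think step by step."""

def answer_with_cot(question: str, call_llm) -> str:
    # call_llm is an assumed completion function: prompt string in, text out.
    return call_llm(COT_PROMPT.format(question=question))
```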
The feasibility of partial mental state learning is evidenced through various approaches targeting different components such as memory (Mmem), reward (Mrew), and world model (Mwm). Through normal communication and social interaction, Generative agents19 demonstrate how agents can accumulate and replay memories, extracting high-level insights to guide dynamic behavior planning. In environmental interaction scenarios, Voyager20 showcases how agents can continuously update their skill library through direct engagement with the Minecraft environment, accumulating procedural knowledge without model retraining. Mem021 provides agents with persistent and efficient long-term memory through scalable dynamic memory management and graph-based memory representations, significantly enhancing their ability to handle complex, multi-session tasks.
Learn-by-Interact22 further extends this approach by synthesizing experiential data through direct environmental interaction, eliminating the need for manual annotation or reinforcement learning frameworks. Additionally, agents can learn from their mistakes and improve through reflection, as demonstrated by Reflexion23, which guides agents’ future thinking and actions by obtaining textual feedback from repeated trial and error experiences. Further, KnowSelf24 introduces knowledgeable self-awareness, enabling agents to intelligently assess whether external knowledge is needed or to self-correct based on specific contexts, leading to more efficient and strategic planning.
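The common thread in these systems is that learning lands in Mmem as text rather than in model weights. A simplified, Reflexion-flavored sketch (class and method names are illustrative, not the cited systems' APIs):

```python
# Partial mental state learning targeting M_mem only: reflections on failed
# trajectories are stored as text and prepended to future prompts, so the
# agent adapts without any parameter updates.
class EpisodicMemory:
    def __init__(self):
        self.insights: list[str] = []

    def reflect(self, task: str, trajectory: str, failed: bool, call_llm) -> None:
        if failed:
            prompt = (f"Task: {task}\nTrajectory: {trajectory}\n"
                      "Explain what went wrong and state one rule to avoid it next time.")
            self.insights.append(call_llm(prompt))   # verbal feedback, Reflexion-style

    def as_context(self) -> str:
        # Injected into the agent's prompt on the next attempt.
        return "\n".join(f"- {insight}" for insight in self.insights)
```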
Modifications to reward and world models provide another example of partial mental state learning. ARMAP25 refines environmental reward models by distilling them from agent action trajectories, providing a foundation for further learning. AutoMC26 constructs dense reward models through environmental exploration to support agent behavior. Meanwhile, LLMs are explicitly leveraged as world models27 to predict the impact of future actions, effectively modifying the agent’s world understanding (Mwm). ActRe28 builds upon the language model’s inherent world understanding to construct tasks from trajectories, enhancing the agent’s capabilities as both a world model and reasoning engine through iterative training.
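Putting the world model (Mwm) and reward model (Mrew) pieces together, a one-step planning loop in this spirit might look like the sketch below; call_llm and score_state are assumed helpers for illustration, not interfaces from the cited papers.

```python
# Using an LLM as a world model: for each candidate action, ask the model to
# predict the next state, then score the prediction against the goal.
def plan_one_step(state: str, goal: str, candidate_actions: list[str],
                  call_llm, score_state) -> str:
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:
        predicted = call_llm(
            f"Current state: {state}\nAction: {action}\n"
            "Predict the resulting state in one sentence.")
        score = score_state(predicted, goal)   # e.g. a learned reward model (M_rew)
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```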
Next part: Section 2.1.4 – Learning objective.
Article source: Liu, B., Li, X., Zhang, J., Wang, J., He, T., Hong, S., … & Wu, C. (2025). Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990. CC BY-NC-SA 4.0.
Header image: AI is Everywhere by Ariyana Ahmad & The Bigger Picture / Better Images of AI, CC BY 4.0.
References:
- Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned Language Models Are Zero-Shot Learners. arXiv preprint arXiv:2109.01652, 2021. ↩
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. ↩
- Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021. ↩
- Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-Tuning Language Models from Human Preferences. arXiv preprint arXiv:1909.08593, 2019. ↩
- Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. ↩
- Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023. ↩
- Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. ReFT: Reasoning with Reinforced Fine-Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. ↩
- Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025. ↩
- Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. CoRR, abs/2402.03300, 2024. ↩
- Kimi Team. Kimi k1.5: Scaling Reinforcement Learning with LLMs. arXiv preprint arXiv:2501.12599, 2025. ↩
- Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. arXiv preprint arXiv:2406.11896, 2024. ↩
- Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2503.05592, 2025. ↩
- Bowen Jin, Hansi Zeng, Zhenrui Yue, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning. arXiv preprint arXiv:2503.09516, 2025. ↩
- Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, et al. SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis. arXiv preprint arXiv:2505.16834, 2025. ↩
- Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, Tianjie Zhang, Wei Xu, Jie Tang, and Yuxiao Dong. WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning. arXiv preprint arXiv:2411.02337, 2025. ↩
- Shiyi Cao, Sumanth Hegde, Dacheng Li, Tyler Griggs, Shu Liu, Eric Tang, Jiayi Pan, Xingyao Wang, Akshay Malik, Graham Neubig, Kourosh Hakhamaneshi, Richard Liaw, Philipp Moritz, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. SkyRL: Train Real-World Long-Horizon Agents via Reinforcement Learning, 2025. ↩
- Zhixun Chen, Ming Li, Yuxuan Huang, Yali Du, Meng Fang, and Tianyi Zhou. ATLaS: Agent Tuning via Learning Critical Steps. arXiv preprint arXiv:2503.02197, 2025. ↩
- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 – December 9, 2022, 2022. ↩
- Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 1–22, 2023. ↩
- Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv preprint arXiv:2305.16291, 2023. ↩
- Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv preprint arXiv:2504.19413, 2025. ↩
- Hongjin Su, Ruoxi Sun, Jinsung Yoon, Pengcheng Yin, Tao Yu, and Sercan Ö Arık. Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments. arXiv preprint arXiv:2501.10893, 2025. ↩
- Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning. In Neural Information Processing Systems, 2023. ↩
- Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, et al. Agentic Knowledgeable Self-awareness. arXiv preprint arXiv:2504.03553, 2025. ↩
- Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, and Chuang Gan. Scaling Autonomous Agents via Automatic Reward Modeling and Planning. arXiv preprint arXiv:2502.12130, 2025. ↩
- Hao Li, Xue Yang, Zhaokai Wang, Xizhou Zhu, Jie Zhou, Yu Qiao, Xiaogang Wang, Hongsheng Li, Lewei Lu, and Jifeng Dai. Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16426–16435, 2024. ↩
- Yu Gu, Boyuan Zheng, Boyu Gou, Kai Zhang, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, Huan Sun, and Yu Su. Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents. arXiv preprint arXiv:2411.06559, 2024. ↩
- Zonghan Yang, Peng Li, Ming Yan, Ji Zhang, Fei Huang, and Yang Liu. ReAct Meets ActRe: Autonomous Annotations of Agent Trajectories for Contrastive Self-Training. arXiv preprint arXiv:2403.14589, 2024. ↩




