Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback

ICLR 2025

¹Cornell University   ²ASAPP Research   *Equal Contribution

LEAP iteratively fine-tunes LLM agents using on-policy feedback from privileged AI teachers.

LEAP overview. The LLM student agent interacts with the environment, generating a reason-action trajectory (in orange) based on its policy π_{i-1}. An expert teacher, with privileged state available only during training, evaluates and corrects the trajectory (in green). These corrections update the learner's policy to π_i through SFT/DPO training. The updated policy π_i is then rolled out at test time without access to privileged state.
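A minimal Python sketch of this loop is given below. All names here (Step, rollout, teacher_correct, finetune) are hypothetical stand-ins rather than the authors' code; the point is the cycle: roll out the current student, let a privileged teacher correct each visited step, and fine-tune on those corrections before the next iteration.

# Minimal sketch of a LEAP-style iteration loop (hypothetical helper names).
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    observation: str          # what the student sees (no privileged state)
    student_action: str       # action the student actually took
    teacher_action: str = ""  # correction filled in later by the teacher

def rollout(policy: Callable[[str], str], observations: List[str]) -> List[Step]:
    # Roll out the current student policy; it conditions only on public observations.
    return [Step(obs, policy(obs)) for obs in observations]

def teacher_correct(traj: List[Step], privileged_state: str) -> List[Step]:
    # The teacher additionally sees the privileged state (e.g., the hidden goal)
    # and proposes a corrected action at every state the student visited.
    for step in traj:
        step.teacher_action = f"act toward '{privileged_state}' from '{step.observation}'"
    return traj

def finetune(policy: Callable[[str], str], traj: List[Step]) -> Callable[[str], str]:
    # Stand-in for SFT/DPO training: the updated student imitates the teacher's
    # corrections at the states where it was corrected.
    corrections = {s.observation: s.teacher_action for s in traj}
    return lambda obs: corrections.get(obs, policy(obs))

# Toy usage: pi_0 -> pi_1 -> pi_2; the student never sees the privileged goal itself.
student = lambda obs: f"default action for '{obs}'"
observations = ["room with two doors", "dark hallway"]
for _ in range(3):
    traj = rollout(student, observations)
    traj = teacher_correct(traj, privileged_state="the key is behind the red door")
    student = finetune(student, traj)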

Abstract

While large language models (LLMs) show impressive decision-making abilities, current methods lack a mechanism for automatic self-improvement from errors during task execution. We propose LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers. Our key insight is to equip the expert teachers with a privileged state -- information available during training but hidden at test time. This allows even weak experts to provide precise guidance, significantly improving the student agent's performance without access to privileged information at test time. We evaluate LEAP on diverse decision-making benchmarks, including text-based games, web navigation, and interactive coding. Our experiments show that LEAP (1) outperforms state-of-the-art baselines, (2) enables weak student models (e.g., Llama3-8B) to exceed the performance of strong teacher models (e.g., GPT-4o), and (3) allows weak models to self-improve using privileged versions of themselves. We also provide a theoretical analysis showing that LEAP's success hinges on balancing privileged information with the student's realizability, which we empirically validate.

Fine-tuning LLM Agents from AI Feedback

[Figure: LEAP framework]

Challenge: How can the LLM teacher compute optimal corrections?

Key Idea: Equip LLM Teacher with Privileged State

[Figure: privileged state]
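One way to picture the privileged state is as an extra block in the teacher's prompt that the student never receives. The sketch below is illustrative; the prompt wording and function names are not taken from the paper. Because the only difference is the prompt, the same base model can also be prompted as its own privileged teacher, which is how a weak model can self-improve.

# Illustrative prompt construction: only the teacher's prompt carries privileged state.
def student_prompt(observation: str, history: str) -> str:
    # The student acts from public observations alone, at training and test time.
    return (
        "You are an agent solving a task.\n"
        f"History:\n{history}\n"
        f"Observation:\n{observation}\n"
        "Think step by step, then output an action."
    )

def teacher_prompt(observation: str, history: str,
                   privileged_state: str, student_action: str) -> str:
    # The teacher also sees privileged information (hidden goal, reference
    # solution, simulator internals, ...) plus the student's attempt, and is
    # asked to critique and correct it.
    return (
        student_prompt(observation, history)
        + f"\n\nPrivileged information (training only):\n{privileged_state}"
        + f"\n\nThe student proposed: {student_action}\n"
          "Evaluate the student's action and output a corrected reason and action."
    )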

Evaluation

Why does LEAP work?

Hypothesis 1: LEAP balances privileged information with realizable corrections
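One informal way to see this trade-off (an illustrative decomposition, not the paper's formal result) is to write the gap between the optimal policy π* and the learned student π̂ through the privileged teacher π^E:

\[
\underbrace{J(\pi^{*}) - J(\hat{\pi})}_{\text{student suboptimality}}
= \underbrace{J(\pi^{*}) - J(\pi^{E})}_{\text{teacher suboptimality}}
+ \underbrace{J(\pi^{E}) - J(\hat{\pi})}_{\text{realizability / imitation gap}}
\]

Giving the teacher more privileged information shrinks the first term, but can inflate the second: corrections that depend on information the student never observes may be unrealizable by any student policy. Hypothesis 1 is that LEAP's corrections sit at a point where both terms stay small.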

Hypothesis 2: LEAP provides on-policy corrections
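Because corrections are collected at states the current student actually visits, they convert directly into on-policy training data: the teacher's correction becomes the preferred response and the student's own attempt the rejected one. A minimal sketch with hypothetical field names:

# Turn on-policy corrections into DPO preference pairs (illustrative field names).
def build_dpo_pairs(trajectory):
    # chosen   = the teacher's corrected action at a visited state
    # rejected = the student's original action at that same state
    # The states come from the student's own rollout, so the pairs target
    # exactly the mistakes the current policy pi_{i-1} makes.
    return [
        {
            "prompt": step["observation"],
            "chosen": step["teacher_action"],
            "rejected": step["student_action"],
        }
        for step in trajectory
    ]

# Toy usage.
traj = [{"observation": "room with two doors",
         "student_action": "open the blue door",
         "teacher_action": "open the red door"}]
print(build_dpo_pairs(traj))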

BibTeX

@inproceedings{choudhury2024better,
  title={Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback},
  author={Choudhury, Sanjiban and Sodhi, Paloma},
  booktitle={International Conference on Learning Representations (ICLR)},
  journal={arXiv preprint arXiv:2410.05434},
  year={2025}
}