Ming's List

View Full Calendar

Statistics Seminar - Jiantao Jiao (Berkeley)

Event Type

Seminar/Symposium

Sponsor

Department of Statistics

Virtual

Date

Nov 11, 2021 3:30 pm

Views

120

Originating Calendar

Department of Statistics Event Calendar

Title: Near-optimal algorithms for Imitation Learning

Abstract: We study the fundamental limits and efficient algorithms for imitation learning in Markov decision processes (MDP). It is known that Behavior Cloning (BC) achieves suboptimality growing quadratically in the horizon in imitation learning, which is termed error compounding in the literature. However, the only existing lower bound for BC is about optimization error, and it is not clear whether error compounding is a fundamental phenomenon related to sample inefficiency. We show that when the MDP transition function is unknown, all algorithms have to suffer a suboptimality that grows quadratically with the horizon even if the optimization error is zero and the algorithm can interactively query the expert such as in the setting of DAGGER. This negative result suggests that overcoming the error compounding requires imposing more assumptions in imitation learning. We show two approaches to mitigate the quadratic error compounding: active query with recoverable mistakes and known transition. First, we establish that if the expert knows how to recover from a temporary mistake and continue on the optimal path, then a carefully designed no-regret learning algorithm that actively queries the expert can provably beat the quadratic error compounding and achieve linear suboptimality growth in the horizon. Second, we transform the intuition that we should always utilize transition function knowledge to only visit states the expert has visited into a precise optimization problem named Mimic-MD, and introduce a novel technique named artificial trajectory synthesis to provably break the quadratic dependence on the horizon and improve the exponent to 3/2. Last, we show that under the additional assumption that the expert is optimal for the true reward function in the known transition setting, one can in fact further improve the exponent from 3/2 to 1 via a new algorithm named Mimic-Mixture in some special cases of MDP. We discuss the implementation details about Mimic-MD and Mimic-Mixture, and another conjectured algorithm utilizing inverse reinforcement learning which we believe achieves the optimal performance for very general settings.

Join Zoom Meeting
https://illinois.zoom.us/j/81579704788?pwd=ZHRwUU9rOHRHVklQOFB1cnRrMnQ5QT09

Meeting ID: 815 7970 4788
Password: 646606

link for robots only