Partially observability is ubiquitous in applications of Reinforcement Learning (RL), in which agents learn to make a sequence of decisions despite lacking complete information about the latent states of the controlled system. Partially observable RL is notoriously difficult in theory---well-known information-theoretic results show that learning partially observable Markov decision processes (POMDPs) requires an exponential number of samples in the worst case. Yet, this does not rule out the possible existence of interesting subclasses of POMDPs, which include a large set of partial observable applications in practice while being tractable.
In this talk we identify a rich family of tractable POMDPs, which we call weakly revealing POMDPs. This family rules out the pathological instances of POMDPs where observations are uninformative to a degree that makes learning hard. We prove that for weakly revealing POMDPs, a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to guarantee a polynomial sample complexity. We will also show how these frameworks and techniques further lead to sample efficient learning for RL with function approximation or multiagent RL under partial observability. To the best of our knowledge, this gives the first line of provably sample-efficient results for learning from interactions in partially observable RLs. This is based on joint works with Qinghua Liu, Alan Chung, Sham Kakade, Akshay Krishnamurthy, Praneeth Netrapalli, and Csaba Szepesvari.
Chi Jin is an assistant professor at the Electrical and Computer Engineering department of Princeton University. He obtained his PhD degree in Computer Science at University of California, Berkeley, advised by Michael I. Jordan. His research mainly focuses on theoretical machine learning, with special emphasis on nonconvex optimization and reinforcement learning. His representative work includes proving noisy gradient descent escape saddle points efficiently and proving the efficiency of Q-learning and least-squares value iteration when combined with optimism in reinforcement learning.