Model-based Policy Optimization under Approximate Bayesian Inference

1University of Chicago, 2Google DeepMind

Abstract

Model-based reinforcement learning (MBRL) algorithms hold tremendous promise for improving sample efficiency in online RL. However, many popular MBRL algorithms do not properly balance exploration and exploitation. Posterior sampling reinforcement learning (PSRL) is a promising approach for automatically trading off exploration and exploitation, but its theoretical guarantees hold only under exact inference. In this paper, we show that adopting the same methodology as exact PSRL can be markedly suboptimal under approximate inference. Motivated by this analysis, we propose an improved factorization of the posterior distribution over policies that removes the conditional independence between the policy and the data given the model. Building on this factorization, we propose a general algorithmic framework for PSRL under approximate inference, along with a practical instantiation of it. Empirically, our algorithm surpasses baseline methods by a significant margin on both dense-reward and sparse-reward tasks from the DeepMind Control Suite, OpenAI Gym, and Meta-World benchmarks.
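To make the high-level recipe concrete, below is a minimal, illustrative Python sketch of a posterior-sampling MBRL loop in which an ensemble stands in for the approximate posterior and each member's policy is updated using both model rollouts and real data, so the policy is not treated as independent of the data given the model. All class and function names here (EnsembleMember, update_member, psrl_step) are hypothetical placeholders rather than the paper's implementation, and the environment interaction assumes the classic OpenAI Gym step/reset API.

# Illustrative sketch only: an ensemble approximates the posterior over
# (dynamics model, policy) pairs; acting samples one member, and every
# member's policy conditions on real data as well as its sampled model.
import random


class EnsembleMember:
    """One (dynamics model, policy) pair; the ensemble approximates the joint posterior."""

    def __init__(self):
        self.model = None   # placeholder dynamics model
        self.policy = None  # placeholder policy network


def update_member(member, replay_buffer):
    """Hypothetical update: fit the model to real data, then improve the policy
    using BOTH model rollouts and the real data, i.e. the improved factorization
    p(policy | model, data) rather than p(policy | model)."""
    pass  # model fitting and a SAC-style policy update would go here


def psrl_step(ensemble, env, replay_buffer, horizon=1000):
    # Posterior sampling: draw one ensemble member from the approximate posterior,
    # then act with its policy for one interaction segment.
    member = random.choice(ensemble)
    obs = env.reset()  # assumes classic Gym API (single return value)
    for _ in range(horizon):
        action = member.policy(obs) if member.policy else env.action_space.sample()
        next_obs, reward, done, _ = env.step(action)
        replay_buffer.append((obs, action, reward, next_obs, done))
        obs = next_obs
        if done:
            break
    # Update every member on the shared real-data buffer.
    for m in ensemble:
        update_member(m, replay_buffer)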

Figure: PS-MBPO on dense-reward tasks.

Figure: OPS-MBPO on sparse-reward tasks.

Figure: Increasing the ensemble size of policy networks (M = 1 to 5).

Poster


BibTeX

@article{wang22psmbpo,
  author    = {Wang, Chaoqi and Chen, Yuxin and Murphy, Kevin},
  title     = {Model-based Policy Optimization under Approximate Bayesian Inference},
  journal   = {},
  year      = {2023},
}