Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning


Chuming Li* 1,2
Ruonan Jia* 1,3
Jie Liu1
Yinmin Zhang1,2
Yazhe Niu1
Yaodong Yang4
Yu Liu1
Wanli Ouyang1

1Shanghai AI Lab
2University of Sydney
3Tsinghua University
4Peking University

ECAI 2023



Paper [arXiv]

Cite [BibTeX]


Abstract

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm—Model-based Planning Distilled to Policy (MPDP)—that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.




Overview

Previous Approach

A thorough component comparison of relevant algorithms is given in the following table.

Our Approach (MPDP)

We propose a model-based extended policy improvement method that distills model-based planning into the RL policy and applies model regularization to reduce model errors. We also show that this method has a theoretical guarantee of monotonic improvement and convergence.
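To give a sense of the objective being optimized, the expression below writes the distilled policies as maximizing an H-step soft objective under the learned model, with a model-error penalty weighted by β. This is only an illustration of the shape of such an objective (the learned reward r̂, the error measure u, and the terminal soft value V_soft are stand-ins introduced here); the exact formulation is given in the paper.

% Illustrative shape of the regularized multi-step objective (not the paper's exact form):
% \hat{r} is the learned reward, \mathcal{H} the policy entropy, u(s_t, a_t) a model-error
% penalty weighted by \beta, and V_{\text{soft}} the soft value at the end of the rollout.
\begin{equation*}
J(\pi_{0:H-1}) =
\mathbb{E}_{\hat{p},\,\pi_{0:H-1}}\!\left[
\sum_{t=0}^{H-1}\Big(\hat{r}(s_t, a_t)
+ \alpha\,\mathcal{H}\big(\pi_t(\cdot \mid s_t)\big)
- \beta\, u(s_t, a_t)\Big)
+ V_{\text{soft}}(s_H)
\right]
\end{equation*}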



Method

We summarize our extended policy improvement in Algorithm 1. At each iteration, the algorithm processes a batch of states and rolls them out with the learned model until the task terminates. To separate the policy at each time step, we maintain H policy networks, one for each of the H planning steps. Each network generates the action for its own step, and after the rollout all networks are updated jointly in our extended improvement step, using the gradients propagated through the action sequence.
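The following is a minimal PyTorch sketch of this improvement step under simplifying assumptions: deterministic tanh policies (SAC's squashed Gaussian and the entropy terms are omitted), a single dynamics network standing in for the ensemble, and a soft value network as the terminal estimate. The names `dynamics`, `soft_v`, and `policies` are illustrative and not the authors' implementation.

# Minimal PyTorch sketch of the extended policy-improvement step (Algorithm 1).
# Assumptions: deterministic policies (entropy terms omitted), a single dynamics
# network instead of an ensemble; all names here are illustrative placeholders.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 11, 3, 4

# One policy network per planning time step (H networks in total).
policies = nn.ModuleList([
    nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                  nn.Linear(64, action_dim), nn.Tanh())
    for _ in range(horizon)
])
# Stand-ins for the learned dynamics model (predicts next state and reward)
# and the soft value function used as the terminal estimate.
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim + 1))
soft_v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

optimizer = torch.optim.Adam(policies.parameters(), lr=3e-4)

def extended_policy_improvement(states):
    """Roll a batch of states through the model for H steps and update all
    H policy networks jointly, with gradients flowing through the whole
    action sequence."""
    s, objective = states, 0.0
    for t in range(horizon):
        a = policies[t](s)                          # action for step t
        pred = dynamics(torch.cat([s, a], dim=-1))
        s, r = pred[:, :-1], pred[:, -1:]           # predicted next state and reward
        objective = objective + r.mean()            # accumulate model-predicted return
    objective = objective + soft_v(s).mean()        # terminal soft value estimate
    optimizer.zero_grad()
    (-objective).backward()                         # ascend the H-step objective
    optimizer.step()

extended_policy_improvement(torch.randn(256, state_dim))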

The complete procedure is described in Algorithm 2. The method alternates among interacting with the environment using the first-step policy, training an ensemble of dynamics models, and updating the policy with policy evaluation and our extended policy improvement on batches sampled from the replay buffer.
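A structural sketch of this loop is given below. Here `env`, `replay_buffer`, `ensemble`, `q_functions`, and `policies` are assumed interfaces standing in for the actual components, the update frequencies are placeholders rather than the paper's hyperparameters, and `extended_policy_improvement` refers to the sketch above.

# Structural sketch of the overall training loop (Algorithm 2).
# All objects and hyperparameters here are illustrative placeholders.

def train(env, replay_buffer, ensemble, q_functions, policies,
          total_steps=100_000, model_update_every=250,
          updates_per_step=20, batch_size=256):
    state = env.reset()
    for step in range(total_steps):
        # 1. Interact with the real environment using the first-step policy.
        action = policies[0].sample(state)
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # 2. Periodically refit the ensemble of dynamics models on real data.
        if step % model_update_every == 0:
            ensemble.fit(replay_buffer)

        # 3. Soft policy evaluation and extended policy improvement on
        #    batches sampled from the replay buffer.
        for _ in range(updates_per_step):
            batch = replay_buffer.sample(batch_size)
            q_functions.update(batch)                  # policy evaluation
            extended_policy_improvement(batch.states)  # joint H-step update (Algorithm 1)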



Lemma and Theorem

We propose an approach to distilling the solution of model-based planning into the policy, which is a multi-step extension of the policy improvement step of the original SAC, and we verify its theoretical properties. See the Appendix for more details.
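For reference, SAC's single-step policy improvement projects the softmax of the current Q-function back into the policy class via a KL projection; our result extends the corresponding improvement guarantee to the policies updated jointly over H model-predicted steps.

% SAC's single-step policy improvement (Haarnoja et al., 2018):
\begin{equation*}
\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi}
D_{\mathrm{KL}}\!\left(
\pi'(\cdot \mid s_t)\;\middle\|\;
\frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\big)}
     {Z^{\pi_{\mathrm{old}}}(s_t)}
\right)
\end{equation*}

The lemma and theorem show that replacing this single-step update with the joint multi-step update preserves monotonic improvement and convergence to the maximum value defined in SAC.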



Results

Comparison with previous model-based and model-free methods on MuJoCo-v2

Performance curves for our method (MPDP) and baselines on MuJoCo continuous control benchmarks. Solid lines depict the mean over four random seeds, and shaded regions correspond to the standard deviation across seeds. The dashed lines indicate the asymptotic performance of PETS at the corresponding training steps (15k steps for InvertedPendulum, 100k steps for Hopper, and 200k steps for the other tasks) and of SAC at 2M steps.



Ablation Study

Effect of the designed adaptive horizon

This figure shows the length of MPDP's adaptive horizon over the course of training. The solid lines denote the average horizon length evaluated on each training batch. As interactions accumulate, the model generalizes better and our method rapidly adapts to longer horizons.
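One plausible way to realize such an adaptive horizon is to extend the model rollout step by step and stop once the ensemble's disagreement about the next state grows too large. The sketch below illustrates this idea only; it is not the paper's exact criterion, and `ensemble_predict` and `policy` are assumed callables.

# Hypothetical illustration of an adaptive rollout horizon: grow the model
# rollout while the ensemble members still agree on the next state.
# This is NOT the paper's exact criterion, only a sketch of the idea.
import numpy as np

def adaptive_horizon(ensemble_predict, policy, state,
                     max_horizon=10, disagreement_threshold=0.5):
    """ensemble_predict(state, action) returns one next-state prediction per
    ensemble member, shape [n_models, state_dim]."""
    s = state
    for t in range(max_horizon):
        a = policy(s)
        preds = ensemble_predict(s, a)
        disagreement = np.std(preds, axis=0).mean()   # spread across members
        if disagreement > disagreement_threshold:
            return t                                  # stop where the model becomes unreliable
        s = preds.mean(axis=0)
    return max_horizon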

Effect of the regularization on the model error

This figure shows the model-error curves of MPDP with β varying from 0.2 to 0.7, measured every 250 interactions by the average L2 norm of the predicted states. The model error decreases as β increases, which verifies that optimizing under our regularization effectively restricts the behavior policy to areas with low model error.
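For concreteness, the snippet below computes such a model-error metric under the assumption that it is the average L2 distance between the model's predicted next states and the observed next states over a batch of recent real transitions; the exact evaluation protocol is described in the paper.

# Illustrative model-error metric, assuming it is the average L2 distance
# between predicted and observed next states over a batch of real transitions.
import numpy as np

def average_model_error(predicted_next_states, true_next_states):
    """Both arrays have shape [batch, state_dim]."""
    return np.linalg.norm(predicted_next_states - true_next_states, axis=-1).mean()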

This figure shows the performance of MPDP with β varying from 0.2 to 0.7, along with MBPO, on the Hopper task, evaluated over 4 trials. As β increases, the performance first improves and then degrades because the restriction on exploration becomes too strong. The results also show that a larger regularization yields more robust performance.



Cite This Paper


                @inproceedings{li2023mpdp,
                    author    = {Li, Chuming and Jia, Ruonan and Liu, Jie and Zhang, Yinmin and Niu, Yazhe and Yang, Yaodong and Liu, Yu and Ouyang, Wanli},
                    title     = {Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning},
                    booktitle = {ECAI 2023},
                    year      = {2023},
                    note      = {arXiv:2307.12933}
                }
            


Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.