Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning


Chuming Li* 1,2
Ruonan Jia* 1,3
Jie Liu1
Yinmin Zhang1,2
Yazhe Niu1
Yaodong Yang4
Yu Liu1
Wanli Ouyang1

1Shanghai AI Lab
2University of Sydney
3Tsinghua University
4Peking University

ECAI 2023



Paper [arXiv]

Cite [BibTeX]


Abstract

Model-based reinforcement learning (RL) has demonstrated remarkable successes on a range of continuous control tasks due to its high sample efficiency. To save the computation cost of conducting planning online, recent practices tend to distill optimized action sequences into an RL policy during the training phase. Although the distillation can incorporate both the foresight of planning and the exploration ability of RL policies, the theoretical understanding of these methods is yet unclear. In this paper, we extend the policy improvement of Soft Actor-Critic (SAC) by developing an approach to distill from model-based planning to the policy. We then demonstrate that such an approach of policy improvement has a theoretical guarantee of monotonic improvement and convergence to the maximum value defined in SAC. We discuss effective design choices and implement our theory as a practical algorithm—Model-based Planning Distilled to Policy (MPDP)—that updates the policy jointly over multiple future time steps. Extensive experiments show that MPDP achieves better sample efficiency and asymptotic performance than both model-free and model-based planning algorithms on six continuous control benchmark tasks in MuJoCo.




Overview

Previous Approach

A thorough component comparison of relevant algorithms is given in the following table.

Our Approach (MPDP)

We propose a model-based extended policy improvement method that distills model-based planning into the RL policy and applies model regularization to reduce model errors. We also show that this method has a theoretical guarantee of monotonic improvement and convergence.
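To give a sense of the objective being optimized, the expression below writes the distilled policies as maximizing an H-step soft objective under the learned model, with a model-error penalty weighted by β. This is only an illustration of the shape of such an objective (the learned reward r̂, the error measure u, and the terminal soft value V_soft are stand-ins introduced here); the exact formulation is given in the paper.

% Illustrative shape of the regularized multi-step objective (not the paper's exact form):
% \hat{r} is the learned reward, \mathcal{H} the policy entropy, u(s_t, a_t) a model-error
% penalty weighted by \beta, and V_{\text{soft}} the soft value at the end of the rollout.
\begin{equation*}
J(\pi_{0:H-1}) =
\mathbb{E}_{\hat{p},\,\pi_{0:H-1}}\!\left[
\sum_{t=0}^{H-1}\Big(\hat{r}(s_t, a_t)
+ \alpha\,\mathcal{H}\big(\pi_t(\cdot \mid s_t)\big)
- \beta\, u(s_t, a_t)\Big)
+ V_{\text{soft}}(s_H)
\right]
\end{equation*}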



Method

We summarize our extended policy improvement in Algorithm 1. At each iteration, the algorithm processes a batch of states and rolls them out with the learned model until the task terminates. To separate the policy at each time step, we maintain H policy networks, one for each of the H planning steps. Each network generates the action for its own step, and after the rollout all networks are updated jointly in our extended improvement step, using the gradients propagated through the action sequence.
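The following is a minimal PyTorch sketch of this improvement step under simplifying assumptions: deterministic tanh policies (SAC's squashed Gaussian and the entropy terms are omitted), a single dynamics network standing in for the ensemble, and a soft value network as the terminal estimate. The names `dynamics`, `soft_v`, and `policies` are illustrative and not the authors' implementation.

# Minimal PyTorch sketch of the extended policy-improvement step (Algorithm 1).
# Assumptions: deterministic policies (entropy terms omitted), a single dynamics
# network instead of an ensemble; all names here are illustrative placeholders.
import torch
import torch.nn as nn

state_dim, action_dim, horizon = 11, 3, 4

# One policy network per planning time step (H networks in total).
policies = nn.ModuleList([
    nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                  nn.Linear(64, action_dim), nn.Tanh())
    for _ in range(horizon)
])
# Stand-ins for the learned dynamics model (predicts next state and reward)
# and the soft value function used as the terminal estimate.
dynamics = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
                         nn.Linear(64, state_dim + 1))
soft_v = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, 1))

optimizer = torch.optim.Adam(policies.parameters(), lr=3e-4)

def extended_policy_improvement(states):
    """Roll a batch of states through the model for H steps and update all
    H policy networks jointly, with gradients flowing through the whole
    action sequence."""
    s, objective = states, 0.0
    for t in range(horizon):
        a = policies[t](s)                          # action for step t
        pred = dynamics(torch.cat([s, a], dim=-1))
        s, r = pred[:, :-1], pred[:, -1:]           # predicted next state and reward
        objective = objective + r.mean()            # accumulate model-predicted return
    objective = objective + soft_v(s).mean()        # terminal soft value estimate
    optimizer.zero_grad()
    (-objective).backward()                         # ascend the H-step objective
    optimizer.step()

extended_policy_improvement(torch.randn(256, state_dim))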

The complete procedure is described in Algorithm 2. The method alternates among interacting with the environment using the first-step policy, training an ensemble of dynamics models, and updating the policy with policy evaluation and our extended policy improvement on batches sampled from the replay buffer.
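A structural sketch of this loop is given below. Here `env`, `replay_buffer`, `ensemble`, `q_functions`, and `policies` are assumed interfaces standing in for the actual components, the update frequencies are placeholders rather than the paper's hyperparameters, and `extended_policy_improvement` refers to the sketch above.

# Structural sketch of the overall training loop (Algorithm 2).
# All objects and hyperparameters here are illustrative placeholders.

def train(env, replay_buffer, ensemble, q_functions, policies,
          total_steps=100_000, model_update_every=250,
          updates_per_step=20, batch_size=256):
    state = env.reset()
    for step in range(total_steps):
        # 1. Interact with the real environment using the first-step policy.
        action = policies[0].sample(state)
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # 2. Periodically refit the ensemble of dynamics models on real data.
        if step % model_update_every == 0:
            ensemble.fit(replay_buffer)

        # 3. Soft policy evaluation and extended policy improvement on
        #    batches sampled from the replay buffer.
        for _ in range(updates_per_step):
            batch = replay_buffer.sample(batch_size)
            q_functions.update(batch)                  # policy evaluation
            extended_policy_improvement(batch.states)  # joint H-step update (Algorithm 1)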



Lemma and Theorem

We propose an approach to distilling the solution of model-based planning into the policy, which is a multi-step extension of the policy improvement step of the original SAC, and we verify its theoretical properties. See the Appendix for more details.
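For reference, SAC's single-step policy improvement projects the softmax of the current Q-function back into the policy class via a KL projection; our result extends the corresponding improvement guarantee to the policies updated jointly over H model-predicted steps.

% SAC's single-step policy improvement (Haarnoja et al., 2018):
\begin{equation*}
\pi_{\mathrm{new}} = \arg\min_{\pi' \in \Pi}
D_{\mathrm{KL}}\!\left(
\pi'(\cdot \mid s_t)\;\middle\|\;
\frac{\exp\!\big(\tfrac{1}{\alpha} Q^{\pi_{\mathrm{old}}}(s_t, \cdot)\big)}
     {Z^{\pi_{\mathrm{old}}}(s_t)}
\right)
\end{equation*}

The lemma and theorem show that replacing this single-step update with the joint multi-step update preserves monotonic improvement and convergence to the maximum value defined in SAC.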



Results

Comparison with previous model-based and model-free methods on MuJoCo-v2

Performance curves for our method (MPDP) and baselines on MuJoCo continuous control benchmarks. Solid lines depict the mean over four random seeds, and shaded regions correspond to the standard deviation across seeds. The dashed lines indicate the asymptotic performance of PETS at the corresponding training steps (15k steps for InvertedPendulum, 100k steps for Hopper, and 200k steps for the other tasks) and of SAC at 2M steps.



Ablation Study

Effect of the designed adaptive horizon

This figure shows the length of MPDP's adaptive horizon over the course of training. The solid lines denote the average horizon length evaluated on each training batch. As interactions accumulate, the model generalizes better and our method rapidly adapts to longer horizons.
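One plausible way to realize such an adaptive horizon is to extend the model rollout step by step and stop once the ensemble's disagreement about the next state grows too large. The sketch below illustrates this idea only; it is not the paper's exact criterion, and `ensemble_predict` and `policy` are assumed callables.

# Hypothetical illustration of an adaptive rollout horizon: grow the model
# rollout while the ensemble members still agree on the next state.
# This is NOT the paper's exact criterion, only a sketch of the idea.
import numpy as np

def adaptive_horizon(ensemble_predict, policy, state,
                     max_horizon=10, disagreement_threshold=0.5):
    """ensemble_predict(state, action) returns one next-state prediction per
    ensemble member, shape [n_models, state_dim]."""
    s = state
    for t in range(max_horizon):
        a = policy(s)
        preds = ensemble_predict(s, a)
        disagreement = np.std(preds, axis=0).mean()   # spread across members
        if disagreement > disagreement_threshold:
            return t                                  # stop where the model becomes unreliable
        s = preds.mean(axis=0)
    return max_horizon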

Effect of the regularization on the model error

This figure shows the model-error curves of MPDP with β varying from 0.2 to 0.7, measured every 250 interactions by the average L2 norm of the predicted states. The model error decreases as β increases, which verifies that optimizing under our regularization effectively restricts the behavior policy to areas with low model error.
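For concreteness, the snippet below computes such a model-error metric under the assumption that it is the average L2 distance between the model's predicted next states and the observed next states over a batch of recent real transitions; the exact evaluation protocol is described in the paper.

# Illustrative model-error metric, assuming it is the average L2 distance
# between predicted and observed next states over a batch of real transitions.
import numpy as np

def average_model_error(predicted_next_states, true_next_states):
    """Both arrays have shape [batch, state_dim]."""
    return np.linalg.norm(predicted_next_states - true_next_states, axis=-1).mean()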

This figure shows the performance of MPDP with β varying from 0.2 to 0.7, along with MBPO, on the Hopper task, evaluated over 4 trials. As β increases, the performance first improves and then degrades because the restriction on exploration becomes too strong. The results also show that a larger regularization yields more robust performance.



Cite This Paper


                @inproceedings{li2023mpdp,
                    author    = {Li, Chuming and Jia, Ruonan and Liu, Jie and Zhang, Yinmin and Niu, Yazhe and Yang, Yaodong and Liu, Yu and Ouyang, Wanli},
                    title     = {Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning},
                    booktitle = {ECAI 2023},
                    year      = {2023},
                    note      = {arXiv:2307.12933}
                }
            


Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.