QMP is a multi-task reinforcement learning approach that shares behaviors between tasks using a mixture of policies for off-policy data collection. We show that using the Q-function as a switch for this mixture is guaranteed to improve sample efficiency. The Q-switch selects the policy in the mixture whose action maximizes the current task's Q-value at the current state. This works because other policies may have already learned overlapping behaviors that the current task's policy has not yet learned. QMP's behavior sharing provides complementary gains over common approaches such as parameter sharing and data sharing.
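To make the selection rule concrete, below is a minimal, self-contained sketch of the Q-switch in Python. It is an illustration under assumptions, not the authors' implementation: the toy `GaussianPolicy` class, the hand-made Q-function, and the names `q_switch_act` and `q_functions` are all hypothetical stand-ins for learned components.

```python
import numpy as np

# Hedged sketch of QMP's Q-switch for off-policy data collection.
# All class/function names here are illustrative, not the paper's code.

class GaussianPolicy:
    """Toy stand-in for a learned per-task policy."""
    def __init__(self, mean, std=0.1):
        self.mean, self.std = np.asarray(mean, dtype=float), std

    def sample(self, state, rng):
        # Ignores the state; a real policy would condition on it.
        return self.mean + self.std * rng.standard_normal(self.mean.shape)


def q_switch_act(state, task_id, policies, q_functions, rng):
    """Propose one candidate action from every task's policy and execute the
    one that the current task's Q-function scores highest (the Q-switch)."""
    q = q_functions[task_id]                      # current task's critic
    candidates = [pi.sample(state, rng) for pi in policies]
    values = [q(state, a) for a in candidates]    # evaluate all candidates
    return candidates[int(np.argmax(values))]     # action used for collection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    goal = np.array([1.0, 1.0])                   # e.g. top-right goal of a 2D point task
    # Three toy policies; only the last one heads toward this task's goal.
    policies = [GaussianPolicy([0.0, 1.0]),
                GaussianPolicy([1.0, 0.0]),
                GaussianPolicy([0.7, 0.7])]
    # Hand-made Q-function: higher when the action moves the state toward the goal.
    q_functions = {0: lambda s, a: -np.linalg.norm((s + a) - goal)}
    state = np.zeros(2)
    print(q_switch_act(state, task_id=0, policies=policies,
                       q_functions=q_functions, rng=rng))
```

Because the Q-switch only changes which actions are collected, the off-policy update for each task remains unbiased; the mixture is used purely as a behavior policy.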
We show an illustrative example of QMP's mixture of policies on a 2D point reaching task, where the agent must travel from the start (bottom-left) to the goal (top-right).
We visualize the off-policy trajectories collected in:
(a) Without QMP: vanilla SAC converges to a suboptimal policy that cannot reach the goal.
(b) QMP with 3 diverse policies: the Q-function often selects the other policies, compensating for the single policy's suboptimality.
(c) QMP with relevant policies: The Top-Right Policy is selected often because it best optimizes the learned Q-function.
We show that behavior sharing with QMP (solid lines) maintains or improves the performance of various MTRL frameworks such as no sharing (independent policies), parameter sharing, and zero-labeled data sharing.
Only QMP and DnC (reg.) are able to solve all 5 tasks in Multistage Reacher, where the first 4 tasks involve reaching a sequence of goals and the last task requires staying at a fixed goal.
QMP and No-Shared-Behavior outperform all other methods, which share behaviors in a way that adds bias to the objective and causes negative interference.
(Per-task result plots: Reach, Push, Pick Place, Door Open, Drawer Open, Drawer Close, Button Press, Peg Insert Side, Window Close, Window Open; Task 0 through Task 9.)