QMP: Q-switch Mixture of Policies
for Multi-Task Behavior Sharing

1University of Southern California, 2KAIST, 3National Taiwan University
*Equal Contribution
Teaser Image

QMP shares behaviors between tasks via off-policy data collection to improve sample efficiency.

Abstract

QMP is a multi-task reinforcement learning approach that shares behaviors between tasks by using a mixture of policies for off-policy data collection. We show that using the Q-function as a switch for this mixture is guaranteed to improve sample efficiency. The Q-switch selects the policy in the mixture whose proposed action maximizes the current task's Q-value at the current state. This works because other tasks' policies may already have learned overlapping behaviors that the current task's policy has not. QMP's behavior sharing provides gains complementary to common approaches such as parameter sharing and data sharing.
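To make the Q-switch concrete, below is a minimal Python sketch of the selection step, assuming each task policy is a callable that proposes an action and the current task has a Q-function Q(s, a). The toy policies and hand-crafted Q-function are hypothetical stand-ins for illustration, not the paper's implementation.

```python
import numpy as np

def qmp_select_action(state, policies, q_current_task):
    """Q-switch: every task policy proposes an action for `state`, and the
    proposal scored highest by the *current* task's Q-function is executed.
    (Illustrative sketch, not the authors' released code.)"""
    proposals = [pi(state) for pi in policies]             # one candidate action per task policy
    scores = [q_current_task(state, a) for a in proposals]
    best = int(np.argmax(scores))                          # index of the policy the Q-switch picks
    return proposals[best], best

# Toy usage with hand-crafted stand-ins for the policies and the Q-function.
goal = np.array([1.0, 1.0])
policies = [
    lambda s: np.array([0.1, 0.0]),        # a policy that moves right
    lambda s: np.array([0.0, 0.1]),        # a policy that moves up
    lambda s: 0.1 * (goal - s),            # a policy that moves toward the top-right goal
]
q_current = lambda s, a: -np.linalg.norm((s + a) - goal)   # stand-in for the learned Q-function

action, chosen = qmp_select_action(np.zeros(2), policies, q_current)
print(f"Q-switch picked policy {chosen} with action {action}")
```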

Model Architecture Image

Illustrative Example: 2D Point Reaching

We illustrate QMP's mixture of policies on a 2D point-reaching task, where the agent must travel from the start (bottom-left) to the goal (top-right). We visualize the off-policy trajectories collected:
(a) Without QMP: plain SAC converges to a suboptimal policy that cannot reach the goal.
(b) QMP with 3 diverse policies: the Q-switch often selects the other policies, compensating for the single policy's suboptimal behavior.
(c) QMP with relevant policies: the Top-Right policy is selected most often because it best maximizes the learned Q-function.
A code sketch of this mixture-based data collection follows the figure below.

Illustrative Example
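As a rough illustration of how such trajectories are gathered, here is a self-contained sketch of QMP-style data collection on a toy 2D point environment. The environment, the three stand-in policies, the hand-crafted Q-function, and the plain-list replay buffer are all illustrative assumptions; the off-policy (e.g., SAC) update itself is omitted.

```python
import numpy as np

class Point2DEnv:
    """Toy 2D point-reaching environment: start bottom-left, goal top-right."""
    def __init__(self, goal=(1.0, 1.0)):
        self.goal = np.asarray(goal, dtype=float)
        self.state = np.zeros(2)

    def reset(self):
        self.state = np.zeros(2)
        return self.state.copy()

    def step(self, action):
        self.state = self.state + np.clip(action, -0.1, 0.1)
        dist = np.linalg.norm(self.state - self.goal)
        return self.state.copy(), -dist, dist < 0.05      # next_state, reward, done

def collect_with_qmp(env, policies, q_current, replay_buffer, horizon=100):
    """One episode of QMP-style data collection for the current task:
    at every step the Q-switch picks, among all task policies' proposed
    actions, the one the current task's Q-function scores highest, and the
    resulting transition is stored in the current task's replay buffer."""
    state = env.reset()
    for _ in range(horizon):
        proposals = [pi(state) for pi in policies]
        scores = [q_current(state, a) for a in proposals]
        action = proposals[int(np.argmax(scores))]        # Q-switch decision
        next_state, reward, done = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    # The current task's policy and Q-function would then be updated
    # off-policy (e.g., with SAC) from `replay_buffer`; omitted here.

goal = np.array([1.0, 1.0])
# Hypothetical stand-ins for the mixture: the current task's (suboptimal)
# policy plus two other tasks' policies with different preferred directions.
policies = [
    lambda s: np.array([0.1, 0.0]),        # current task's policy, stuck moving right
    lambda s: np.array([0.0, 0.1]),        # another task's policy, moves up
    lambda s: 0.1 * (goal - s),            # the "Top-Right" task's policy
]
q_current = lambda s, a: -np.linalg.norm((s + a) - goal)   # stand-in for the learned Q-function

buffer = []
collect_with_qmp(Point2DEnv(goal), policies, q_current, buffer)
print(f"Collected {len(buffer)} transitions; final state {buffer[-1][3]}")
```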

QMP is complementary to various MTRL frameworks

We show that behavior sharing with QMP (solid lines) maintains or improves the performance of various MTRL frameworks, including no sharing (independent policies), parameter sharing, and zero-labeled data sharing.


QMP is complementary

QMP outperforms other ways to share behaviors

Multistage Reacher

Only QMP and DnC (reg.) are able to solve all 5 tasks in Multistage Reacher, where the first 4 tasks involve reaching a sequence of goals and the last task requires staying at a fixed position.





MetaWorld-CDS

QMP and No-Shared-Behavior outperform all other methods, which share behaviors in a way that adds bias to the objective and causes negative interference.




QMP outperforms behavior sharing baselines across all environments

QMP outperforms behavior sharing baselines

QMP Result Videos

QMP on MetaWorld-10

Reach

Push

Pick Place

Door Open

Drawer Open

Drawer Close

Button Press

Peg Insert Side

Window Close

Window Open

QMP on Maze

Task 0

Task 1

Task 2

Task 3

Task 4

Task 5

Task 6

Task 7

Task 8

Task 9