QMP is a multi-task reinforcement learning approach that shares behaviors between tasks using a mixture of policies for off-policy data collection. We show that using the Q-function as a switch for this mixture is guaranteed to improve sample efficiency. The Q-switch selects the policy in the mixture whose action maximizes the current task's Q-value at the current state. This works because other policies may have already learned overlapping behaviors that the current task's policy has not yet learned. QMP's behavior sharing provides complementary gains over common approaches such as parameter sharing and data sharing.
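To make the selection rule concrete, below is a minimal, self-contained sketch of the Q-switch in Python. It is an illustration under assumptions, not the authors' implementation: the toy `GaussianPolicy` class, the hand-made Q-function, and the names `q_switch_act` and `q_functions` are all hypothetical stand-ins for learned components.

```python
import numpy as np

# Hedged sketch of QMP's Q-switch for off-policy data collection.
# All class/function names here are illustrative, not the paper's code.

class GaussianPolicy:
    """Toy stand-in for a learned per-task policy."""
    def __init__(self, mean, std=0.1):
        self.mean, self.std = np.asarray(mean, dtype=float), std

    def sample(self, state, rng):
        # Ignores the state; a real policy would condition on it.
        return self.mean + self.std * rng.standard_normal(self.mean.shape)


def q_switch_act(state, task_id, policies, q_functions, rng):
    """Propose one candidate action from every task's policy and execute the
    one that the current task's Q-function scores highest (the Q-switch)."""
    q = q_functions[task_id]                      # current task's critic
    candidates = [pi.sample(state, rng) for pi in policies]
    values = [q(state, a) for a in candidates]    # evaluate all candidates
    return candidates[int(np.argmax(values))]     # action used for collection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    goal = np.array([1.0, 1.0])                   # e.g. top-right goal of a 2D point task
    # Three toy policies; only the last one heads toward this task's goal.
    policies = [GaussianPolicy([0.0, 1.0]),
                GaussianPolicy([1.0, 0.0]),
                GaussianPolicy([0.7, 0.7])]
    # Hand-made Q-function: higher when the action moves the state toward the goal.
    q_functions = {0: lambda s, a: -np.linalg.norm((s + a) - goal)}
    state = np.zeros(2)
    print(q_switch_act(state, task_id=0, policies=policies,
                       q_functions=q_functions, rng=rng))
```

Because the Q-switch only changes which actions are collected, the off-policy update for each task remains unbiased; the mixture is used purely as a behavior policy.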
We show an illustrative example of QMP's mixture of policies on a 2D point reaching task, where the agent must travel from the start (bottom-left) to the goal (top-right).
We visualize the off-policy trajectories collected in:
(a) Without QMP: vanilla SAC converges to a suboptimal policy that cannot reach the goal.
(b) QMP with 3 diverse policies: the Q-function often selects the other policies, compensating for the single policy's suboptimality.
(c) QMP with relevant policies: The Top-Right Policy is selected often because it best optimizes the learned Q-function.
We show that behavior sharing with QMP (solid lines) maintains or improves the performance of various MTRL frameworks such as no sharing (independent policies), parameter sharing, and zero-labeled data sharing.
Only QMP and DnC (reg.) are able to solve all 5 tasks in Multistage Reacher, where the first 4 tasks involve reaching a sequence of goals and the last task requires staying at a fixed goal.
QMP and No-Shared-Behavior outperform all other methods, which share behaviors in a way that adds bias to the objective and causes negative interference.
(Per-task result plots: Reach, Push, Pick Place, Door Open, Drawer Open, Drawer Close, Button Press, Peg Insert Side, Window Close, Window Open; Task 0 through Task 9.)