Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences

1Allen Institute for Artificial Intelligence, 2Massachusetts Institute of Technology, 3Korea Advanced Institute of Science and Technology





How can we effectively customize a robot for human users? We propose Promptable Behaviors, a novel personalization framework that deals with diverse preferences without re-training the agent.

Abstract


Customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied AI. In this paper, we present Promptable Behaviors, a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. We use multi-objective reinforcement learning to train a single policy adaptable to a broad spectrum of preferences. We introduce three distinct methods to infer human preferences by leveraging different types of interactions: (1) human demonstrations, (2) preference feedback on trajectory comparisons, and (3) language instructions. We evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in ProcTHOR and RoboTHOR, demonstrating the ability to prompt agent behaviors to satisfy human preferences in various scenarios.


Diverse Human Preferences in Realistic Scenarios

Imagine a robot navigating a house at midnight, asked to find an object without disturbing a child who just fell asleep. The robot needs to explore the house thoroughly to find the target object, yet must avoid colliding with any objects so as not to make unnecessary noise. In contrast to this Quiet Operation scenario, in the Urgent scenario a user is in a hurry and expects the robot to find the target object quickly rather than to avoid collisions. These contrasting scenarios highlight the need to customize robot policies to diverse and specific human preferences. However, conventional approaches fall short in dealing with diverse preferences, since the agent has to be re-trained for each unique human preference.




Promptable Behaviors

We propose Promptable Behaviors, a novel personalization framework that deals with diverse human preferences without re-training the agent. The key idea of our method is to use multi-objective reinforcement learning (MORL) as the backbone for reward personalization. We take a modular approach:

  1. training a policy conditioned on a reward weight vector across multiple objectives and
  2. inferring the reward weight vector that aligns with the user's preference.
Using MORL, agent behaviors become promptable through adjustments to the reward weight vector at inference time, without any policy fine-tuning. This reduces the problem of customizing robot behaviors to inferring a low-dimensional reward weight vector.
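
The sketch below illustrates this idea in PyTorch; the network sizes and the concatenation-based conditioning are placeholder assumptions rather than the exact architecture used in the paper. The policy takes the reward weight vector as an extra input, training scalarizes the per-objective sub-rewards with the same vector, and at inference changing the vector alone changes behavior.

```python
import torch
import torch.nn as nn

# Sketch of a weight-conditioned policy (layer sizes and conditioning scheme are
# illustrative assumptions, not the paper's exact architecture).
class WeightConditionedPolicy(nn.Module):
    def __init__(self, obs_dim: int, n_objectives: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_objectives, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # Condition the policy on the reward weight vector by concatenating it to the observation.
        return self.net(torch.cat([obs, w], dim=-1))  # action logits


def scalarized_reward(sub_rewards: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # MORL scalarization: the training reward is the weighted sum of per-objective sub-rewards.
    return (sub_rewards * w).sum(dim=-1)

# At inference, behavior is prompted by changing w alone (e.g., up-weighting safety);
# no policy fine-tuning is required.
```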


We offer a variety of ways for users to convey their preferences to the agent. Specifically, we introduce three distinct methods of reward weight prediction that leverage different types of interaction: (1) human demonstrations, (2) preference feedback on trajectory comparisons, and (3) language instructions.
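
As one concrete illustration of option (2), a reward weight vector can be fit to pairwise preference labels with a Bradley-Terry-style objective over the trajectories' accumulated sub-rewards. The sketch below is a minimal version under that assumption, not necessarily the exact predictor used in our framework.

```python
import torch

# Sketch: infer reward weights from pairwise preference feedback with a
# Bradley-Terry-style objective (illustrative, not necessarily the exact predictor).
def infer_weights(pref_pairs, n_objectives: int, steps: int = 2000, lr: float = 0.05) -> torch.Tensor:
    """pref_pairs: list of (r_a, r_b, label), where r_a and r_b are per-objective
    accumulated sub-reward tensors of shape (n_objectives,) and label is 1.0 if the
    human preferred trajectory A, else 0.0."""
    logits = torch.zeros(n_objectives, requires_grad=True)  # softmax keeps weights on the simplex
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        w = torch.softmax(logits, dim=0)
        loss = torch.tensor(0.0)
        for r_a, r_b, label in pref_pairs:
            p_a = torch.sigmoid((w * (r_a - r_b)).sum())  # P(A preferred) under scalarized returns
            loss = loss - (label * torch.log(p_a + 1e-8) + (1 - label) * torch.log(1 - p_a + 1e-8))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()  # predicted reward weight vector
```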


Results

Our Method Achieves High Success Rates While Efficiently Optimizing the Agent Behavior

We evaluate our method on personalized object-goal navigation (ObjectNav) and flee navigation (FleeNav) in ProcTHOR and RoboTHOR, environments in the AI2-THOR simulator. The policy is evaluated across various scenarios to ensure that it aligns with human preferences and achieves satisfactory performance in both tasks. We show that the proposed method effectively prompts agent behaviors by adjusting the reward weight vector and infers reward weights from human preferences using three distinct reward weight prediction methods.


For instance, in ObjectNav, when house exploration is prioritized, the proposed method shows the highest success rate (row j in Table 1), 11.3% higher than EmbCLIP, while Prioritized EmbCLIP shows the lowest success rate (row d in Table 1) among all methods and reward configurations. Additionally, our method achieves the highest SPL and path efficiency reward when path efficiency is prioritized (row i in Table 1), outperforming EmbCLIP by 19.3% and 56.1%, respectively. This implies that the proposed method effectively maintains general performance while satisfying the underlying preferences under various prioritizations.




Table 1. Performance in ProcTHOR ObjectNav. We evaluate each method in the validation set with six different configurations of objective prioritization: uniform reward weight across all objectives and prioritizing a single objective 4 times as much as other objectives. Sub-rewards for each objective are accumulated during each episode, averaged across episodes, and then normalized using the mean and variance calculated across all methods. Colored cells indicate the highest values in each sub-reward column.
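
The aggregation described in the caption can be reproduced in a few lines of NumPy; the sketch below reflects our reading of the caption, and the variable names are illustrative.

```python
import numpy as np

# Sketch of the sub-reward aggregation described in the caption of Table 1.
# episode_subrewards[method] has shape (n_episodes, n_objectives): sub-rewards
# accumulated over each episode.
def aggregate_subrewards(episode_subrewards: dict) -> dict:
    per_method = {m: r.mean(axis=0) for m, r in episode_subrewards.items()}  # average over episodes
    stacked = np.stack(list(per_method.values()))                            # (n_methods, n_objectives)
    mu, std = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8               # statistics across methods
    return {m: (v - mu) / std for m, v in per_method.items()}                # per-objective z-scores
```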





Figure 1. Trajectory Visualizations. In each figure, the agent trajectory is visualized when an objective is prioritized 10 times as much as the other objectives. The agent's final location is marked with a star.



Conflicting Objectives Affect Each Other

We observe trade-offs between two conflicting objectives in ObjectNav: safety and house exploration. As we increase the weight for safety, the safety reward increases while the reward for its conflicting objective, house exploration, decreases.



Figure 2. Conflicting Objectives. As we prioritize safety more, the average safety reward increases while the average reward of a conflicting objective, house exploration, decreases. We normalize the rewards for each objective using the mean and variance calculated across all weights.



Predicting Reward Weights from Human Preferences

Now, we compare the three reward weight prediction methods and present results for the full Promptable Behaviors pipeline. As mentioned above, users have three distinct options for describing their preferences to the agent: (1) demonstrating a trajectory, (2) labeling their preferences on trajectory comparisons, and (3) providing language instructions. Table 2 shows the quantitative performance of the three weight prediction methods, each with its own advantages.
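
For option (3), mapping a free-form instruction to reward weights can be sketched as a query to a language model that outputs one weight per objective. In the sketch below, the query_llm helper, the objective list, and the prompt wording are illustrative placeholders rather than the exact setup used in our framework.

```python
import json

# Illustrative objective names (placeholders; the actual objective set may differ).
OBJECTIVES = ["time efficiency", "path efficiency", "house exploration", "object exploration", "safety"]

def weights_from_instruction(instruction: str, query_llm) -> dict:
    """Sketch: map a language instruction to reward weights.
    query_llm is a placeholder for whatever LLM call is available."""
    prompt = (
        "A robot's reward is a weighted sum over these objectives: "
        + ", ".join(OBJECTIVES) + ".\n"
        + f'User instruction: "{instruction}"\n'
        + "Return a JSON object mapping each objective to a non-negative weight; the weights should sum to 1."
    )
    weights = json.loads(query_llm(prompt))            # expects a JSON dict of objective -> weight
    total = sum(weights.values()) or 1.0
    return {k: v / total for k, v in weights.items()}  # renormalize defensively
```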




Table 2. Comparison of Three Weight Prediction Methods in ProcTHOR ObjectNav. We predict the optimal reward weights from human demonstrations, preference feedback on trajectory comparisons, and language instructions. We measure the cosine similarity (Sim) between the predicted weights and the weights designed by human experts. We also calculate the generalized Gini index (GGI), which measures the peakedness of the predicted weights.



We also perform human evaluations by asking participants to compare trajectories generated with the predicted reward weights for different scenarios. Results in Table 3 show that group trajectory comparison, especially with two trajectories per group, achieves the highest win rate, significantly outperforming other methods by up to 17.8%. This high win rate indicates that the generated trajectories closely align with the intended scenarios.




Table 3. Human Evaluation on Scenario-Trajectory Matching. Participants evaluate trajectories generated with the trained policy and the reward weights predicted for five scenarios in ObjectNav.



Full Framework Demonstrations

We conduct experiments with real human users in the following five scenarios:

  1. Urgent: The user is getting late to an important meeting and needs to quickly find an object in the house.
  2. Energy Conservation: The user wants to check an appliance in the house while away, but the robot has limited battery life.
  3. New Home: The user just moved in and wants to see where furniture and objects are located while inspecting the layout of the house through the robot's video.
  4. Post-Rearrangement: After rearranging the house, the user does not remember where certain objects were placed. The user wants to find a specific object, while also inspecting other areas to confirm the new arrangement.
  5. Quiet Operation: At midnight, the user wants to find an object in the house without disturbing a sleeping child with any loud noise.



New Home Scenario + Reward Weight Prediction from Human Demonstration



Energy Conservation Scenario + Reward Weight Prediction from Preference Feedback (Pairwise Trajectory Comparison)



Energy Conservation Scenario + Reward Weight Prediction from Preference Feedback (Group Trajectory Comparison)



Quiet Operation Scenario + Reward Weight Prediction from Language Instruction



BibTeX