MotIF: Motion Instruction Fine-tuning

1MIT, 2Stanford, 3CMU
Work done during internship at CMU.

ArXiv | Code


Evaluating robot motions involves more than just the start and end states; it's about how the task is performed. We propose motion instruction fine-tuning (MotIF) and MotIF-1K dataset to improve VLMs' ability to understand nuanced robotic motions.

Abstract


While success in many robotics tasks can be determined by observing only the final state and how it differs from the initial state (e.g., whether an apple was picked up), many tasks require observing the full motion of the robot to correctly determine success. This is not only because motions are sequential, but also because intermediate actions are necessary for understanding how the robot's motion is grounded in the environment. For example, a robot tasked with brushing someone's hair needs a semantic grounding and understanding of its motion to recognize that different types of hair require different brushing trajectories. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and thus cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregate input of multiple frames, they still fail to correctly detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the initial image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using these abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given the image observation of the trajectory, the task instruction, and a motion description. Our model significantly outperforms state-of-the-art VLMs, achieving at least twice their precision and 56.1% higher recall, and generalizes across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and in ranking trajectories based on how well they align with task and motion descriptions.
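To make the key idea concrete, here is a minimal sketch (not the authors' released implementation) of the trajectory-overlay representation: the 2D path of a tracked keypoint is drawn on top of the initial frame so that a single image carries trajectory-level information. The `frames` and `keypoints` argument names are assumptions for illustration.

```python
# Minimal sketch of the trajectory-overlay representation (illustrative only).
# Assumes `frames` is a list of HxWx3 uint8 BGR images and `keypoints` is a
# list of (x, y) pixel positions of the tracked keypoint, one per frame.
import numpy as np
import cv2


def overlay_keypoint_trajectory(frames, keypoints, color=(0, 0, 255)):
    """Draw the keypoint path over the first frame of a trajectory."""
    canvas = frames[0].copy()
    pts = np.asarray(keypoints, dtype=np.int32)
    # Connect consecutive keypoint positions with line segments.
    for (x0, y0), (x1, y1) in zip(pts[:-1], pts[1:]):
        cv2.line(canvas, (int(x0), int(y0)), (int(x1), int(y1)), color, thickness=3)
    # Mark the final position so the direction of motion is unambiguous.
    cv2.circle(canvas, (int(pts[-1][0]), int(pts[-1][1])), radius=6, color=color, thickness=-1)
    return canvas
```

The resulting image, together with the task instruction and a candidate motion description, is what the fine-tuned VLM judges as matching (1) or not matching (0).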


MotIF-1K Dataset

The dataset contains 653 human and 369 robot demonstrations across 13 task categories. For each trajectory, we collect RGBD image observations, optical flow, and single-keypoint tracks, and annotate task and motion descriptions. For robot demonstrations, joint states are also included. In the following examples, task descriptions are written in white and motion descriptions are written in orange.
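As a reading aid, the sketch below shows one plausible per-trajectory record for the modalities listed above; the field names are hypothetical and do not reflect the dataset's actual file layout.

```python
# Hypothetical schema for one MotIF-1K trajectory, based only on the modalities
# described above. Field names are illustrative, not the dataset's actual keys.
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class MotIFTrajectory:
    agent: str                                  # "human" or "robot"
    task_category: str                          # e.g., "Pick and place"
    task_instruction: str                       # what to do
    motion_description: str                     # how it was done
    rgb: List[np.ndarray]                       # per-frame HxWx3 images
    depth: List[np.ndarray]                     # per-frame HxW depth maps
    optical_flow: List[np.ndarray]              # HxWx2 flow per frame pair
    keypoint_track: List[Tuple[float, float]]   # tracked 2D keypoint per frame
    joint_states: Optional[np.ndarray] = None   # robot demos only, T x DoF
```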


Robot Demonstrations

➤ Non-Interactive Tasks



➤ Object-Interactive Tasks



➤ Human-Interactive Tasks




Human Demonstrations

➤ Object-Interactive Tasks

➤ Human-Interactive Tasks




Motion Diversity

➤ "shake boba" human demonstrations



➤ "deliver lemonade" robot demonstrations



➤ "brush hair" robot demonstrations



➤ "style hair" human demonstrations



Various Visual Motion Representations

Single Keypoint
Optical Flow
2-frame Storyboard
4-frame Storyboard
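The listed representations can be approximated from raw frames as in the rough sketch below: dense optical flow between consecutive frames (here via OpenCV's Farneback method, one possible choice) and an N-frame storyboard that tiles evenly spaced frames into a single image. The exact rendering used in the paper may differ.

```python
# Rough sketch of two of the visual motion representations above (illustrative).
import cv2
import numpy as np


def dense_optical_flow(prev_bgr, next_bgr):
    """Dense optical flow (HxWx2) between two consecutive BGR frames."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    # Farneback parameters: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)


def storyboard(frames, n_frames=4):
    """Tile n_frames evenly spaced frames horizontally into one image."""
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)
    return np.concatenate([frames[i] for i in idx], axis=1)
```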


Grounded Motion Annotations

Category | Task | Motion Description Examples | Human Demo | Robot Demo
Non-Interactive | Outdoor Navigation | move in the shortest path; make a detour to the left and follow the walkway, avoiding moving over the grass | - | ✓
Non-Interactive | Indoor Navigation | move in the shortest path; make a detour to the right of the long table, avoiding collision with chairs | - | ✓
Non-Interactive | Draw Path | make a triangular motion clockwise; move upward and to the right | - | ✓
Object-Interactive | Shake | move up and down 4 times; completely flip the object to the right and flip it back to its initial state | ✓ | ✓
Object-Interactive | Pick and place | move downward and to the left; move downward while getting farther from <obstacle>, then move to the left | ✓ | ✓
Object-Interactive | Stir | make 2 circular motions counter-clockwise; move upward, then move downward while making diagonal oscillations | ✓ | ✓
Object-Interactive | Wipe | move to the right and move to the left, repeating this sequence 2 times; move to the right, making diagonal oscillations | ✓ | ✓
Object-Interactive | Open/Close the cabinet | move to the right; move upward and to the left | ✓ | -
Object-Interactive | Spread Condiment | move to the left and to the right; move to the left while making back and forth oscillations | ✓ | -
Human-Interactive | Handover | move upward and to the left; move downward and to the right following a concave curve | ✓ | -
Human-Interactive | Brush hair | move downward while making horizontal oscillations; make 5 strokes downward, increasing the starting height of each stroke | ✓ | ✓
Human-Interactive | Tidy hair | move downward and to the right following a convex curve; make a circular motion clockwise, move upward, then move downward and to the right | ✓ | ✓
Human-Interactive | Style hair | move to the right shortly, then move to the left following a concave curve; make a circular motion clockwise, gradually increasing the radius of the circle | ✓ | ✓
App. Table 1. List of Tasks and Motion Descriptions. The collected dataset contains 653 human and 369 robot demonstrations across 13 task categories, 36 task instructions, and 239 motion descriptions. Checkmarks denote which agent (human/robot) demonstrations exist for each task. The table provides two motion description examples for each task.



Results

How does MotIF compare to state-of-the-art VLMs?



Fig. 7. Performance on MotIF-1K. (a) Our model outperforms state-of-the-art (SOTA) off-the-shelf models on the validation and test splits. (b) We compare motion representations in terms of single-frame vs. multi-frame input and the effectiveness of trajectory drawing. Single-frame input with trajectory drawing achieves the highest recall on the test split, while other motion representations falter. Our approach identifies valid motions effectively and generalizes better than the baselines.





App. Fig. 10. Comparison between MotIF and state-of-the-art closed VLMs. We compare three VLMs, our model, GPT-4o, and Gemini-1.5 Pro, over the course of a conversation between a user and an LLM. The user specifies the task and the LLM generates an appropriate motion description. Each VLM is then evaluated on predicting whether the robot's motion aligns with the motion description suggested by the LLM (VLM response: 1) or not (VLM response: 0); none of the images are included in our model's training data. Comparing accuracy, precision, and recall, MotIF achieves the highest performance on all metrics.


App. Fig. 10 shows a toy scenario comparing GPT-4o, Gemini-1.5 Pro, and MotIF over the course of a conversation between a user and an LLM (ChatGPT). After three turns of conversation, we measure each VLM's motion discrimination performance. Results show that our model achieves the highest accuracy, precision, and recall, successfully understanding robotic motions in all cases.
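For reference, the accuracy, precision, and recall reported in this comparison can be computed from binary VLM judgments (1 = the motion matches the description, 0 = it does not) against ground-truth labels, as in this short sketch.

```python
# Accuracy / precision / recall from binary VLM judgments (illustrative).
def binary_metrics(predictions, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall
```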



Visualization of MotIF Outputs


➤ While the motion description "move downward, then move to the left" is included in the training data, our VLM generalizes to understanding unseen motion descriptions (rows 2-4 in the table below).


Task Instruction: pick up the cup and place it to the lower left of the laptop
Trajectory visualization image.
Motion Description | MotIF Output | Prediction Correctness
move downward, then move to the left | 1 | ✓
move farther from the laptop, move downward, then move to the left | 1 | ✓
move downward and to the left, passing over the laptop | 0 | ✓
move over the laptop | 0 | ✓
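The table above reflects binary queries to the fine-tuned VLM: each query pairs the trajectory-overlaid image with the task instruction and a candidate motion description, and the model answers 1 (aligned) or 0 (not aligned). Below is a hedged sketch of that query pattern; `motif_vlm` and its `generate` method are stand-ins, not the actual released API. The same scores can also be aggregated to rank candidate trajectories, as in the planning applications mentioned in the abstract.

```python
# Hedged sketch of a single MotIF query; `motif_vlm.generate` is a stand-in
# for the fine-tuned model's inference call, not a real API.
def score_motion(motif_vlm, overlay_image, task_instruction, motion_description):
    prompt = (
        f"Task: {task_instruction}\n"
        f"Motion description: {motion_description}\n"
        "Does the robot's motion shown in the image match the motion "
        "description? Answer 1 for yes, 0 for no."
    )
    response = motif_vlm.generate(image=overlay_image, prompt=prompt)
    return int(response.strip()[0])  # parse the leading 0/1 token
```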


The following trajectory visualizes a motion from an unseen camera viewpoint. Although the training data only contains front-view images for this task, our model effectively understands the robot's motion from an unseen camera viewpoint (top-down view).


Task Instruction: curl hair
Trajectory visualization image.
Motion Description | MotIF Output | Prediction Correctness
move downward, while making horizontal oscillations | 1 | ✓
move downward, while making side-to-side movements | 1 | ✓
move downward | 0 | ✓
move downward, while making vertical oscillations | 0 | ✓


The following trajectory visualizes a motion with an unseen object. While the training data for this task only contains motions of spreading parmesan cheese on pizza, our model effectively understands the robot's motion with an unseen object, parsley, correctly determining the ground-truth motion and aligning paraphrased motion descriptions with the robot's motion.


Task Instruction: sprinkle parsley on pizza
Trajectory visualization image.
Motion Description | MotIF Output | Prediction Correctness
move to the left, while making vertical oscillations and alternating rotations | 1 | ✓
move to the left, while making vertical oscillations | 1 | ✓
move to the left, while making vertical shaking movements | 1 | ✓
move to the left in a straight line | 0 | ✓


The following trajectory visualizes an unseen grounded motion. Our model effectively understands the semantic grounding of the robot's motion in a navigation task, such as moving over a specific object (e.g., a manhole) or making a detour to avoid it.


Task Instruction: deliver lemonade
Trajectory visualization image.
Motion Description | MotIF Output | Prediction Correctness
make a detour to the right of the manhole | 1 | ✓
move forward, making a detour to the right of the manhole | 1 | ✓
move forward in the shortest path | 0 | ✓
move forward in a straight line, moving over the manhole | 0 | ✓



Qualitative Analysis on Off-the-shelf VLMs

Example trajectory of a robot brushing hair. The robot brushes the hair by moving downward while making horizontal oscillations.


To assess the motion understanding capabilities of off-the-shelf VLMs, we conducted an experiment where GPT-4V and GPT-4o were asked to describe a robot's motion from a video of it brushing hair. Although GPT-4o generated a detailed response by extracting keyframes and detecting the trajectory, it ultimately failed to describe the specific shape of the motion or how it was grounded in the scene.



The model is given a video of the robot's motion along with the question prompt. The video does not include any trajectory visualization. Given only the video, the model fails to describe the semantic meaning of the robot's motion.




The model is given a video of the robot's motion with the robot's trajectory overlaid, and the question prompt.
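For reproducibility of this qualitative comparison, the sketch below shows one way to send a handful of sampled video frames to GPT-4o through the OpenAI Chat Completions API together with a question prompt; the prompt wording, frame count, and model name are placeholders rather than the exact setup used here.

```python
# Rough sketch of querying an off-the-shelf VLM with sampled video frames.
# Assumes OPENAI_API_KEY is set; frames are HxWx3 BGR arrays.
import base64

import cv2
from openai import OpenAI


def describe_motion_with_gpt4o(frames, question_prompt, model="gpt-4o"):
    client = OpenAI()
    content = [{"type": "text", "text": question_prompt}]
    for frame in frames:
        ok, buf = cv2.imencode(".jpg", frame)
        b64 = base64.b64encode(buf.tobytes()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```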

BibTeX

@article{hwang2024motif,
  title   = {MotIF: Motion Instruction Fine-tuning},
  author  = {Hwang, Minyoung and Hejna, Joey and Sadigh, Dorsa and Bisk, Yonatan},
  journal = {arXiv preprint arXiv:2409.10683},
  year    = {2024},
}

Acknowledgements

We thank Abitha Thankaraj, Hao Zhu, Leena Mathur, Quanting Xie, Rosa Vitiello, Su Li, Tiffany Min, Vidhi Jain, and Yingshan Chang for helping us collect the dataset and providing thoughtful feedback. Names are listed in alphabetical order.