To complete a task specified by human instruction, the robot will query the scene representation for relevant information.
CLIP-Nav [1] examines CLIP’s capability to make sequential navigation decisions, and studies how this influences the path an agent takes.
At each time step, the agent uses CLIP’s scores to choose its next navigation action; this is done iteratively until the stop condition is reached.
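The per-step decision can be sketched as follows. This is a hypothetical illustration, not the paper’s implementation: `clip_score` stands in for a real CLIP image–text similarity, and the embeddings are toy vectors. At each step, the candidate view that scores highest against the current instruction is chosen.

```python
import numpy as np

def clip_score(img_emb, txt_emb):
    """Cosine similarity, standing in for a real CLIP image-text score."""
    a = img_emb / np.linalg.norm(img_emb)
    b = txt_emb / np.linalg.norm(txt_emb)
    return float(a @ b)

def choose_direction(view_embeddings, instruction_embedding):
    """Score each candidate direction's view against the instruction
    and return the index of the best-scoring one."""
    scores = [clip_score(v, instruction_embedding) for v in view_embeddings]
    best = int(np.argmax(scores))
    return best, scores[best]

# Toy example: three candidate directions; the instruction embedding
# is closest to the middle view, so the agent turns that way.
views = [np.array([1.0, 0.0]), np.array([0.7, 0.7]), np.array([0.0, 1.0])]
instr = np.array([0.6, 0.8])
best, score = choose_direction(views, instr)
```

In a full loop, `choose_direction` would be called repeatedly on fresh observations until the stop condition fires (e.g., the best score exceeding a threshold or the sub-instruction being exhausted).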
---

# SayCan
With better VLMs, our system can continue to get better without any new robotic data.
> Note that we can query with a single object name, or object families, such as “snack” or “fruit”.
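A query of this kind might look like the sketch below. It is purely illustrative: the family map is hand-written here, standing in for the semantic similarity an embedding model would provide, and `SceneObject` and `query_scene` are hypothetical names.

```python
from dataclasses import dataclass

# Hand-written stand-in for embedding-based similarity: a family term
# maps to the object names it covers.
FAMILIES = {
    "snack": {"chips", "granola bar", "pretzels"},
    "fruit": {"apple", "banana", "orange"},
}

@dataclass
class SceneObject:
    name: str
    position: tuple  # (x, y) location in the scene

def query_scene(scene, term):
    """Return objects matching an exact name or a family term."""
    members = FAMILIES.get(term, {term})
    return [obj for obj in scene if obj.name in members]

scene = [
    SceneObject("apple", (0.2, 1.1)),
    SceneObject("chips", (1.5, 0.3)),
    SceneObject("mug", (0.9, 0.9)),
]
fruits = query_scene(scene, "fruit")   # matches the apple via the family
mugs = query_scene(scene, "mug")       # exact-name query still works
```

With real CLIP-style text embeddings, the family lookup would instead be a similarity threshold between the query term and each object’s label, which is what lets open-ended terms like “snack” match objects never enumerated explicitly.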