MO-VLN: A Multi-Task Benchmark for Open-set
Zero-Shot Vision-and-Language Navigation

Xiwen Liang1*, Liang Ma1*, Shanshan Guo2, Jianhua Han3, Hang Xu3,
Shikui Ma4, Xiaodan Liang1

1Shenzhen Campus of Sun Yat-Sen University, 2Northeastern University,
3Huawei Noah's Ark Lab, 4Dataa Robotics

Updates

🚀🚀 [8/17/2023] v0.2.0: More assets! 2 new scenes, 50 new walkers, 954 new objects, 1k+ new instructions

We have released v0.2.0 of the MO-VLN benchmark simulator.

[6/18/2023] v0.1.0: 3 scenes, 2,165 objects, realistic lighting and shadow characteristics, and support for four types of instruction tasks

We have released v0.1.0 of the MO-VLN benchmark simulator.

Todo List

Demo Videos

V 0.1.0

V 0.2.0

Complex Tasks for Grasping and Navigation

Human Demonstration Tracks Recorded via VR

Our Simulator Scenes

cafe
nursing home
restaurant
home scene
separate tables

Dynamic Presentation of Features

Comparison with Other Simulators

Our simulator features real-time lighting and can realistically reproduce effects that appear in images captured by cameras in real scenes, such as cast shadows, reflections and refractions from transparent surfaces, and bright spots caused by specular reflections.

The simulation environment incorporates virtual humans who may partially obstruct a corridor or walk into the robot's route. We support generating up to 50 unique 3D human models varying in gender, skin color, and age, each with smooth walking and running animations. Our walker control interface provides an intuitive way to manage the virtual pedestrians in a scene: it allows selecting which walker model to generate, specifying where walkers should be placed, setting whether they roam freely or follow preset paths, and controlling their movement speed.
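The walker controls described above could be driven by a small configuration layer like the following sketch. All names here (`WalkerSpec`, `build_walker_request`, the field names) are hypothetical illustrations, not the simulator's actual API:

```python
from dataclasses import dataclass, asdict, field
from typing import List, Optional, Tuple

@dataclass
class WalkerSpec:
    """One virtual pedestrian to spawn (hypothetical schema)."""
    model_id: int                          # which of the 50 human models to use
    position: Tuple[float, float, float]   # spawn location (x, y, z)
    mode: str = "free"                     # "free" roaming or "path" following
    path: Optional[List[Tuple[float, float, float]]] = None  # waypoints if mode == "path"
    speed: float = 1.2                     # movement speed in m/s

def build_walker_request(specs: List[WalkerSpec]) -> dict:
    """Validate specs and pack them into a request payload for the simulator."""
    for s in specs:
        if s.mode == "path" and not s.path:
            raise ValueError("path mode requires waypoints")
    return {"walkers": [asdict(s) for s in specs]}

req = build_walker_request([
    WalkerSpec(model_id=3, position=(1.0, 0.0, 2.5)),
    WalkerSpec(model_id=17, position=(4.0, 0.0, 0.0), mode="path",
               path=[(4.0, 0.0, 0.0), (8.0, 0.0, 1.0)], speed=0.9),
])
```

The separation between a free-roaming mode and explicit waypoint paths mirrors the two placement options the interface exposes; a real client would send this payload over whatever transport the simulator uses.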

The simulator supports multiple types of robots, all of which match the actual parameters of real robots from Dataa Robotics. For instance, the first robot shown in the picture has 21 adjustable parameters across its body, including the head, neck, waist, elbows, wrists, and fingers, allowing it to perform complex actions. The head, chest, and waist are equipped with RGB-D cameras.
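Commanding a robot with many adjustable joint parameters typically involves clamping requested angles to per-joint limits before they are sent. The sketch below illustrates that pattern; the joint names and limit values are invented for illustration and do not reflect the real Dataa Robotics parameter set:

```python
# Hypothetical joint limits in radians; the robot's real 21 parameters
# and their ranges would come from its specification.
JOINT_LIMITS = {
    "head_pitch": (-0.5, 0.5),
    "neck_yaw": (-1.0, 1.0),
    "waist_roll": (-0.3, 0.3),
    "left_elbow": (0.0, 2.2),
    "right_wrist_yaw": (-1.5, 1.5),
}

def clamp_pose(targets: dict) -> dict:
    """Clamp requested joint angles to their limits before sending a command."""
    pose = {}
    for joint, angle in targets.items():
        lo, hi = JOINT_LIMITS[joint]  # unknown joints raise KeyError early
        pose[joint] = max(lo, min(hi, angle))
    return pose

# A request that exceeds one limit: head_pitch is clamped, left_elbow passes through.
pose = clamp_pose({"head_pitch": 0.8, "left_elbow": 1.0})
```

Clamping client-side keeps out-of-range commands from ever reaching the simulated actuators, which is the usual safety convention for joint-space control.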

Instructions in our Benchmark