A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution

Supplementary Material Website

Model Behavior Examples

The following videos show our agent executing various instructions. Each example shows:

Top-left: The input RGB image with the interaction action argument mask overlaid at timesteps when the agent performs an interaction action. The interaction action argument mask is computed as the intersection of the two masks in the middle-right and bottom-right positions (a minimal sketch of this computation follows the panel descriptions below).

Bottom-left: Predicted segmentation and depth used to build the semantic voxel map.

Center: Semantic voxel map. Different colors indicate different classes of objects. The brighter colors represent voxels observed in the current timestep, while the more washed-out colors represent voxels remembered from previous timesteps. The agent position is represented as a black pillar. The current navigation goal is shown as a red pillar. The 3D argument mask of the current subgoal is shown in bright yellow. These three special voxel classes are not part of the semantic voxel map and are included here for visualization only. The text overlays show the instruction input, the current subgoal, the current action, and the state of the low-level controller (either exploring or interacting). An illustrative sketch of the voxel map data structure follows the panel descriptions below.

Top-right: Value iteration network value function in the bird's-eye view. White pixels mark the goal location and have value 1. Black pixels mark obstacles and have value -1. Other pixels are free or unobserved space with values between 0 and 1. A simplified value-iteration sketch illustrating this value map follows the panel descriptions below.

Middle-right: Mask identifying all pixels that belong to the current subgoal argument class according to the most recent segmentation image.

Bottom-right: The 3D subgoal argument mask projected into the first-person view.
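For concreteness, the following Python/NumPy sketch shows one way the two right-hand masks could be combined into the interaction argument mask overlaid in the top-left panel: the 3D subgoal argument voxels are projected into the first-person view with a pinhole camera model, then intersected with the segmentation mask of the argument class. The function names, camera convention, and simplifications (projecting voxel centers as points and ignoring occlusion) are assumptions made for this illustration, not our exact implementation.

```python
import numpy as np

def project_voxels_to_fpv_mask(voxel_centers_cam: np.ndarray,
                               intrinsics: np.ndarray,
                               height: int, width: int) -> np.ndarray:
    """Project 3D voxel centers (camera frame, z forward) into an HxW boolean mask.

    Simplified: each voxel is treated as a single point and occlusion is ignored.
    """
    mask = np.zeros((height, width), dtype=bool)
    pts = voxel_centers_cam[voxel_centers_cam[:, 2] > 0]  # keep points in front of the camera
    if len(pts) == 0:
        return mask
    uvw = (intrinsics @ (pts / pts[:, 2:3]).T).T          # pinhole projection
    u = np.round(uvw[:, 0]).astype(int)
    v = np.round(uvw[:, 1]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask[v[inside], u[inside]] = True
    return mask

def interaction_argument_mask(seg_class_mask: np.ndarray,
                              projected_subgoal_mask: np.ndarray) -> np.ndarray:
    """Intersect the argument-class segmentation mask (middle-right panel)
    with the projected 3D subgoal mask (bottom-right panel)."""
    return np.logical_and(seg_class_mask, projected_subgoal_mask)
```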
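The sketch below illustrates one possible layout for the persistent semantic voxel map shown in the center panel: a dense boolean grid over (x, y, z, class), together with a per-timestep "observed now" mask that separates the bright voxels from the washed-out remembered ones. The grid size, class count, and update rule are placeholders for illustration and do not correspond exactly to our model.

```python
import numpy as np

class SemanticVoxelMap:
    """Toy persistent semantic voxel map (illustrative only)."""

    def __init__(self, grid_size=(61, 61, 10), num_classes=100):
        # Persistent per-class occupancy, accumulated over the whole episode.
        self.occupancy = np.zeros(grid_size + (num_classes,), dtype=bool)
        # Voxels observed at the current timestep (drawn brighter in the videos).
        self.observed_now = np.zeros(grid_size, dtype=bool)

    def integrate(self, voxel_indices: np.ndarray, class_ids: np.ndarray):
        """Fuse one first-person observation into the persistent map.

        voxel_indices: (N, 3) integer voxel coordinates back-projected from depth.
        class_ids:     (N,) semantic class of each voxel from segmentation.
        """
        x, y, z = voxel_indices.T
        self.occupancy[x, y, z, class_ids] = True
        self.observed_now[:] = False
        self.observed_now[x, y, z] = True

    def class_voxels(self, class_id: int) -> np.ndarray:
        """All voxels remembered for a given class (e.g. the subgoal argument)."""
        return self.occupancy[..., class_id]
```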
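Finally, a minimal classical value-iteration sketch that produces the kind of bird's-eye-view value map visualized in the top-right panel, with goal cells pinned at 1, obstacle cells at -1, and free or unobserved space taking intermediate values. This hand-written routine is a simplified stand-in included only to clarify what the visualization encodes; the grid size and discount factor are arbitrary illustration values.

```python
import numpy as np

def bev_value_function(goal: np.ndarray, obstacle: np.ndarray,
                       gamma: float = 0.97, iters: int = 200) -> np.ndarray:
    """goal, obstacle: HxW boolean maps. Returns an HxW value map."""
    value = np.where(goal, 1.0, 0.0)
    for _ in range(iters):
        # Propagate value from the best 4-connected neighbour.
        padded = np.pad(value, 1, constant_values=0.0)
        neighbours = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                               padded[1:-1, :-2], padded[1:-1, 2:]])
        value = np.clip(gamma * neighbours.max(axis=0), 0.0, 1.0)
        value[goal] = 1.0          # goal cells stay at value 1
        value[obstacle] = -1.0     # obstacle cells stay at value -1
    return value

# Hypothetical 61x61 bird's-eye-view maps:
goal = np.zeros((61, 61), dtype=bool)
goal[10, 50] = True
obstacle = np.zeros((61, 61), dtype=bool)
obstacle[30, :40] = True
values = bev_value_function(goal, obstacle)
```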

Example 1 - Secure two discs in a bedroom safe

Result: Success

Example 2 - Place the sliced apple on the counter

Result: Failure

Predicted vs Ground-truth Segmentation and Depth

Example 3A - Put a clean bar of soap in a cabinet

Result: Success

This example uses learned segmentation and monocular depth models, as required by the ALFRED benchmark.

Example 3B (ground truth depth and segmentation) - Put a clean bar of soap in a cabinet

Result: Failure

This example provides ground-truth depth and segmentation images at test time.

Perfect depth and segmentation in Example 3B results in a clean semantic voxel map. As a result, the agent knows the exact location of every observed object and succeeds at most interaction actions on the first attempt. However, the agent loses track of which sink the soap bar was placed in and toggles the wrong faucet, resulting in task failure.

In Example 3A, the semantic voxel map is built from predicted monocular depth and segmentation, and as a result is noisier. However, it still supports the reasoning needed to complete the task. In this case, the agent sometimes has to retry an interaction action a few times before succeeding, due to uncertainty about the locations of various objects.