Dr. Chao Zhang, Toshiba Europe Ltd - Multimodal learning using top-down maps for embodied applications
Event details
Description
In this talk, I will cover two recent works on multimodal learning. The focus is on leveraging top-down view maps for downstream vision-and-language tasks.
In the first work, we are interested in generating navigation instructions as training data for robotic navigation tasks. We propose a new approach to navigation instruction generation that frames the problem as an image captioning task, using semantic maps as the visual input. Our initial investigations show promise for using semantic maps, rather than sequences of panorama images, as input for instruction generation, though there remains considerable scope for improvement.
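To make the framing concrete, the sketch below treats a semantic top-down map (a grid of class labels) as the "image" to be captioned, and emits a template-based instruction along a route. Everything here (the label set, grid, route, and templates) is invented for illustration; the actual work uses a learned captioning model, not hand-written templates.

```python
# Toy sketch: instruction generation framed as "captioning" a semantic map.
# Labels, grid, and templates are illustrative only.

SEMANTIC_LABELS = {0: "free space", 1: "wall", 2: "door", 3: "sofa"}

def caption_semantic_map(grid, route):
    """Produce a template-based instruction from a semantic map and a route.

    grid  -- 2D list of semantic class ids (the "image" a captioner would see)
    route -- list of (row, col) waypoints along the intended path
    """
    steps = []
    for (r0, c0), (r1, c1) in zip(route, route[1:]):
        # Direction of travel between consecutive waypoints.
        if r1 != r0:
            move = "go forward"
        elif c1 > c0:
            move = "turn right"
        else:
            move = "turn left"
        # Mention the landmark at the next waypoint, as a captioner
        # conditioned on the map might.
        landmark = SEMANTIC_LABELS.get(grid[r1][c1], "free space")
        steps.append(f"{move} past the {landmark}")
    return ", then ".join(steps) + "."

grid = [
    [1, 2, 1],
    [0, 0, 0],
    [1, 3, 1],
]
print(caption_semantic_map(grid, [(2, 1), (1, 1), (0, 1)]))
```

A learned model would replace the templates with a decoder conditioned on encoded map features, but the input/output interface is the same: map in, instruction text out.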
In the second part, we propose DiaLoc, a new dialog-based localization framework that matches real human-operator behaviour. Specifically, we iteratively refine the location predictions, visualizing the current pose belief after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization, with a fusion encoder fusing the vision and dialog information iteratively. The method narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.
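The iterative, multi-shot idea can be sketched as a belief distribution over candidate locations that is refined after each dialog turn. The sketch below uses a hand-written Bayesian update with invented locations and per-turn scores; DiaLoc itself produces these refinements with a learned fusion encoder over vision and dialog, not a fixed likelihood table.

```python
# Minimal sketch of multi-shot localization via iterative belief refinement.
# Locations and likelihoods are invented stand-ins for the output of a
# learned vision-dialog fusion encoder.

def refine_belief(belief, likelihood):
    """One dialog turn: multiply the prior belief by the turn's likelihood
    and renormalize, yielding the updated pose belief."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]

locations = ["kitchen", "hallway", "bedroom", "office"]
belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior before any dialog

# One likelihood vector per dialog turn, scoring each candidate location.
dialog_likelihoods = [
    [0.1, 0.4, 0.1, 0.4],   # e.g. "I see a long corridor or some desks"
    [0.1, 0.1, 0.1, 0.7],   # e.g. "There is a computer monitor here"
]

for turn, likelihood in enumerate(dialog_likelihoods, start=1):
    belief = refine_belief(belief, likelihood)
    best = locations[belief.index(max(belief))]
    print(f"after turn {turn}: best guess = {best}")
```

After the first (ambiguous) turn the belief is split, and the second turn resolves it; this per-turn visualizable belief is what lets an operator follow and correct the localization as the dialog unfolds.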