Dr. Chao Zhang, Toshiba Europe Ltd - Multimodal learning using top-down maps for embodied applications
Event details
Description
In this talk, I will cover two recent works on multimodal learning. The focus is on leveraging top-down view maps for downstream vision-and-language tasks.
In the first work, we are interested in generating navigation instructions as training data for robotic navigation tasks. We propose a new approach to navigation instruction generation that frames the problem as an image captioning task, using semantic maps as the visual input. Our initial investigations show promise for using semantic maps, rather than sequences of panorama images, as input for instruction generation, though there remains considerable scope for improvement.
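To make the framing concrete, the sketch below treats a semantic top-down map (a grid of class labels) as the "image" to be captioned, and emits a template-based instruction along a route. Everything here (the label set, grid, route, and templates) is invented for illustration; the actual work uses a learned captioning model, not hand-written templates.

```python
# Toy sketch: instruction generation framed as "captioning" a semantic map.
# Labels, grid, and templates are illustrative only.

SEMANTIC_LABELS = {0: "free space", 1: "wall", 2: "door", 3: "sofa"}

def caption_semantic_map(grid, route):
    """Produce a template-based instruction from a semantic map and a route.

    grid  -- 2D list of semantic class ids (the "image" a captioner would see)
    route -- list of (row, col) waypoints along the intended path
    """
    steps = []
    for (r0, c0), (r1, c1) in zip(route, route[1:]):
        # Direction of travel between consecutive waypoints.
        if r1 != r0:
            move = "go forward"
        elif c1 > c0:
            move = "turn right"
        else:
            move = "turn left"
        # Mention the landmark at the next waypoint, as a captioner
        # conditioned on the map might.
        landmark = SEMANTIC_LABELS.get(grid[r1][c1], "free space")
        steps.append(f"{move} past the {landmark}")
    return ", then ".join(steps) + "."

grid = [
    [1, 2, 1],
    [0, 0, 0],
    [1, 3, 1],
]
print(caption_semantic_map(grid, [(2, 1), (1, 1), (0, 1)]))
```

A learned model would replace the templates with a decoder conditioned on encoded map features, but the input/output interface is the same: map in, instruction text out.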
In the second part, we propose DiaLoc, a new dialog-based localization framework that matches real human-operator behaviour. Specifically, we iteratively refine the location predictions, visualizing the current pose belief after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization, with a fusion encoder fusing the vision and dialog information iteratively. The method narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.
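The iterative, multi-shot idea can be sketched as a belief distribution over candidate locations that is refined after each dialog turn. The sketch below uses a hand-written Bayesian update with invented locations and per-turn scores; DiaLoc itself produces these refinements with a learned fusion encoder over vision and dialog, not a fixed likelihood table.

```python
# Minimal sketch of multi-shot localization via iterative belief refinement.
# Locations and likelihoods are invented stand-ins for the output of a
# learned vision-dialog fusion encoder.

def refine_belief(belief, likelihood):
    """One dialog turn: multiply the prior belief by the turn's likelihood
    and renormalize, yielding the updated pose belief."""
    posterior = [b * l for b, l in zip(belief, likelihood)]
    total = sum(posterior)
    return [p / total for p in posterior]

locations = ["kitchen", "hallway", "bedroom", "office"]
belief = [0.25, 0.25, 0.25, 0.25]  # uniform prior before any dialog

# One likelihood vector per dialog turn, scoring each candidate location.
dialog_likelihoods = [
    [0.1, 0.4, 0.1, 0.4],   # e.g. "I see a long corridor or some desks"
    [0.1, 0.1, 0.1, 0.7],   # e.g. "There is a computer monitor here"
]

for turn, likelihood in enumerate(dialog_likelihoods, start=1):
    belief = refine_belief(belief, likelihood)
    best = locations[belief.index(max(belief))]
    print(f"after turn {turn}: best guess = {best}")
```

After the first (ambiguous) turn the belief is split, and the second turn resolves it; this per-turn visualizable belief is what lets an operator follow and correct the localization as the dialog unfolds.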