Grounding language to vision is a fundamental problem for many real-world AI systems such as retrieving images or generating descriptions for the visually impaired. Success on these tasks requires models to relate different aspects of language such as objects and verbs to images. For example, to distinguish between the two images in the middle column…
