Who's Waldo?

Linking People Across Text and Images

Claire Yuqing Cui * Apoorv Khandelwal * Yoav Artzi Noah Snavely Hadar Averbuch-Elor

Oral Presentation @ ICCV 2021
* Equal Contribution

Sam Schulz passes to Curtly Hampton during the UWS Giants vs Eastlake NEAFL match at the Robertson Oval on 1 August 2015.

Justyna Kowalczyk, Kikkan Randall and Ingvild Flugstad Østberg at the Royal Palace Sprint, part of the FIS World Cup 2012/2013, in Stockholm on March 20, 2013. Kikkan Randall won the sprint cup.

You may not recognize them, but can you identify the people in these captions?

Above we show two image–caption pairs capturing interactions between people. Possible cues revealing the correspondence between names in the captions and people in the images include:

(i) the action described between two people ("Sam passes to Curtly"),
(ii) the outcome stated in the caption ("Kikkan Randall won the sprint cup"), which suggests she is the one holding the trophy, and
(iii) the left-to-right ordering of people in the right-hand example, which matches the order of names in its caption; this pattern occurs frequently in real-life images and captions.
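Cue (iii) can be illustrated with a minimal sketch of a left-to-right heuristic: names are paired, in caption order, with detected people sorted by horizontal position. This is only an illustration of the cue, not the Transformer-based method proposed in this work. The function name `left_to_right_assignment` and the bounding-box coordinates are hypothetical, made up for the example.

```python
from typing import Dict, List, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixels. The values used below are
# invented for illustration and do not come from the dataset.
Box = Tuple[float, float, float, float]


def left_to_right_assignment(names: List[str], boxes: List[Box]) -> Dict[str, int]:
    """Pair the i-th name in the caption with the i-th box from the left.

    Returns a mapping from name to the index of its box in `boxes`.
    If the counts differ, the extra names or boxes are left unassigned.
    """
    # Sort box indices by the horizontal center of each box.
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][0] + boxes[i][2]) / 2)
    return {name: box_idx for name, box_idx in zip(names, order)}


# Names in the order they appear in the second caption.
names = ["Justyna Kowalczyk", "Kikkan Randall", "Ingvild Flugstad Østberg"]
# Hypothetical detections, deliberately listed out of spatial order.
boxes = [
    (410.0, 80.0, 520.0, 400.0),  # rightmost person
    (40.0, 90.0, 150.0, 410.0),   # leftmost person
    (230.0, 70.0, 340.0, 390.0),  # middle person
]
assignment = left_to_right_assignment(names, boxes)
```

Such a heuristic ignores cues (i) and (ii) entirely, so it fails whenever the caption order and the spatial order disagree.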

In this work, we present a new task, person-centric visual grounding, which poses the challenge above. We also provide a benchmark dataset of 270K image–caption pairs and propose a Transformer-based method for this task.


We present a task and benchmark dataset for person-centric visual grounding, the problem of linking people named in a caption to people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image–caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image–caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and we are releasing our data to the research community to spur work on contextual models that consider both vision and language.
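The name masking described above can be sketched as follows, assuming the person names in a caption are already known. The placeholder token `[NAME]` and the helper `mask_names` are assumptions made for this illustration; the text above does not specify how masking is implemented in the released data.

```python
from typing import List


def mask_names(caption: str, names: List[str], token: str = "[NAME]") -> str:
    """Replace the first occurrence of each given name with a placeholder token.

    `token` is a hypothetical placeholder chosen for this sketch.
    Names that do not occur in the caption are skipped.
    """
    # Locate each name as a (start, end) character span.
    spans = []
    for name in names:
        start = caption.find(name)
        if start != -1:
            spans.append((start, start + len(name)))

    # Rebuild the caption, substituting the token for each span in order.
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(caption[cursor:start])
        pieces.append(token)
        cursor = end
    pieces.append(caption[cursor:])
    return "".join(pieces)


caption = "Sam Schulz passes to Curtly Hampton during the match."
masked = mask_names(caption, ["Sam Schulz", "Curtly Hampton"])
```

With the names hidden, the only way to tell the two people apart is the context: who is passing and who is receiving.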


This work was supported by the National Science Foundation (IIS-2008313, CAREER-1750499), a Google Focused Award, the generosity of Eric & Wendy Schmidt by recommendation of the Schmidt Futures program, and the Zuckerman STEM Leadership Program.


    author    = {Claire Yuqing Cui and Apoorv Khandelwal and Yoav Artzi and Noah Snavely and Hadar Averbuch-Elor},
    title     = {Who's Waldo? Linking People Across Text and Images},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1374-1384}