The Who's Waldo Dataset (Homepage)

Who's Waldo is a dataset of 270K image–caption pairs depicting interactions of people, automatically mined from Wikimedia Commons. We are currently only distributing this dataset to academics under certain terms of use and on a per-request basis. To prevent unintended uses, we are also only distributing the version of our dataset with masked names.
Explore Dataset
Request Access to Who's Waldo

Downloading the dataset (310 GB)

Once you've received access (via the whoswaldo_signatures.csv file), you may download and extract our dataset (~55 GB/tar) with the following bash script.

# Make parent data directory
mkdir whos_waldo

# Download dataset splits
mkdir whos_waldo/splits
for split in train.txt val.json test.json; do
    curl -o whos_waldo/splits/${split} "${split}"
done

# Download tar archives
while IFS=, read -r i key sig exp;
    do curl -o whos_waldo_${i}.tar "${i}.tar?AWSAccessKeyId=${key}&Signature=${sig}&Expires=${exp}";
done < whoswaldo_signatures.csv

# Extract the dataset, deleting archives
for i in {0..5}; do
    # Only delete an archive if it was extracted successfully
    tar xf whos_waldo_${i}.tar && rm whos_waldo_${i}.tar
done

├── splits/
│ ├── train.txt
│ ├── val.json
│ └── test.json
├── 000000/
│ ├── image.jpg
│ ├── detections.json
│ ├── caption.txt
│ ├── coreferences.json
│ ├── ground_truth.json
│ └── licenses.json
└── 271746

You may download dataset splits with the bash script above or with the following links (train, val, test).

train.txt: # Line-separated list of image ids in the training set
{val,test}.json: { "102990" : [2, 1, 0, 3] }  # image id : ground_truth.json keys
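A minimal sketch of reading the split files described above, assuming they sit in a splits/ directory under the dataset root as in the directory listing. The function name load_splits and its return layout are illustrative and not part of the dataset tooling.

```python
import json
from pathlib import Path


def load_splits(root):
    """Read train.txt and {val,test}.json from <root>/splits (hypothetical helper)."""
    splits_dir = Path(root) / "splits"

    # train.txt: one image id per line; blank lines are skipped.
    with open(splits_dir / "train.txt", "r", encoding="utf-8") as f:
        train_ids = [line.strip() for line in f if line.strip()]

    # val.json / test.json: image id -> list of ground_truth.json keys.
    eval_splits = {}
    for name in ("val", "test"):
        with open(splits_dir / f"{name}.json", "r", encoding="utf-8") as f:
            eval_splits[name] = json.load(f)

    return train_ids, eval_splits["val"], eval_splits["test"]
```

Note that the listed keys appear as integers in the example above, while the keys of a JSON object such as ground_truth.json are strings, so a str() conversion may be needed when looking them up.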

During evaluation, we compute accuracy as an average over independent ground truth links (i.e. over each image-link pair). In other words, you should not compute accuracy per image and then average across images, but rather over all ground truth links pooled together.
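The link-level averaging described above can be sketched as follows. The predictions format (image id mapped to a dict from coreference index to predicted detection index) and the function name link_accuracy are assumptions made for illustration; this is not the official evaluation script.

```python
def link_accuracy(predictions, ground_truths, split):
    """Accuracy averaged over all ground truth links in a split.

    predictions:   {image_id: {coref_idx: predicted detection idx}}  (assumed format)
    ground_truths: {image_id: contents of that image's ground_truth.json}
    split:         {image_id: [coref idx, ...]} as in val.json / test.json
    """
    correct = 0
    total = 0
    for image_id, coref_indices in split.items():
        gt = ground_truths[image_id]
        pred = predictions.get(image_id, {})
        for idx in coref_indices:
            key = str(idx)  # JSON object keys are strings
            total += 1
            # A missing prediction counts as an incorrect link.
            if key in pred and pred[key] == gt[key]:
                correct += 1
    return correct / total if total else 0.0
```

With one image holding three correct links and another holding a single incorrect link, this returns 3/4, whereas a per-image average would give 1/2. That difference is why the pooling matters.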

Please refer to "Dataset Size and Splits" in Section 4 of our paper to learn more about how our splits were generated.

image.jpg : 1874 x 1500 px

caption.txt : "Portola Valley, Calif., native, Maj. Gen. [NAME], Commanding general of the Multi-National Division-Baghdad briefs the new U.S. Ambassador to Iraq, [NAME] (center), on the day's plan to take a driving tour of Haifa Street and a walking tour of the Sayliah Market in central Baghdad June 26."

coreferences.json : [ [[153, 159]], [[42, 48]] ]  # clusters of co-referring name tokens

detections.json : [{
    "keypoints" : [(x, y, score), ... ],
    "bbox" : [x1, y1, x2, y2, score],
}, ... ]  # bounding boxes, COCO whole body landmarks, relative to image dimensions

ground_truth.json : { "0" : 2 }  # coreference idx : detection idx

licenses.json : {
    "license": "Public domain"
}   # "license_url" and "artist" keys are also often present    
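As a rough illustration of working with detections.json, the sketch below converts one detection to pixel coordinates. It assumes that "relative to image dimensions" means coordinates normalized to the range [0, 1]; if the values are already in pixels, this step is unnecessary. The helper name detection_to_pixels is hypothetical.

```python
def detection_to_pixels(detection, width, height):
    """Scale one detections.json entry to pixels.

    Assumption: x values are fractions of the image width and y values are
    fractions of the image height. Scores are passed through unchanged.
    """
    x1, y1, x2, y2, score = detection["bbox"]
    return {
        "bbox": [x1 * width, y1 * height, x2 * width, y2 * height, score],
        "keypoints": [
            [x * width, y * height, s] for x, y, s in detection["keypoints"]
        ],
    }
```

For the 1874 x 1500 px example image, a normalized box of [0.5, 0.25, 1.0, 0.5, 0.9] would map to [937.0, 375.0, 1874.0, 750.0, 0.9] under this assumption.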

As Who's Waldo includes textual data from a real-world (i.e. messy) data source, it is important to read all files with the UTF-8 encoding so that special characters are handled properly. We recommend using the following Python 3 code to read files:

import json

with open('path/to/file.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
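The same pattern extends to a whole example directory. The following is a sketch only: the function name load_example and the returned dictionary layout are illustrative, and it assumes every file shown in the directory listing is present for the given image id.

```python
import json
from pathlib import Path


def load_example(example_dir):
    """Read the text and JSON files of one example directory as UTF-8."""
    example_dir = Path(example_dir)
    example = {}

    # caption.txt is plain text; decode explicitly as UTF-8.
    with open(example_dir / "caption.txt", "r", encoding="utf-8") as f:
        example["caption"] = f.read()

    # The remaining annotation files are JSON.
    for name in ("detections", "coreferences", "ground_truth", "licenses"):
        with open(example_dir / f"{name}.json", "r", encoding="utf-8") as f:
            example[name] = json.load(f)

    return example
```

A call such as load_example('whos_waldo/000000') would then return the caption string alongside the parsed annotations; the image itself is left to whichever image library you prefer.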

Dataset License

The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images—sometimes with restrictions, e.g. requiring attribution and with copyleft. We provide source links, full license text, and attribution (when available) for all images, make no modifications to any image, and release these images under their original licenses. The associated captions are provided as a part of unstructured text in Wikimedia Commons, with rights to the original writers under the CC BY-SA 3.0 license. We modify these (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset (i.e. detections, coreferences, and ground truth correspondences) under a CC BY-NC-SA 4.0 license.

Ethical Statement

People-centric datasets pose ethical challenges. For example, ImageNet has been scrutinized based on issues inherited from the “person” category in WordNet. Our task and dataset were created with careful attention to ethical questions, which we encountered throughout our work. Access to our dataset is provided for research purposes only and with restrictions on redistribution. Additionally, as we mask all names in captions, our dataset cannot be easily repurposed for unintended tasks, such as identification of people by name. Due to biases in our data source, we do not consider the data appropriate for developing non-research systems without further processing or augmentation. More details on distribution and intended uses are provided in a supplemental datasheet (motivated by Datasheets for Datasets).


    author    = {Cui, Yuqing and Khandelwal, Apoorv and Artzi, Yoav and Snavely, Noah and Averbuch-Elor, Hadar},
    title     = {Who's Waldo? Linking People Across Text and Images},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1374-1384}