The Who's Waldo Dataset (Homepage)
Downloading the dataset (310 GB)
Once you've received access (via the whoswaldo_signatures.csv file), you may download and extract our dataset (~55 GB per tar archive) with the following bash script.
# Make parent data directory
mkdir whos_waldo
# Download dataset splits
mkdir whos_waldo/splits
for split in train.txt val.json test.json
do curl -o whos_waldo/splits/${split} "https://whoswaldo.s3.amazonaws.com/release/splits/${split}"
done
# Download tar archives
while IFS=, read -r i key sig exp;
do curl -o whos_waldo_${i}.tar "https://whoswaldo.s3.amazonaws.com/release/whos_waldo_${i}.tar?AWSAccessKeyId=${key}&Signature=${sig}&Expires=${exp}";
done < whoswaldo_signatures.csv
# Extract the dataset, deleting archives
for i in {0..5}; do
tar xf whos_waldo_${i}.tar;
rm whos_waldo_${i}.tar;
done
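After extraction, you can sanity-check that each numbered example folder contains the six expected files. A minimal sketch (the helper name is ours, not part of the dataset tooling):

```python
# Expected contents of each example folder (e.g. whos_waldo/000000/).
EXPECTED_FILES = {
    "image.jpg",
    "detections.json",
    "caption.txt",
    "coreferences.json",
    "ground_truth.json",
    "licenses.json",
}

def is_complete(filenames):
    """Return True if a folder listing contains every expected file."""
    return EXPECTED_FILES <= set(filenames)
```

In practice you would call `is_complete(os.listdir(folder))` for each numbered folder under `whos_waldo/`.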
whos_waldo/
├── splits/
│ ├── train.txt
│ ├── val.json
│ └── test.json
│
├── 000000/
│ ├── image.jpg
│ ├── detections.json
│ ├── caption.txt
│ ├── coreferences.json
│ ├── ground_truth.json
│ └── licenses.json
...
└── 271746/
You may download dataset splits with the bash script above or with the following links (train, val, test).
train.txt: # Line-separated list of image ids in the training set
{val,test}.json: { "102990" : [2, 1, 0, 3] } # image id : ground_truth.json keys
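As a sketch, both split formats can be parsed with the standard library (the file contents below are illustrative stand-ins, not real dataset entries):

```python
import json

# Illustrative stand-ins for the split files described above.
train_txt = "000000\n000001\n"         # train.txt: one image id per line
val_json = '{"102990": [2, 1, 0, 3]}'  # {val,test}.json: id -> ground_truth.json keys

# train.txt: line-separated image ids
train_ids = [line for line in train_txt.splitlines() if line]

# val/test: mapping from image id to an ordered list of ground_truth.json keys
val_split = json.loads(val_json)
```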
During evaluation, we compute accuracy as an average over independent ground-truth links (i.e., over each (image, link) pair). In other words, you should not compute accuracy per image, but rather over all ground-truth links.
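The link-level accuracy described above can be sketched as follows (the prediction format here is our assumption, chosen to mirror ground_truth.json):

```python
# Hypothetical inputs: image id -> {coreference idx: detection idx}.
ground_truths = {"000000": {"0": 2, "1": 0}, "000001": {"0": 1}}
predictions   = {"000000": {"0": 2, "1": 1}, "000001": {"0": 1}}

# Average over all ground-truth links, not per image.
correct = total = 0
for img_id, links in ground_truths.items():
    for coref_idx, det_idx in links.items():
        total += 1
        correct += int(predictions[img_id].get(coref_idx) == det_idx)

accuracy = correct / total  # 2 of the 3 links above are correct
```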
Please refer to "Dataset Size and Splits" in Section 4 of our paper to learn more about how our splits were generated.
image.jpg : 1874 x 1500 px
caption.txt : "Portola Valley, Calif., native, Maj. Gen. [NAME], Commanding general of the Multi-National Division-Baghdad briefs the new U.S. Ambassador to Iraq, [NAME] (center), on the day's plan to take a driving tour of Haifa Street and a walking tour of the Sayliah Market in central Baghdad June 26."
coreferences.json : [ [[153, 159]], [[42, 48]] ] # clusters of co-referring name tokens
detections.json : [{ "keypoints" : [[x, y, score], ...], "bbox" : [x1, y1, x2, y2, score] }, ...] # bounding boxes and COCO whole-body landmarks, relative to image dimensions
ground_truth.json : { "0" : 2 } # coreference idx : detection idx
licenses.json : { "commons_url": "https://commons.wikimedia.org/?curid=39335624", "license": "Public domain" } # "license_url" and "artist" keys are also often present
import json

# Load any of the dataset's JSON annotation files.
with open('path/to/file.json', 'r', encoding='utf-8') as file:
    data = json.load(file)
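Putting the formats together, here is a minimal sketch of resolving one example's ground-truth links, converting the relative bbox coordinates to pixels using the stated image size (the annotation values below are illustrative, not from a real example):

```python
# Illustrative annotations, matching the per-file formats above.
coreferences = [[[153, 159]], [[42, 48]]]    # coreferences.json
detections = [                               # detections.json (keypoints omitted)
    {"bbox": [0.10, 0.20, 0.30, 0.60, 0.98]},
    {"bbox": [0.40, 0.15, 0.55, 0.70, 0.95]},
    {"bbox": [0.60, 0.25, 0.80, 0.75, 0.90]},
]
ground_truth = {"0": 2}                      # ground_truth.json
width, height = 1874, 1500                   # image.jpg dimensions

links = []
for coref_idx, det_idx in ground_truth.items():
    spans = coreferences[int(coref_idx)]     # name-token spans in the caption
    x1, y1, x2, y2, score = detections[det_idx]["bbox"]
    # Bbox values are relative to image dimensions; convert to pixels.
    pixel_box = (x1 * width, y1 * height, x2 * width, y2 * height)
    links.append((spans, pixel_box, score))
```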
Dataset License
The images in our dataset are provided by Wikimedia Commons under various free licenses. These licenses permit the use, study, derivation, and redistribution of these images, sometimes with restrictions (e.g., requiring attribution or imposing copyleft terms). We provide source links, full license text, and attribution (when available) for all images, make no modifications to any image, and release these images under their original licenses. The associated captions are provided as part of unstructured text in Wikimedia Commons, with rights belonging to the original writers under the CC BY-SA 3.0 license. We modify these captions (as specified in our paper) and release such derivatives under the same license. We provide the rest of our dataset (i.e., detections, coreferences, and ground-truth correspondences) under a CC BY-NC-SA 4.0 license.
Ethical Statement
People-centric datasets pose ethical challenges. For example, ImageNet has been scrutinized for issues inherited from the “person” category in WordNet. Our task and dataset were created with careful attention to the ethical questions we encountered throughout our work. Access to our dataset is provided for research purposes only, with restrictions on redistribution. Additionally, because we mask all names in captions, our dataset cannot easily be repurposed for unintended tasks such as identifying people by name. Due to biases in our data source, we do not consider the data appropriate for developing non-research systems without further processing or augmentation. More details on distribution and intended uses are provided in a supplemental datasheet (motivated by Datasheets for Datasets).
Citation
@InProceedings{Cui_2021_ICCV,
author = {Cui, Yuqing and Khandelwal, Apoorv and Artzi, Yoav and Snavely, Noah and Averbuch-Elor, Hadar},
title = {Who's Waldo? Linking People Across Text and Images},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021},
pages = {1374-1384}
}