Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models

Bryan A. Plummer¹    Liwei Wang¹    Chris M. Cervantes¹    Juan C. Caicedo²
Julia Hockenmaier¹    Svetlana Lazebnik¹

¹University of Illinois at Urbana-Champaign         ²Fundación Universitaria Konrad Lorenz


The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals more complex state-of-the-art models in accuracy, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.

Dataset Examples:

In each group of captions describing the same image, coreferent mentions (coreference chains) and their corresponding bounding boxes are marked with the same color. In the left example, each chain points to a single entity (bounding box). Scenes and events like "outside" or "parade" have no box. In the middle example, the people (red) and flags (blue) chains point to multiple boxes each. On the right, blue phrases refer to the bride, and red phrases refer to the groom. The dark purple phrases ("a couple") refer to both of these entities, and their corresponding bounding boxes are identical to the red and blue ones.

You can browse additional examples of our dataset at: [Examples]

Dataset:

Please fill in this form to request access to the Flickr30k Entities Dataset. The annotations are in XML format and the size of the archive is 11 MB. Instructions for obtaining access are emailed automatically as soon as a request is submitted.

Please visit the website for the original Flickr30k Dataset to obtain the images for the dataset. [Flickr30k]
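
Since the annotations are distributed as XML, reading them might look roughly like the sketch below. This page does not document the schema, so the layout used here (one `object` element per annotation, with the chain ID in `name` and an optional `bndbox` holding `xmin`, `ymin`, `xmax`, `ymax`) is an assumption, as are the sample document and the helper name `boxes_by_chain`. Please check the downloaded files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation file; the real schema may differ.
SAMPLE_XML = """<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>1</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>1</name>
    <bndbox><xmin>30</xmin><ymin>40</ymin><xmax>60</xmax><ymax>90</ymax></bndbox>
  </object>
  <object>
    <name>2</name>
  </object>
</annotation>"""


def boxes_by_chain(xml_text):
    """Map each chain ID to its list of (xmin, ymin, xmax, ymax) boxes.

    A chain may have several boxes (e.g. a group of people) or none
    (e.g. a scene or event), so every chain ID gets a list, possibly empty.
    """
    root = ET.fromstring(xml_text)
    chains = {}
    for obj in root.iter("object"):
        chain_id = obj.findtext("name")
        if chain_id is None:
            continue  # skip entries without an ID
        boxes = chains.setdefault(chain_id, [])
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # chain with no box in this entry
        boxes.append(
            tuple(int(bndbox.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        )
    return chains


if __name__ == "__main__":
    for chain_id, boxes in boxes_by_chain(SAMPLE_XML).items():
        print(chain_id, boxes)
```

Keeping an empty list for chains without boxes preserves the distinction, described in the examples above, between entities that are localized and mentions such as scenes that are not.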

Reference:

We have submitted a journal version of our paper with a stronger baseline on the phrase localization task:

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, IJCV, 2016, Submitted. [arXiv link]

We used the following code to evaluate phrase localization: [code]
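
For readers who want a feel for the metric before downloading the code, here is a minimal sketch of a phrase localization score. It is not the evaluation code linked above. It assumes the common criterion that a predicted box counts as correct when its intersection over union (IoU) with the ground-truth box is at least 0.5, treats coordinates as continuous (no +1 pixel offset), and uses the hypothetical names `iou` and `localization_accuracy`.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_accuracy(
    predicted: Sequence[Box],
    ground_truth: Sequence[Box],
    threshold: float = 0.5,  # assumed threshold, not taken from this page
) -> float:
    """Fraction of phrases whose predicted box meets the IoU threshold."""
    if len(predicted) != len(ground_truth):
        raise ValueError("need one predicted box per ground-truth box")
    if not predicted:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(localization_accuracy(preds, truth))
```

The sketch scores one predicted box against one ground-truth box per phrase; how phrases that point to multiple boxes are handled is left to the linked evaluation code.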

Original paper:

Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, ICCV, 2015. [arXiv link] [Supplementary Material]

If you use our annotations, please cite both the above paper and the original Flickr30k Dataset:

Peter Young, Alice Lai, Micah Hodosh and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, Transactions of the Association for Computational Linguistics, 2(Feb):67-78, 2014. [pdf]

Note that the Flickr30k Dataset includes images obtained from Flickr. Use of the images must abide by the Flickr Terms of Use. We do not own the copyright of the images. They are solely provided for researchers and educators who wish to use the dataset for non-commercial research and/or educational purposes.

Non-English Captions:

While our extension of Flickr30k uses the original English captions, others have extended the dataset with captions in other languages, which may be of interest to researchers.

German captions and translations
Chinese captions (Flickr8K)

Acknowledgements:

This material is based upon work supported by the National Science Foundation under Grants No. 1053856, 1205627, 1405883, IIS-1228082, and CIF-1302438 as well as support from Xerox UAC and the Sloan Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or any sponsor.

We thank the NVIDIA Corporation for the generous donation of the GPUs used for our experiments.

Please direct any questions to bplumme2 -at- illinois dot edu