March 17, 2016

Paying attention to words not just images leads to better image captions

A team of University of Rochester and Adobe researchers is outperforming other teams at generating image captions by computer in an international competition. The key to their winning approach? Thinking about words, what they mean and how they fit into a sentence structure, just as much as thinking about the image itself.

The Rochester/Adobe model mixes the two approaches that are often used in image captioning: the "top-down" approach, which starts from the "gist" of the image and then converts it into words, and the "bottom-up" approach, which first assigns words to different aspects of the image and then combines them together to form a sentence.

The Rochester/Adobe model is currently beating Google, Microsoft, Baidu/UCLA, Stanford University, University of California, Berkeley, University of Toronto/Montreal, and others to top the leaderboard in the Microsoft COCO Image Captioning Challenge, an image captioning competition run by Microsoft. While the winner of the year-long competition is still to be determined, the Rochester "Attention" system (ATT on the leaderboard) has been leading the field since last November.

Other groups have also tried to combine the top-down and bottom-up methods, using a feedback mechanism that lets a system improve on what either approach could do alone. However, several of those systems focused on "visual attention," which weighs which parts of an image are visually most important in order to describe the image better.

The Rochester/Adobe system focuses on what the researchers describe as "semantic attention." In a paper accepted by the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), entitled "Image Captioning with Semantic Attention," computer science professor Jiebo Luo and his colleagues define semantic attention as "the ability to provide a detailed, coherent description of semantically important objects that are needed exactly when they are needed."

"To describe an image you need to decide what to pay more attention to," said Luo. "It is not only about what is in the center of the image or a bigger object, it's also about coming up with a way of deciding on the importance of specific words."

For example, take an image that shows a table and seated people. The table might be at the center of the image, but a better caption might be "a group of people sitting around a table" instead of "a table with people seated." Both are correct, but the former also takes into account what might be of interest to readers and viewers.
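The idea of weighting words by importance can be illustrated with a small sketch. This is not the researchers' model; it is a toy example that assumes each candidate word already has a made-up relevance score, and it uses a softmax to turn those scores into attention weights. The function names and the numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(relevance):
    """Return (word, weight) pairs, most attended word first."""
    words = list(relevance)
    weights = softmax([relevance[w] for w in words])
    return sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)

# Hypothetical scores for words detected in the table photo.
# A real system would learn these; here they are invented for illustration.
relevance = {"table": 1.0, "people": 2.0, "sitting": 1.2, "wall": -1.0}

ranked = attend(relevance)
top_word = ranked[0][0]
for word, weight in ranked:
    print(f"{word:8s} {weight:.2f}")
```

With these invented scores, "people" receives the largest weight even though "table" is listed first, which mirrors the preference for "a group of people sitting around a table."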

Computer image captioning brings together two key areas in artificial intelligence: computer vision and natural language processing. On the computer vision side, researchers train their systems on a massive dataset of images so that the systems learn to identify objects in images. Language models can then be used to put the resulting words together. Luo and his team also trained their system on a large collection of texts. The objective was to understand not only sentence structure but also the meanings of individual words, which words are often used together with them, and which words might be semantically more important.
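As a rough illustration of learning "which words are often used together" from text, the sketch below counts how often pairs of words appear in the same sentence. It is a simplified stand-in, not the training procedure used by the researchers, and the three-sentence corpus is invented for the example.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sentences):
    """Count how many sentences contain each unordered pair of words."""
    counts = Counter()
    for sentence in sentences:
        # Use each word once per sentence; sort so pairs have a stable order.
        words = sorted(set(sentence.lower().split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

# Invented mini-corpus; a real system would use a very large body of text.
corpus = [
    "people sitting around a table",
    "a group of people sitting outside",
    "a table with a lamp",
]

counts = cooccurrence(corpus)
print(counts[("people", "sitting")])  # pairs are stored in alphabetical order
```

Pairs that appear together in many sentences, such as "people" and "sitting" here, are the kind of association a language model can draw on when choosing the next word of a caption.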

A closely related paper on video captioning by Luo, graduate student Yuncheng Li, and their Yahoo Research colleagues Yale Song, Liangliang Cao, Joel Tetreault, and Larry Goldberg, "TGIF: A New Dataset and Benchmark on Animated GIF Description," will also be featured as a "Spotlight" presentation at CVPR.

Provided by University of Rochester

Citation: Paying attention to words not just images leads to better image captions (2016, March 17) retrieved 29 June 2024 from https://phys.org/news/2016-03-attention-words-images-image-captions.html

