July 21, 2017

Computers using linguistic clues to deduce photo content

Scientists at Disney Research and the University of California, Davis have found that the way a person describes the content of a photo can provide important clues for computer vision programs to determine where various things appear in the image.

According to Leonid Sigal, a senior research scientist at Disney Research, it's not just the words, but the sentence structure of a caption that can help a computer determine where in an image a particular object or action is depicted. By parsing the sentence and applying deep learning techniques, the computer can use the hierarchy of the sentence to better understand spatial relationships and associate each phrase with the appropriate part of the image.

A neural network based on this approach potentially could automate the process of annotating images that subsequently can be used to train visual recognition programs. The researchers, including Fanyi Xiao and Yong Jae Lee of UC Davis, will present their findings at the IEEE Conference on Computer Vision and Pattern Recognition on July 22 in Honolulu.

"We've seen tremendous progress in the ability of computers to detect and categorize objects, to understand scenes and even to write basic captions, but these capabilities have been developed largely by training computer programs with huge numbers of images that have been carefully and laboriously labeled as to their content," said Markus Gross, vice president at Disney Research. "As computer vision applications tackle increasingly complex problems, creating these large training data sets has become a serious bottleneck."

Using just a little bit of labeled data to generate these large training sets has been a goal of researchers for years and the approach by the Disney and UC Davis scientists may be the first to leverage sentence structure in doing so.

The phrase "a grey cat staring at a hand with a donut," for instance, suggests that a hand and a donut will appear together while "staring" suggests that the grey cat should be spatially disjointed from the hand with the donut.

Xiao said recognizing these constraints - natural language that indicates which things are together and which are apart - provides important context that enables the neural network to produce more accurate visual localizations for language inputs at all levels (words, phrase and sentence).

Different language inputs thus will provide different results for the same image. In a photo of a park, the phrase "girl sits on bench" results in the computer highlighting a girl sitting, while "bench is grey stone" highlights just the stone end of the bench, without highlighting the girl.

In testing this approach with existing visual data sets, the researchers showed their system produced more accurate localizations than baseline systems that do not consider the structure of natural language. "While mainstream weakly-supervised localization approaches have used image tags as the source of supervision, our work instead uses captions and is thus able to exploit the rich structure in language. We hope this work will inspire more research in this direction." said Yong Jae.

Combining creativity and innovation, this research continues Disney's rich legacy of leveraging technology to enhance the tools and systems of tomorrow.

More information: "Weakly-Supervised Visual Grounding of Phrases with Linguistic Structures-Paper" [PDF, 2.46 MB]

Provided by Disney Research

Citation: Computers using linguistic clues to deduce photo content (2017, July 21) retrieved 17 July 2024 from https://phys.org/news/2017-07-linguistic-clues-deduce-photo-content.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

Paying attention to words not just images leads to better image captions

14 shares

Feedback to editors

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

8 hours ago

Intensive farming could raise risk of new pandemics, researchers warn

9 hours ago

Scientists develop new AI method to create material 'fingerprints'

12 hours ago

Study shows frogs can quickly increase their tolerance to pesticides

13 hours ago

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

13 hours ago

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

13 hours ago

Scientists use machine learning to predict diversity of tree species in forests

14 hours ago

Physicists pool skills to better describe the unstable sigma meson particle

16 hours ago

Telescope tag-team discovers 10 strange and exotic pulsars

16 hours ago

NASA transmits hip-hop song to deep space for first time

16 hours ago

Load comments (0)

Computers using linguistic clues to deduce photo content

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

Intensive farming could raise risk of new pandemics, researchers warn

Scientists develop new AI method to create material 'fingerprints'

Study shows frogs can quickly increase their tolerance to pesticides

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

Scientists use machine learning to predict diversity of tree species in forests

Physicists pool skills to better describe the unstable sigma meson particle

Telescope tag-team discovers 10 strange and exotic pulsars

NASA transmits hip-hop song to deep space for first time

Relevant PhysicsForums posts

Particle.js: Exploring Particle Physics with Web Technologies

Help solving a geometrical matching issue with Graph Neural Networks

5 GHz PC WiFi connection Cybersecurity question

Help with some optimization code for Block Matrices

Is an API Always Necessary for Server-Client Communication?

I did this POST message configuration damage to my wifi internet, help

Paying attention to words not just images leads to better image captions

Neural nets model audience reactions to movies

Computer vision system studies word use to recognize objects it has never seen before

Cow goes moo: Artificial intelligence-based system associates images with sounds

Object and scene recognition software work together to understand video content

Computer learns to recognize sounds by watching video

Hyphens in paper titles harm citation counts and journal impact factors

A big step toward the practical application of 3-D holography with high-performance computers

Combining multiple CCTV images could help catch suspects

Applying deep learning to motion capture with DeepLabCut

Training artificial intelligence with artificial X-rays

New model for large-scale 3-D facial recognition

Medical Xpress

Tech Xplore

Science X

Computers using linguistic clues to deduce photo content

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

Intensive farming could raise risk of new pandemics, researchers warn

Scientists develop new AI method to create material 'fingerprints'

Study shows frogs can quickly increase their tolerance to pesticides

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

Scientists use machine learning to predict diversity of tree species in forests

Physicists pool skills to better describe the unstable sigma meson particle

Telescope tag-team discovers 10 strange and exotic pulsars

NASA transmits hip-hop song to deep space for first time

Relevant PhysicsForums posts

Related Stories

Paying attention to words not just images leads to better image captions

Neural nets model audience reactions to movies

Computer vision system studies word use to recognize objects it has never seen before

Cow goes moo: Artificial intelligence-based system associates images with sounds

Object and scene recognition software work together to understand video content

Computer learns to recognize sounds by watching video

Recommended for you

Hyphens in paper titles harm citation counts and journal impact factors

A big step toward the practical application of 3-D holography with high-performance computers

Combining multiple CCTV images could help catch suspects

Applying deep learning to motion capture with DeepLabCut

Training artificial intelligence with artificial X-rays

New model for large-scale 3-D facial recognition

Newsletter sign up

Donate and enjoy an ad-free experience