January 22, 2018

Microsoft researchers build a bot that draws what you tell it to

by John Roach, Microsoft

If you're handed a note that asks you to draw a picture of a bird with a yellow body, black wings and a short beak, chances are you'll start with a rough outline of a bird, then glance back at the note, see the yellow part and reach for a yellow pen to fill in the body, read the note again and reach for a black pen to draw the wings and, after a final check, shorten the beak and define it with a reflective glint. Then, for good measure, you might sketch a tree branch where the bird rests.

Now, there's a bot that can do that, too.

The new artificial intelligence technology under development in Microsoft's research labs is programmed to pay close attention to individual words when generating images from caption-like text descriptions. This deliberate focus produced a nearly three-fold boost in image quality compared to the previous state-of-the-art technique for text-to-image generation, according to results on an industry-standard test reported in a research paper posted on arXiv.org.

The technology, which the researchers simply call the drawing bot, can generate images of everything from ordinary pastoral scenes, such as grazing livestock, to the absurd, such as a floating double-decker bus. Each image contains details that are absent from the text descriptions, indicating that this artificial intelligence contains an artificial imagination.

"If you go to Bing and you search for a bird, you get a bird picture. But here, the pictures are created by the computer, pixel by pixel, from scratch," said Xiaodong He, a principal researcher and research manager in the Deep Learning Technology Center at Microsoft's research lab in Redmond, Washington. "These birds may not exist in the real world—they are just an aspect of our computer's imagination of birds."

The drawing bot closes a research circle around the intersection of computer vision and natural language processing that He and colleagues have explored for the past half-decade. They started with technology that automatically writes photo captions – the CaptionBot – and then moved to a technology that answers questions humans ask about images, such as the location or attributes of objects, which can be especially helpful for blind people.

These research efforts require training machine learning models to identify objects, interpret actions and converse in natural language.

"Now we want to use the text to generate the image," said Qiuyuan Huang, a postdoctoral researcher in He's group and a paper co-author. "So, it is a cycle."

Image generation is a more challenging task than image captioning, added Pengchuan Zhang, an associate researcher on the team, because the process requires the drawing bot to imagine details that are not contained in the caption. "That means you need your machine learning algorithms running your artificial intelligence to imagine some missing parts of the images," he said.

Attentive image generation

At the core of Microsoft's drawing bot is a technology known as a Generative Adversarial Network, or GAN. The network consists of two machine learning models, one that generates images from text descriptions and another, known as a discriminator, that uses text descriptions to judge the authenticity of generated images. The generator attempts to get fake pictures past the discriminator; the discriminator never wants to be fooled. As the two models train against each other, the discriminator pushes the generator toward perfection.
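The tug-of-war between the two models can be sketched with the textbook GAN losses. This is a generic illustration rather than the loss used in the AttnGAN paper: the function names are made up for this example, the scores are hypothetical discriminator outputs (the probability that an image-caption pair is real), and the generator loss shown is the common non-saturating variant.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for a single probability in (0, 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))

def discriminator_loss(scores_real, scores_fake):
    """Low when real pairs score near 1 and generated pairs score near 0."""
    real = sum(bce(p, 1.0) for p in scores_real) / len(scores_real)
    fake = sum(bce(p, 0.0) for p in scores_fake) / len(scores_fake)
    return real + fake

def generator_loss(scores_fake):
    """Low when the discriminator scores generated pairs as real."""
    return sum(bce(p, 1.0) for p in scores_fake) / len(scores_fake)

# Hypothetical scores: first the discriminator spots the fakes, then it is fooled.
d_when_sharp = discriminator_loss([0.9, 0.8], [0.1, 0.2])
d_when_fooled = discriminator_loss([0.9, 0.8], [0.7, 0.6])
g_when_caught = generator_loss([0.1, 0.2])
g_when_fooling = generator_loss([0.7, 0.6])
```

In this sketch the same change in scores moves the two losses in opposite directions, which is the sense in which the models are adversaries: what lowers the generator's loss raises the discriminator's.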

Microsoft's drawing bot was trained on datasets that contain paired images and captions, which allow the models to learn how to match words to the visual representation of those words. The GAN, for example, learns to generate an image of a bird when a caption says bird and, likewise, learns what a picture of a bird should look like. "That is a fundamental reason why we believe a machine can learn," said He.

GANs work well when generating images from simple text descriptions such as a blue bird or an evergreen tree, but the quality stagnates with more complex text descriptions such as a bird with a green crown, yellow wings and a red belly. That's because the entire sentence serves as a single input to the generator. The detailed information of the description is lost. As a result, the generated image is a blurry greenish-yellowish-reddish bird instead of a close, sharp match with the description.
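A deliberately crude sketch shows how squeezing a whole sentence into one vector can discard word-level detail. The embedding table and the choice of mean pooling are assumptions made for illustration only; earlier text-to-image GANs used learned sentence encoders, not this averaging, but the loss of detail is of the same kind.

```python
# Hypothetical toy embeddings, invented for this example.
TOY_EMBEDDINGS = {
    "green": [1.0, 0.0, 0.0],
    "yellow": [0.0, 1.0, 0.0],
    "crown": [0.0, 0.0, 1.0],
    "wings": [0.5, 0.5, 0.0],
}

def sentence_vector(words):
    """Collapse a caption into one vector by averaging its word vectors."""
    vectors = [TOY_EMBEDDINGS[w] for w in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Two captions that describe different birds.
green_crown = sentence_vector(["green", "crown", "yellow", "wings"])
yellow_crown = sentence_vector(["yellow", "crown", "green", "wings"])

# Both captions collapse to the same single input for a generator.
same_input = green_crown == yellow_crown
```

Here the two captions become indistinguishable once pooled, so a generator that sees only that one vector has no way to know which color belongs to which body part.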

As humans draw, we repeatedly refer to the text and pay close attention to the words that describe the region of the image we are drawing. To capture this human trait, the researchers created what they call an attentional GAN, or AttnGAN, that mathematically represents the human concept of attention. It does this by breaking up the input text into individual words and matching those words to specific regions of the image.
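The idea of matching words to image regions can be sketched as a small attention computation. This is a minimal illustration under stated assumptions, not the AttnGAN implementation: the two-dimensional vectors are toy values, the region names are hypothetical, and dot-product similarity followed by a softmax is one common way to compute attention weights.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(region, word_vectors):
    """Return attention weights over words and the resulting context vector."""
    weights = softmax([dot(region, w) for w in word_vectors])
    dim = len(word_vectors[0])
    context = [sum(wt * w[i] for wt, w in zip(weights, word_vectors))
               for i in range(dim)]
    return weights, context

words = ["yellow", "body", "black", "wings"]
# Toy embeddings (assumed): first axis ~ "body-like", second ~ "wing-like".
word_vectors = [[4.0, 0.0], [3.0, 0.5], [0.0, 4.0], [0.5, 3.0]]
# Hypothetical feature vectors for two regions of the image being drawn.
regions = {"body_region": [1.0, 0.0], "wing_region": [0.0, 1.0]}

focus = {}
for name, features in regions.items():
    weights, _context = attend(features, word_vectors)
    focus[name] = words[weights.index(max(weights))]
```

Each region ends up with its own weighted mix of the caption's words, so the part of the image that holds the body can lean on "yellow" while the part that holds the wings leans on "black".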

"Attention is a human concept; we use math to make attention computational," explained He.

The model also learns what humans call commonsense from the training data, and it pulls on this learned notion to fill in details of images that are left to the imagination. For example, since many images of birds in the training data show birds sitting on tree branches, the AttnGAN usually draws birds sitting on branches unless the text specifies otherwise.

"From the data, the machine learning algorithm learns this commonsense where the bird should belong," said Zhang. As a test, the team fed the drawing bot captions for absurd images, such as "a red double-decker bus is floating on a lake." It generated a blurry, drippy image that resembles both a boat with two decks and a double-decker bus on a lake surrounded by mountains. The image suggests the bot had an internal struggle between knowing that boats float on lakes and the text specification of bus.

"We can control what we describe and see how the machine reacts," explained He. "We can poke and test what the machine learned. The machine has some background learned commonsense, but it can still follow what you ask and maybe, sometimes, it seems a bit ridiculous."

Practical applications

Text-to-image generation technology could find practical applications acting as a sort of sketch assistant to painters and interior designers, or as a tool for voice-activated photo refinement. With more computing power, He imagines the technology could generate animated films based on screenplays, augmenting the work that animated filmmakers do by removing some of the manual labor involved.

For now, the technology is imperfect. Close examination of images almost always reveals flaws, such as birds with blue beaks instead of black and fruit stands with mutant bananas. These flaws are a clear indication that a computer, not a human, created the images. Nevertheless, the quality of the AttnGAN images is a nearly three-fold improvement over the previous best-in-class GAN and serves as a milestone on the road toward a generic, human-like intelligence that augments human capabilities, according to He.

"For AI and humans to live in the same world, they have to have a way to interact with each other," explained He. "And language and vision are the two most important modalities for humans and machines to interact with each other."

More information: AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. arXiv:1711.10485 [cs.CV] arxiv.org/abs/1711.10485

Provided by Microsoft

Citation: Microsoft researchers build a bot that draws what you tell it to (2018, January 22) retrieved 17 July 2024 from https://phys.org/news/2018-01-microsoft-bot.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
