Restoring balance in machine learning datasets

October 11, 2018 by Giovanni Mariani, IBM
Five representative samples for each class (row) in the CIFAR-10 dataset. For each class, these samples are obtained with generative models trained after dropping from the training set 40% of the images of that specific class. Credit: IBM

If you want to teach a child what an elephant looks like, you have an infinite number of options. Take a photo from National Geographic, a stuffed animal of Dumbo, or an elephant keychain; show it to the child; and the next time he sees an object which looks like an elephant he will likely point and say the word.

Teaching AI what an elephant looks like is a bit different. To train a machine learning algorithm, you will likely need thousands of elephant images using different perspectives, such as head, tail, and profile. But then, even after ingesting thousands of photos, if you connect your algorithm to a camera and show it a pink elephant keychain, it likely won't recognize it as an elephant.

This is a form of data bias, and it often negatively affects the accuracy of classifiers. To fix this bias, using the same example, we would need at least 50-100 images of pink , which could be problematic since pink elephants are "rare".

This is a known challenge in machine learning communities, and whether its pink elephants or road signs, small data sets present big challenges for AI scientists.

Restoring balance for training AI

Since earlier this year, my colleagues and I at IBM Research in Zurich are offering a solution. It's called BAGAN, or balancing generative adversarial networks, and it can generate completely new images, i.e. of pink elephants, to restore balance for training AI.

Five representative samples generated for the three most represented majority classes in the GT- SRB dataset. Credit: IBM
Seeing is believing

In the paper we report using BAGAN on the German Traffic Sign Recognition Benchmark, as well as on MNIST and CIFAR-10, and when compared against state-of-the-art GAN, the methodology outperforms all of them in terms of variety and quality of the generated images when the training dataset is imbalanced. In turn, this leads to a higher accuracy of final classifiers trained on the augmented dataset.

Five representative samples generated for the three least represented minority classes in the GT-SRB dataset. Credit: IBM

Explore further: Baby elephant joins herd at San Diego Zoo Safari Park

More information: BAGAN: Data Augmentation with Balancing GAN. Giovanni Mariani, Florian Scheidegger, Roxana Istrate, Costas Bekas, and Cristiano Malossi. arxiv.org/abs/1803.09655

The work was recently published and made open-source. Visit Github today to try it for free github.com/IBM/BAGAN

Related Stories

Training artificial intelligence with artificial X-rays

July 6, 2018

Artificial intelligence (AI) holds real potential for improving both the speed and accuracy of medical diagnostics. But before clinicians can harness the power of AI to identify conditions in images such as X-rays, they have ...

Recommended for you

Permanent, wireless self-charging system using NIR band

October 8, 2018

As wearable devices are emerging, there are numerous studies on wireless charging systems. Here, a KAIST research team has developed a permanent, wireless self-charging platform for low-power wearable electronics by converting ...

Facebook launches AI video-calling device 'Portal'

October 8, 2018

Facebook on Monday launched a range of AI-powered video-calling devices, a strategic revolution for the social network giant which is aiming for a slice of the smart speaker market that is currently dominated by Amazon and ...

Artificial enzymes convert solar energy into hydrogen gas

October 4, 2018

In a new scientific article, researchers at Uppsala University describe how, using a completely new method, they have synthesised an artificial enzyme that functions in the metabolism of living cells. These enzymes can utilize ...

0 comments

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.