Information theory holds surprises for machine learning

January 24, 2019, Santa Fe Institute

New SFI research challenges a popular conception of how machine learning algorithms "think" about certain tasks.

The conception goes something like this: because of their ability to discard useless information, a class of algorithms called deep neural networks can learn general concepts from raw data, like identifying cats generally after encountering tens of thousands of images of different cats in different situations. This seemingly human ability is said to arise as a byproduct of the networks' layered architecture. Early layers encode the "cat" label along with all of the raw information needed for prediction. Subsequent layers then compress the information, as if through a bottleneck. Irrelevant data, like the color of the cat's coat or the saucer of milk beside it, is forgotten, leaving only general features behind. Information theory provides bounds on just how optimal each layer is, in terms of how well it can balance the competing demands of compression and prediction.
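To make that trade-off concrete, here is a minimal, purely illustrative sketch (not taken from the paper): it builds a tiny hypothetical distribution over inputs X and labels Y, defines a representation T through an encoder p(T|X), and measures "compression" as the mutual information I(X;T) and "prediction" as I(T;Y). The helper names, probabilities, and encoders are all invented for the example.

```python
# Minimal sketch of information-bottleneck bookkeeping (illustrative only).
# X = raw input, T = a layer's representation, Y = label.
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in bits for a joint distribution given as a 2-D array."""
    p_joint = p_joint / p_joint.sum()
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])).sum())

# Hypothetical setup: 4 equally likely inputs with deterministic labels.
p_x = np.full(4, 0.25)                     # p(X)
p_y_given_x = np.array([[1, 0],            # inputs 0,1 -> class 0
                        [1, 0],
                        [0, 1],            # inputs 2,3 -> class 1
                        [0, 1]], float)

def ib_terms(p_t_given_x):
    """Return (I(X;T), I(T;Y)) for an encoder p(T|X)."""
    p_xt = p_x[:, None] * p_t_given_x                     # joint p(X,T)
    p_ty = p_t_given_x.T @ (p_x[:, None] * p_y_given_x)   # joint p(T,Y)
    return mutual_information(p_xt), mutual_information(p_ty)

# Identity encoder: keeps everything (2 bits about X, 1 bit about Y).
identity = np.eye(4)
# Bottleneck encoder: lumps inputs that share a label (1 bit about X, still 1 bit about Y).
lumped = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)

for name, enc in [("identity", identity), ("lumped", lumped)]:
    ixt, ity = ib_terms(enc)
    print(f"{name:8s}  I(X;T) = {ixt:.2f} bits   I(T;Y) = {ity:.2f} bits")
```

In this toy deterministic case, the lumped encoder halves I(X;T) without losing any I(T;Y), which is the kind of "free" compression the article goes on to discuss.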

"A lot of times when you have a neural network and it learns to map faces to names, or pictures to numerical digits, or amazing things like French text to English text, it has a lot of intermediate hidden layers that information flows through," says Artemy Kolchinsky, an SFI Postdoctoral Fellow and the study's lead author. "So there's this long-standing idea that as raw inputs get transformed to these intermediate representations, the system is trading prediction for compression, and building higher-level concepts through this information bottleneck."

However, Kolchinsky and his collaborators Brendan Tracey (SFI, MIT) and Steven Van Kuyk (University of Wellington) uncovered a surprising weakness when they applied this explanation to common classification problems, where each input has one correct output (e.g., each picture is either of a cat or of a dog). In such cases, they found that classifiers with many layers generally do not give up some prediction for improved compression. They also found that there are many "trivial" representations of the inputs which are, from the point of view of information theory, optimal in terms of their balance between prediction and compression.

"We found that this information bottleneck measure doesn't see compression in the same way you or I would. Given the choice, it is just as happy to lump 'martini glasses' in with 'Labradors', as it is to lump them in with 'champagne flutes,'" Tracey explains. "This means we should keep searching for compression measures that better match our notions of compression."

While the idea of compressing inputs may still play a useful role in machine learning, this research suggests it is not sufficient for evaluating the internal representations used by different machine learning algorithms.

At the same time, Kolchinsky says that the concept of trade-off between compression and prediction will still hold for less deterministic tasks, like predicting the weather from a noisy dataset. "We're not saying that information bottleneck is useless for supervised [machine] learning," Kolchinsky stresses. "What we're showing here is that it behaves counter-intuitively on many common machine learning problems, and that's something people in the machine learning community should be aware of."

More information: Caveats for information bottleneck in deterministic scenarios. export.arxiv.org/abs/1808.07593
