# New algorithm identifies data subsets that will yield the most reliable predictions

##### July 25, 2014 by Larry Hardesty

Much artificial-intelligence research addresses the problem of making predictions based on large data sets. Obvious examples are the recommendation engines at retail sites like Amazon and Netflix.

But some types of data are harder to collect than online click histories—information about geological formations thousands of feet underground, for instance. And in other applications—such as trying to predict the path of a storm—there may just not be enough time to crunch all the available data.

Dan Levine, an MIT graduate student in aeronautics and astronautics, and his advisor, Jonathan How, the Richard Cockburn Maclaurin Professor of Aeronautics and Astronautics, have developed a new technique that could help with both problems. For a range of common applications in which data is either difficult to collect or too time-consuming to process, the technique can identify the subset of data items that will yield the most reliable predictions. So geologists trying to assess the extent of underground petroleum deposits, or meteorologists trying to forecast the weather, can make do with just a few, targeted measurements, saving time and money.

Levine and How, who presented their work at the Uncertainty in Artificial Intelligence conference this week, consider the special case in which something about the relationships between data items is known in advance. Weather prediction provides an intuitive example: Measurements of temperature, pressure, and wind velocity at one location tend to be good indicators of measurements at adjacent locations, or of measurements at the same location a short time later, but the correlation grows weaker the farther out you move either geographically or chronologically.

#### Graphic content

Such correlations can be represented by something called a probabilistic graphical model. In this context, a graph is a mathematical abstraction consisting of nodes—typically depicted as circles—and edges—typically depicted as line segments connecting nodes. A network diagram is one example of a graph; a family tree is another. In a probabilistic graphical model, the nodes represent variables, and the edges represent the strength of the correlations between them.
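To make the idea concrete, here is a minimal sketch of such a model as plain Python data: nodes are variables and edge weights stand in for correlation strengths. The station names and weight values are hypothetical, chosen only to echo the weather example above.

```python
# A toy probabilistic graphical model: nodes are variables, edge
# weights are correlation strengths (all values hypothetical).
weather_model = {
    "nodes": ["temp_A", "temp_B", "temp_C"],
    "edges": {
        ("temp_A", "temp_B"): 0.9,  # adjacent stations: strong correlation
        ("temp_B", "temp_C"): 0.8,
        ("temp_A", "temp_C"): 0.5,  # farther apart: weaker correlation
    },
}

def neighbors(model, node):
    """Return (neighbor, correlation-weight) pairs for a node."""
    return [(b if a == node else a, w)
            for (a, b), w in model["edges"].items()
            if node in (a, b)]
```

Note that this graph contains a loop (A–B–C–A), which is exactly the structural feature the researchers' algorithm has to contend with.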

Levine and How developed an algorithm that can efficiently calculate just how much information any node in the graph gives you about any other—what in information theory is called "mutual information." As Levine explains, one of the obstacles to performing that calculation efficiently is the presence of "loops" in the graph, or nodes that are connected by more than one path.

Calculating mutual information between nodes, Levine says, is kind of like injecting blue dye into one of them and then measuring the concentration of blue at the other. "It's typically going to fall off as we go further out in the graph," Levine says. "If there's a unique path between them, then we can compute it pretty easily, because we know what path the blue dye will take. But if there are loops in the graph, then it's harder for us to compute how blue other nodes are because there are many different paths."
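The dye analogy has a simple quantitative counterpart for Gaussian variables: the mutual information between two jointly Gaussian variables is a standard closed-form function of their correlation coefficient, and on a tree (a unique path) the end-to-end correlation is just the product of the edge correlations along the path. The sketch below illustrates that fall-off; the edge correlations are hypothetical.

```python
import math

def gaussian_mi(rho):
    """Mutual information (in nats) between two jointly Gaussian
    variables with correlation coefficient rho: -0.5 * ln(1 - rho^2)."""
    return -0.5 * math.log(1.0 - rho * rho)

# On a tree there is a unique path between two nodes, and the
# end-to-end correlation is the product of the edge correlations --
# like dye growing fainter with each hop.
path_correlations = [0.9, 0.8, 0.7]        # hypothetical edge weights
rho_end_to_end = math.prod(path_correlations)
mi = gaussian_mi(rho_end_to_end)           # falls off with path length
```

With loops, by contrast, the dye reaches a node along several paths at once, and no single product of edge weights gives the answer, which is what makes the general computation hard.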

So the first step in the researchers' technique is to calculate "spanning trees" for the graph. A tree is just a graph with no loops: In a family tree, for instance, a loop might mean that someone was both parent and sibling to the same person. A spanning tree is a tree that touches all of a graph's nodes but dispenses with the edges that create loops.
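The paper does not specify how the spanning trees are constructed, but a breadth-first traversal is one standard way to extract a single spanning tree, and it shows the key property: every node is kept, while loop-forming edges are dropped.

```python
from collections import deque

def spanning_tree(nodes, adj):
    """Breadth-first spanning tree: visits every node, keeping only
    the edges first used to reach each one. `adj` maps each node to
    its list of neighbors."""
    root = nodes[0]
    seen = {root}
    tree_edges = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree_edges.append((u, v))
                queue.append(v)
    return tree_edges

# A three-node loop: the tree keeps all three nodes but only two of
# the three edges, so every pair of nodes has a unique path.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
tree = spanning_tree(["a", "b", "c"], adj)
```

A spanning tree of an n-node graph always has exactly n − 1 edges, which is why the "blue dye" calculation becomes tractable on it.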

Most of the nodes that remain in the graph, however, are "nuisances," meaning that they don't contain much useful information about the node of interest. The key to Levine and How's technique is a way to use those nodes to navigate the graph without letting their short-range influence distort the long-range calculation of mutual information.

That's possible, Levine explains, because the probabilities represented by the graph are Gaussian, meaning that they follow the bell curve familiar as the model of, for instance, the dispersion of characteristics in a population. A Gaussian distribution is exhaustively characterized by just two measurements: the average value—say, the average height in a population—and the variance—the rate at which the bell spreads out.

"The uncertainty in the problem is really a function of the spread of the distribution," Levine says. "It doesn't really depend on where the distribution is centered in space." As a consequence, it's often possible to calculate variance across a probabilistic graphical model without relying on the specific values of the data. "The usefulness of data can be assessed before the data itself becomes available," Levine says.


