Algorithm reduces size of data sets while preserving their mathematical properties

May 20, 2015 by Larry Hardesty, Massachusetts Institute of Technology
Credit: Jose-Luis Olivares/MIT

As anyone who's ever used a spreadsheet can attest, it's often convenient to organize data into tables. But in the age of big data, those tables can be enormous, with millions or even hundreds of millions of rows.

One way to make big-data analysis computationally practical is to reduce the size of data tables—or matrices, to use the mathematical term—by leaving out a bunch of rows. The trick is that the remaining rows have to be in some sense representative of the ones that were omitted, in order for computations performed on them to yield approximately the right results.

At the ACM Symposium on Theory of Computing in June, MIT researchers will present a new algorithm that finds the smallest possible approximation of the original matrix that guarantees reliable computations. For a class of problems important in engineering and machine learning, this is a significant improvement over previous techniques. And for all classes of problems, the algorithm finds the approximation as quickly as possible.

In order to determine how well a given row of the condensed matrix represents a row of the original matrix, the algorithm needs to measure the "distance" between them. But there are different ways to define "distance."

One common way is so-called "Euclidean distance." In Euclidean distance, the differences between the entries at corresponding positions in the two rows are squared and added together, and the distance between rows is the square root of the resulting sum. The intuition is that of the Pythagorean theorem: The square root of the sum of the squares of the lengths of a right triangle's legs gives the length of the hypotenuse.

Another measure of distance is less common but particularly useful in solving machine-learning and other optimization problems. It's called "Manhattan distance," and it's simply the sum of the absolute differences between the corresponding entries in the two rows.

Inside the norm

In fact, both Manhattan distance and Euclidean distance are instances of what statisticians call "norms." The Manhattan distance, or 1-norm, is the first root of the sum of differences raised to the first power, and the Euclidean distance, or 2-norm, is the square root of the sum of differences raised to the second power. The 3-norm is the cube root of the sum of differences raised to the third power, and so on to infinity.

In their paper, the MIT researchers—Richard Peng, a postdoc in applied mathematics, and Michael Cohen, a graduate student in electrical engineering and computer science—demonstrate that their algorithm is optimal for condensing matrices under any norm. But according to Peng, "The one we really cared about was the 1-norm."

In matrix condensation—under any norm—the first step is to assign each row of the original matrix a "weight." A row's weight represents the number of other rows that it's similar to, and it determines the likelihood that the row will be included in the condensed matrix. If it is, its values will be multiplied according to its weight. So, for instance, if 10 rows are good stand-ins for each other, but not for any other rows of the matrix, each will have a 10 percent chance of getting into the condensed matrix. If one of them does, its entries will all be multiplied by 10, so that it will reflect the contribution of the other nine rows it's standing in for.

Although Manhattan distance is in some sense simpler than Euclidean distance, it makes calculating rows' weights more difficult. Previously, the best algorithm for condensing matrices under the 1-norm would yield a matrix whose number of rows was proportional to the number of columns of the original matrix raised to the power of 2.5. The best algorithm for condensing matrices under the 2-norm, however, would yield a matrix whose number of rows was proportional to the number of columns of the original matrix times its own logarithm.

That means that if the matrix had 100 columns, under the 1-norm, the best possible condensation, before Peng and Cohen's work, was a matrix with hundreds of thousands of rows. Under the 2-norm, it was a matrix with a couple of hundred rows. That discrepancy grows as the number of columns increases.

Taming recursion

Peng and Cohen's algorithm condenses matrices under the 1-norm as well as it does under the 2-norm; under the 2-norm, it condenses matrices as well as its predecessors do. That's because, for the 2-norm, it simply uses the best existing algorithm. For the 1-norm, it uses the same algorithm, but it uses it five or six times.

The paper's real contribution is to mathematically prove that the 2-norm algorithm will yield reliable results under the 1-norm. As Peng explains, an equation for calculating 1-norm weights has been known for some time. But "the funny thing with that definition is that it's recursive," he says. "So the correct set of weights appears on both the left-hand side and the right-hand side." That is, the weight for a given row—call it w—is set equal to a mathematical expression that itself includes w.

"This definition was known to exist, but people in stats didn't know what to do with it," Peng says. "They look at it and think, 'How do I ever compute anything with this?'"

What Peng and Cohen prove is that if you start by setting the w on the right side of the equation equal to 1, then evaluate the expression and plug the answer back into the right-hand w, then do the same thing again, and again, you'll quickly converge on a good approximation of the correct value of w.

"It's highly elegant mathematics, and it gives a significant advance over previous results," says Richard Karp, a professor of computer science at the University of California at Berkeley and a winner of the National Medal of Science and of the Turing Award, the highest honor in computer science. "It boils the original problem down to a very simple-to-understand one. I admire the mathematical development that went into it."

Explore further: Explained: Matrices

More information: "ℓp Row Sampling by Lewis Weights."

Related Stories

Explained: Matrices

December 6, 2013

Among the most common tools in electrical engineering and computer science are rectangular grids of numbers known as matrices. The numbers in a matrix can represent data, and they can also represent mathematical equations. ...

Unraveling the Matrix

July 29, 2010

A new way of analyzing grids of numbers known as matrices could improve signal-processing applications and data-compression schemes.

Math model helps explain how conformity works

March 11, 2015

(—A pair of anthropologists has come up with a math model to help better understand individual conformity and how it relates to groups and societies as a whole. In their paper published in Royal Society Open Science, ...

Fundamental algorithm gets first improvement in 10 years

September 27, 2010

The maximum-flow problem, or max flow, is one of the most basic problems in computer science: First solved during preparations for the Berlin airlift, it’s a component of many logistical problems and a staple of introductory ...

Researchers build Quad HD TV chip

February 20, 2013

It took only a few years for high-definition televisions to make the transition from high-priced novelty to ubiquitous commodity—and they now seem to be heading for obsolescence just as quickly. At the Consumer Electronics ...

Recommended for you

Team breaks world record for fast, accurate AI training

November 7, 2018

Researchers at Hong Kong Baptist University (HKBU) have partnered with a team from Tencent Machine Learning to create a new technique for training artificial intelligence (AI) machines faster than ever before while maintaining ...


Adjust slider to filter visible comments by rank

Display comments: newest first

not rated yet May 23, 2015
As a layman it's difficult for me to understand all the implications but one question comes to my mind. What if any effects can this have on the P=NP puzzle?
not rated yet Jun 11, 2015
Could this algorithm be used for model-based clustering of large data sets?

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.