May 20, 2015

Algorithm reduces size of data sets while preserving their mathematical properties

by Larry Hardesty, Massachusetts Institute of Technology

As anyone who's ever used a spreadsheet can attest, it's often convenient to organize data into tables. But in the age of big data, those tables can be enormous, with millions or even hundreds of millions of rows.

One way to make big-data analysis computationally practical is to reduce the size of data tables—or matrices, to use the mathematical term—by leaving out a bunch of rows. The trick is that the remaining rows have to be in some sense representative of the ones that were omitted, in order for computations performed on them to yield approximately the right results.

At the ACM Symposium on Theory of Computing in June, MIT researchers will present a new algorithm that finds the smallest possible approximation of the original matrix that guarantees reliable computations. For a class of problems important in engineering and machine learning, this is a significant improvement over previous techniques. And for all classes of problems, the algorithm finds the approximation as quickly as possible.

In order to determine how well a given row of the condensed matrix represents a row of the original matrix, the algorithm needs to measure the "distance" between them. But there are different ways to define "distance."

One common way is so-called "Euclidean distance." In Euclidean distance, the differences between the entries at corresponding positions in the two rows are squared and added together, and the distance between rows is the square root of the resulting sum. The intuition is that of the Pythagorean theorem: The square root of the sum of the squares of the lengths of a right triangle's legs gives the length of the hypotenuse.

Another measure of distance is less common but particularly useful in solving machine-learning and other optimization problems. It's called "Manhattan distance," and it's simply the sum of the absolute differences between the corresponding entries in the two rows.
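The two distance measures described above can be written in a few lines. This is a minimal sketch for illustration only; the function names and the sample rows are hypothetical and do not come from the researchers' paper.

```python
import math

def euclidean_distance(row_a, row_b):
    """Square root of the sum of squared differences between matching entries."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))

def manhattan_distance(row_a, row_b):
    """Sum of absolute differences between matching entries."""
    return sum(abs(a - b) for a, b in zip(row_a, row_b))

# Two made-up rows whose entries differ by 3, 4 and 0.
row_a = [1.0, 2.0, 3.0]
row_b = [4.0, 6.0, 3.0]

print(euclidean_distance(row_a, row_b))  # sqrt(9 + 16 + 0) = 5.0
print(manhattan_distance(row_a, row_b))  # 3 + 4 + 0 = 7.0
```

The same pair of rows is farther apart under Manhattan distance than under Euclidean distance, which is always the case unless the rows differ in at most one position.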

Inside the norm

In fact, both Manhattan distance and Euclidean distance are instances of what statisticians call "norms." The Manhattan distance, or 1-norm, is the first root of the sum of the absolute differences raised to the first power, and the Euclidean distance, or 2-norm, is the square root of the sum of the absolute differences raised to the second power. The 3-norm is the cube root of the sum of the absolute differences raised to the third power, and so on to infinity.
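The whole family of norms can be captured by one function with the power as a parameter. The sketch below is illustrative; the names `p_norm_distance` and `max_norm_distance` are hypothetical, and the second function shows the standard limiting case that the "to infinity" end of the family approaches.

```python
def p_norm_distance(row_a, row_b, p):
    """p-th root of the sum of absolute differences raised to the p-th power."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(row_a, row_b)) ** (1.0 / p)

def max_norm_distance(row_a, row_b):
    """Limiting case as p grows without bound: the largest absolute difference."""
    return max(abs(a - b) for a, b in zip(row_a, row_b))

# Made-up rows whose entries differ by 3, 4 and 0.
row_a = [1.0, 2.0, 3.0]
row_b = [4.0, 6.0, 3.0]

for p in (1, 2, 3, 50):
    print(p, p_norm_distance(row_a, row_b, p))
print("max", max_norm_distance(row_a, row_b))
```

As the power increases, the largest single difference dominates the sum, so the printed distances shrink from 7 toward 4.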

In their paper, the MIT researchers—Richard Peng, a postdoc in applied mathematics, and Michael Cohen, a graduate student in electrical engineering and computer science—demonstrate that their algorithm is optimal for condensing matrices under any norm. But according to Peng, "The one we really cared about was the 1-norm."

In matrix condensation—under any norm—the first step is to assign each row of the original matrix a "weight." A row's weight represents the number of other rows that it's similar to, and it determines the likelihood that the row will be included in the condensed matrix. If it is, its values will be multiplied according to its weight. So, for instance, if 10 rows are good stand-ins for each other, but not for any other rows of the matrix, each will have a 10 percent chance of getting into the condensed matrix. If one of them does, its entries will all be multiplied by 10, so that it will reflect the contribution of the other nine rows it's standing in for.
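The 10-row example above can be simulated directly. The sketch below makes a simplifying assumption that is not stated in the article: each row is kept independently with its own probability, and a kept row is divided by that probability (multiplied by 10 when the probability is 10 percent). The function names and the toy data are hypothetical.

```python
import random

def condense(rows, probabilities, rng):
    """Keep each row with its probability; rescale kept rows by 1/probability."""
    condensed = []
    for row, prob in zip(rows, probabilities):
        if rng.random() < prob:
            condensed.append([value / prob for value in row])
    return condensed

def one_norm_total(rows):
    """Sum of the absolute values of every entry in the table."""
    return sum(abs(value) for row in rows for value in row)

# Ten identical rows that stand in for each other, each with a 10 percent chance.
rows = [[1.0, 2.0]] * 10
probabilities = [0.1] * 10

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20000
average = sum(
    one_norm_total(condense(rows, probabilities, rng)) for _ in range(trials)
) / trials

print(one_norm_total(rows))  # total for the full table
print(average)               # average total for the condensed tables
```

Any single condensed table may contain zero, one or several of the ten rows, but averaged over many runs its total matches that of the full table, which is the point of the rescaling.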

Although Manhattan distance is in some sense simpler than Euclidean distance, it makes calculating rows' weights more difficult. Previously, the best algorithm for condensing matrices under the 1-norm would yield a matrix whose number of rows was proportional to the number of columns of the original matrix raised to the power of 2.5. The best algorithm for condensing matrices under the 2-norm, however, would yield a matrix whose number of rows was proportional to the number of columns of the original matrix times its own logarithm.

That means that if the matrix had 100 columns, under the 1-norm, the best possible condensation, before Peng and Cohen's work, was a matrix with on the order of 100,000 rows or more, since 100 raised to the power of 2.5 is 100,000. Under the 2-norm, it was a matrix with a couple of hundred rows. That discrepancy grows as the number of columns increases.
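The two growth rates can be compared with simple arithmetic. The sketch below ignores the constants of proportionality, which the article does not give, and assumes the natural logarithm; both choices change the exact figures but not the trend. The function names are hypothetical.

```python
import math

def rows_one_norm_before(columns):
    """Earlier 1-norm bound: number of columns raised to the power of 2.5."""
    return columns ** 2.5

def rows_two_norm(columns):
    """2-norm bound: number of columns times its own (natural) logarithm."""
    return columns * math.log(columns)

for columns in (100, 1000, 10000):
    before = rows_one_norm_before(columns)
    after = rows_two_norm(columns)
    print(columns, round(before), round(after), round(before / after))
```

For 100 columns the first figure is about 100,000 and the second is a few hundred; the ratio in the last column keeps growing as columns are added.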

Taming recursion

Peng and Cohen's algorithm condenses matrices under the 1-norm as well as it does under the 2-norm; under the 2-norm, it condenses matrices as well as its predecessors do. That's because, for the 2-norm, it simply uses the best existing algorithm. For the 1-norm, it uses the same algorithm, but it uses it five or six times.

The paper's real contribution is to mathematically prove that the 2-norm algorithm will yield reliable results under the 1-norm. As Peng explains, an equation for calculating 1-norm weights has been known for some time. But "the funny thing with that definition is that it's recursive," he says. "So the correct set of weights appears on both the left-hand side and the right-hand side." That is, the weight for a given matrix row—call it w—is set equal to a mathematical expression that itself includes w.

"This definition was known to exist, but people in stats didn't know what to do with it," Peng says. "They look at it and think, 'How do I ever compute anything with this?'"

What Peng and Cohen prove is that if you start by setting the w on the right side of the equation equal to 1, then evaluate the expression and plug the answer back into the right-hand w, then do the same thing again, and again, you'll quickly converge on a good approximation of the correct value of w.
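The repeated plug-in procedure can be sketched on a toy table. The article does not spell out the equation, so the update used here is an assumption based on the Lewis-weight definition in the cited paper: each new w is the square root of a 2-norm-style score computed with every row divided by its current w. The helper names are hypothetical, and the sketch assumes no all-zero rows and linearly independent columns.

```python
import math

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def update_weights(rows, weights):
    """Evaluate the right-hand side once, using the current weights."""
    d = len(rows[0])
    # gram[j][k] = sum over rows of row[j] * row[k] / weight (a d-by-d matrix).
    gram = [[sum(r[j] * r[k] / w for r, w in zip(rows, weights))
             for k in range(d)] for j in range(d)]
    new_weights = []
    for row in rows:
        x = solve(gram, row)  # x = gram^{-1} row
        new_weights.append(math.sqrt(sum(a * b for a, b in zip(row, x))))
    return new_weights

def one_norm_weights(rows, iterations=40):
    """Start every w at 1, then plug the result back in repeatedly."""
    weights = [1.0] * len(rows)
    for _ in range(iterations):
        weights = update_weights(rows, weights)
    return weights

# Ten interchangeable rows plus one row that nothing else can stand in for.
rows = [[1.0, 0.0]] * 10 + [[0.0, 3.0]]
weights = one_norm_weights(rows)
print(weights[0], weights[-1], sum(weights))
```

On this table the ten interchangeable rows settle at a weight of about 0.1 each, matching the 10 percent chance in the earlier example, while the lone row settles at 1. The sketch runs 40 passes to get many digits of agreement; it is not tuned to the small number of passes the article mentions.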

"It's highly elegant mathematics, and it gives a significant advance over previous results," says Richard Karp, a professor of computer science at the University of California at Berkeley and a winner of the National Medal of Science and of the Turing Award, the highest honor in computer science. "It boils the original problem down to a very simple-to-understand one. I admire the mathematical development that went into it."

More information: "ℓp Row Sampling by Lewis Weights." arxiv.org/abs/1412.0588

Provided by Massachusetts Institute of Technology

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

Citation: Algorithm reduces size of data sets while preserving their mathematical properties (2015, May 20) retrieved 26 April 2024 from https://phys.org/news/2015-05-algorithm-size-mathematical-properties.html

