Shrinking 'ridiculous' data sets to manageable size

May 14, 2009 By Bill Steele

Two decades ago a renowned statistician described a computer data set of 1 billion bytes as "huge" and 10 trillion bytes as "ridiculous."

Today, thanks to the use of computers to collect and generate data, such ridiculously large data sets are common, from genome databases to search engine logs to Wal-Mart sales data. But the ability to monitor and process the data has not kept up with the ability to create it.

With a new three-year, $551,508 Young Investigator Award from the U.S. Office of Naval Research (ONR), Ping Li, Cornell assistant professor of statistical science, is taking a new mathematical approach. His goal: to "shrink" massive data sets into manageable approximations that can be processed in a reasonable length of time to detect such anomalies as denial-of-service attacks on the Internet or to enable computers to learn from experience for such applications as natural language processing, Web searching and computer vision.

"Instead of storing the whole data, we compute and store a sketch of the data, which is small enough to fit in the memory and still contains enough information to recover crucial relationships of the data," Li explained.

From the resulting sketch, Li says it is possible, for example, to compute a quantity known as the Shannon entropy, which is, roughly, a measure of the degree of uncertainty in a body of information. A sudden change in this quantity would warn engineers of an anomaly such as a network failure, a large transfer of money or perhaps terrorist chatter. Li also plans to develop and publicly distribute software that can be used in machine-learning applications on massive, high-dimensional data sets.
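Li's contribution is estimating entropy from the compressed sketch alone; for comparison, the direct definition computed from full (uncompressed) counts is straightforward. The sketch below shows why entropy flags anomalies: balanced traffic has high entropy, while traffic collapsing onto one destination, as in a denial-of-service flood, drives it toward zero. The destination labels are invented for illustration.

```python
import math
from collections import Counter

def shannon_entropy(stream):
    """Shannon entropy (in bits) of the empirical distribution of a stream."""
    counts = Counter(stream)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Balanced traffic over four destinations: maximal entropy, 2.0 bits.
normal = ["a", "b", "c", "d"] * 25

# 97% of packets hitting one destination: entropy near zero,
# the kind of sudden drop that would flag a denial-of-service attack.
attack = ["a"] * 97 + ["b", "c", "d"]
```

Here `shannon_entropy(normal)` is exactly 2.0 bits, while `shannon_entropy(attack)` is roughly 0.24 bits; monitoring this quantity over time, rather than the raw traffic, is far cheaper once the counts come from a small sketch.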

The ONR Young Investigator Program identifies and supports academic scientists and engineers who have received a doctorate or equivalent degree within the past five years and who show exceptional promise for doing cutting-edge research.

Provided by Cornell University
