Shrinking 'ridiculous' data sets to manageable size

May 14, 2009 By Bill Steele

Two decades ago a renowned statistician described a computer data set of 1 billion bytes as "huge" and 10 trillion bytes as "ridiculous."

Today, thanks to the use of computers to collect and generate data, such ridiculously large data sets are common, from genome databases to search engine logs to Wal-Mart sales data. But the ability to monitor and process the data has not kept up with the ability to create it.

With a new three-year, $551,508 Young Investigator Award from the U.S. Office of Naval Research (ONR), Ping Li, Cornell assistant professor of statistical science, is taking a new mathematical approach to the problem. His goal is to "shrink" massive data sets into manageable approximations that can be processed in a reasonable length of time. Such approximations could be used to detect anomalies such as denial-of-service attacks on the Internet, or to enable computers to learn from experience in applications such as natural language processing, Web searching and computer vision.

"Instead of storing the whole data, we compute and store a sketch of the data, which is small enough to fit in the memory and still contains enough information to recover crucial relationships of the data," Li explained.
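To make the idea of a sketch concrete, here is a minimal illustration using random projection, one common sketching technique. It is not necessarily the method Li is developing, and the function names, dimensions and seeds are assumptions chosen for the example. Each 1,000-number record is reduced to 256 numbers, yet the distance between two records can still be estimated from the sketches alone.

```python
import math
import random

def sketch(vectors, k, seed=0):
    """Project each d-dimensional vector down to k numbers using one shared
    Gaussian random matrix (generated row by row, never stored whole)."""
    d = len(vectors[0])
    rng = random.Random(seed)
    sketches = [[] for _ in vectors]
    for _ in range(k):
        # One row of the random matrix, applied to every vector.
        row = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for v, s in zip(vectors, sketches):
            s.append(sum(r * x for r, x in zip(row, v)) / math.sqrt(k))
    return sketches

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical data: two random records with d = 1000 features each.
data_rng = random.Random(42)
d, k = 1000, 256
x = [data_rng.random() for _ in range(d)]
y = [data_rng.random() for _ in range(d)]

sx, sy = sketch([x, y], k)
exact = sq_dist(x, y)      # needs the full data
approx = sq_dist(sx, sy)   # needs only the two small sketches
rel_err = abs(approx - exact) / exact
print(f"exact={exact:.1f} approx={approx:.1f} relative error={rel_err:.2%}")
```

The estimate is unbiased, and its typical relative error shrinks roughly as the square root of 2/k, so a larger sketch buys more accuracy at the cost of more memory.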

From the resulting sketch, Li says, it is possible to compute, for example, a quantity known as the Shannon entropy, which is, roughly, a measure of the degree of uncertainty in a body of information. A change in this quantity would warn engineers of an anomaly such as a network failure, a large transfer of money or perhaps terrorist chatter. Li also plans to develop and publicly distribute software that can be used as part of machine-learning applications on massive and high-dimensional data sets.
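As a rough illustration of the entropy-based warning described above, the following sketch computes the Shannon entropy of two made-up windows of network traffic and flags a sharp change. It computes the entropy exactly from the full data rather than from a compressed sketch, and the host names, window sizes and one-bit threshold are all hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of items."""
    counts = Counter(items)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical traffic: 16 hosts sending equal shares of 1,600 packets.
normal = [f"host{i % 16}" for i in range(1600)]
# Hypothetical attack window: one host suddenly dominates the traffic.
attack = ["host0"] * 1500 + [f"host{i % 16}" for i in range(100)]

h_normal = shannon_entropy(normal)
h_attack = shannon_entropy(attack)

# Assumed alert rule: flag a drop of more than one bit between windows.
alert = (h_normal - h_attack) > 1.0
print(f"normal={h_normal:.2f} bits, attack={h_attack:.2f} bits, alert={alert}")
```

When traffic is spread evenly, the entropy is high; when a single source floods the network, as in a denial-of-service attack, the entropy falls sharply, which is what makes it useful as a warning signal.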

The ONR Young Investigator Program identifies and supports academic scientists and engineers who have received a doctorate or equivalent degree within the past five years and who show exceptional promise for doing cutting-edge research.

Provided by Cornell University
