Is your big data messy? We're making an app for that

February 16, 2017 by Cory Nealon, University at Buffalo
Credit: University at Buffalo

Like a teenager's bedroom, big data is often messy.

Malfunctioning computers, data entry errors and other hard-to-spot problems can skew datasets and mislead anyone, from data scientists to data hobbyists, who is trying to draw conclusions from raw data.

Vizier, a software tool under development by a University at Buffalo-led research team, aims to proactively catch those errors.

The project, backed by a $2.7 million National Science Foundation grant, launched in January. Like Excel and other spreadsheet software, Vizier will allow users to interactively work with datasets. For example, it will help people explore, clean, curate and visualize data in meaningful ways, as well as spot errors and offer solutions.

But unlike spreadsheet software, Vizier is intended for much larger datasets; it will be used to examine millions or billions of data points, as opposed to the hundreds or thousands typically plugged into a spreadsheet.

"We are creating a tool that'll let you work with the data you have, and also unobtrusively make helpful observations like 'Hmm... have you noticed that two out of a million records make a 10 percent difference in this average?'" says Oliver Kennedy, PhD, assistant professor of computer science and engineering at UB, and the grant's principal investigator.
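The arithmetic behind that kind of observation is easy to reproduce. The sketch below is an illustration only, not Vizier's method: the function name `mean_shift_from_extremes` and the dataset are invented for this example. It builds one million records, two of which are wildly off, and compares the average with and without them.

```python
import heapq
from statistics import fmean, median


def mean_shift_from_extremes(values, k=2):
    """Compare the mean of all values with the mean after dropping
    the k values farthest from the median (hypothetical helper)."""
    center = median(values)
    # Indices of the k records that sit farthest from the median.
    extreme = set(
        heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i] - center))
    )
    kept = [v for i, v in enumerate(values) if i not in extreme]
    full_mean = fmean(values)
    trimmed_mean = fmean(kept)
    # Relative change in the average caused by the k extreme records.
    shift = (full_mean - trimmed_mean) / trimmed_mean
    return full_mean, trimmed_mean, shift


# Invented data: 999,998 ordinary readings and 2 bad entries.
records = [100.0] * 999_998 + [5_000_100.0] * 2

full, trimmed, shift = mean_shift_from_extremes(records)
print(f"average with all records:      {full:.2f}")
print(f"average without the 2 records: {trimmed:.2f}")
print(f"difference:                    {shift:.1%}")
```

With these made-up numbers, two records out of a million move the average from 100 to 110, a 10 percent difference.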

Co-principal investigators include Juliana Freire, professor of computer science and engineering at New York University, and Boris Glavic, assistant professor in the Department of Computer Science at the Illinois Institute of Technology. The award is from NSF's Data Infrastructure Building Blocks (DIBBs) program.

For years, companies like Google, Microsoft and Apple have used big data to improve their products and services. That same power is now spreading to the masses as government agencies in the United States and elsewhere publish massive amounts of public data on the internet.

For example, New York City and the federal government have open data portals making it possible for anyone with an internet connection to download information and ask questions about their government. When properly used, these portals can shed light on issues relating to health code violations, discrimination, bias and other matters, Kennedy said.

Vizier will be released as free, open-source software.

"We want to make it easier for data scientists, and eventually data hobbyists, to discover and communicate not only what the data says, but why the data says that," he said.
