February 25, 2015

Researcher tackles some of the biggest bottlenecks holding back the data science industry

by Eric Brown, Massachusetts Institute of Technology

When Kalyan Veeramachaneni joined the Any Scale Learning For All (ALFA) group at MIT's CSAIL as a postdoc in 2010, he worked on large-scale machine-learning platforms that enable the construction of models from huge data sets. "The question then was how to decompose a learning algorithm and data into pieces, so each piece could be locally loaded into different machines and several models could be learnt independently," says Veeramachaneni, currently a research scientist at ALFA.

"We then had to decompose the learning algorithm so we could parallelize the compute on each node," says Veeramachaneni. "In this way, the data on each node could be learned by the system, and then we could combine all the solutions and models that had been independently learned."

By 2013, once ALFA had built multiple platforms to accomplish these goals, the team started on a new problem: the growing bottleneck caused by the process of translating the raw data into the formats required by most machine-learning systems.

"Machine-learning systems usually require a covariates table in a column-wise format, as well as a response variable that we try to predict," says Veeramachaneni. "The process to get these from raw data involves curation, syncing and linking of data, and even generating ideas for variables that we can then operationalize and form."

Much of Veeramachaneni's recent research has focused on how to automate this lengthy data prep process. "Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say 'Great, but we're still stuck with the problem of getting the raw data to a place where we can use all these tools,'" Veeramachaneni says.

Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, "so it won't take up too much of their time," he says. "Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?"

GigaBeats and BeatDB

One project in which Veeramachaneni tested his automated data prep concepts was ALFA's GigaBeats project. GigaBeats analyzes arterial blood pressure signals from thousands of patients to predict a future condition. With GigaBeats, numerous steps are involved to prepare the data for analysis, says Veeramachaneni. These include cleaning and conditioning, low pass filters, and extracting features by applying signal-level transformations.

The Biggest Bottleneck in Data Science

Many of these steps involve human decision-making. Often, domain experts know how to do it, but sometimes it's up to the computer scientist. In either case, there's no easy way to go back and revisit those human interventions when a choice made later in the pipeline does not result in the expected level of predictive accuracy, says Veeramachaneni.

Recently, ALFA has built some novel platforms that automate the process, shrinking the prep time from months to a few days. To automate and accelerate data translation, while also enabling visibility into earlier decision-making, ALFA has developed a "complete solution" called BeatDB.

"With BeatDB, we have tunable parameters that in some cases can be input by domain experts, and the rest are automatically tuned," says Veeramachaneni. "From this, we can learn how decisions made at the low-level, raw representation stage can impact the final predicted accuracy efficacy. This deep-mining solution combines all layers of machine learning into a single pipeline and then optimizes and tunes with other machine-learning algorithms on top of it. It really enables fast discovery."

Now that ALFA has made progress on integrating and recording human input, the group is also looking for better ways to present the processed data. For example, when showing GigaBeats data to medical professionals, "they are often much more comfortable if a better representation is given to them instead of showing them raw data," says Veeramachaneni. "It makes it easier to provide input. A lot of our focus is on improving the presentation so we can more easily pull their input into our algorithms, clean or fix the data, or create variables."

A crowdsourcing solution

While automating ALFA's machine-learning pipelines, Veeramachaneni has also contributed to a number of real-world analytics projects. Recently, he has been analyzing raw click data from massive open online courses (MOOCs) with the hopes of improving courseware. The initial project is to determine stop-out (drop-out) rates based on online click behavior.

"The online learning platforms record data coming from the interaction of hundreds of thousands of learners," says Veeramachaneni. "We are now able to identify variables that can predict stop-out on a single course. The next stage is to reveal the variables of stop-out and show how to improve the course design."

The first challenge in the MOOC project was to organize the data. There are multiple data streams in addition to clickstream data, and they are usually spread over multiple databases and stored in multiple formats. Veeramachaneni has standardized these sources, integrating them into a single database called MOOCdb. "In this way, software written on top of the database can be re-used," says Veeramachaneni.

The next challenge is to decide what variables to look at. ALFA has explored all sorts of theories about MOOC behavior. For example, if a student is studying early in the morning, he or she is more likely to stay in the course. Another theory is based on dividing the time spent on the course by how many problems a student gets right. But, Veeramachaneni says, "If I'm trying to predict stop-out, there's no algorithm that automatically comes up with the behavioral variables that influence it. The biggest challenge is that the variables are defined by humans, which creates a big bottleneck."

They turned to crowdsourcing "to tap into as many people as we can," says Veeramachaneni. "We have built a crowdsourcing platform where people can submit an idea against problems such as stop-out," says Veeramachaneni. "Another set of people can operationalize that, such as writing a script to extract that variable on a per student basis."

This research could apply to a number of domains where analysts are trying to predict human behavior based on captured data, such as fraud detection, says Veeramachaneni. Banks and other companies are increasingly analyzing their transaction databases to try to determine whether the person doing the transaction is authentic.

"One variable would be how far the transaction happened from the person's home, or how the amount compares to the total that was spent by the person over the last year," says Veeramachaneni. "Coming up with these ideas is based on very relatable data with which we can all identify. So crowdsourcing could be helpful here, too."

Provided by Massachusetts Institute of Technology

This story is republished courtesy of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about MIT research, innovation and teaching.

Citation: Researcher tackles some of the biggest bottlenecks holding back the data science industry (2015, February 25) retrieved 17 July 2024 from https://phys.org/news/2015-02-tackles-biggest-bottlenecks-science-industry.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.

Explore further

Evolutionary approaches to big-data problems

57 shares

Feedback to editors

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

9 hours ago

Intensive farming could raise risk of new pandemics, researchers warn

10 hours ago

Scientists develop new AI method to create material 'fingerprints'

13 hours ago

Study shows frogs can quickly increase their tolerance to pesticides

14 hours ago

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

14 hours ago

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

14 hours ago

Scientists use machine learning to predict diversity of tree species in forests

15 hours ago

Physicists pool skills to better describe the unstable sigma meson particle

16 hours ago

Telescope tag-team discovers 10 strange and exotic pulsars

17 hours ago

NASA transmits hip-hop song to deep space for first time

17 hours ago

Load comments (0)

Researcher tackles some of the biggest bottlenecks holding back the data science industry

GigaBeats and BeatDB

A crowdsourcing solution

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

Intensive farming could raise risk of new pandemics, researchers warn

Scientists develop new AI method to create material 'fingerprints'

Study shows frogs can quickly increase their tolerance to pesticides

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

Scientists use machine learning to predict diversity of tree species in forests

Physicists pool skills to better describe the unstable sigma meson particle

Telescope tag-team discovers 10 strange and exotic pulsars

NASA transmits hip-hop song to deep space for first time

Relevant PhysicsForums posts

Particle.js: Exploring Particle Physics with Web Technologies

Help solving a geometrical matching issue with Graph Neural Networks

5 GHz PC WiFi connection Cybersecurity question

Help with some optimization code for Block Matrices

Is an API Always Necessary for Server-Client Communication?

I did this POST message configuration damage to my wifi internet, help

Evolutionary approaches to big-data problems

Simoctocog alfa for haemophilia A: No suitable data

Google DeepMind acquisition researchers working on a Neural Turing Machine

Artificial intelligence helps physicists predict dangerous solar flares

Path-finder computes search strategy to find Waldo

Machines teach astronomers about stars

Hyphens in paper titles harm citation counts and journal impact factors

A big step toward the practical application of 3-D holography with high-performance computers

Combining multiple CCTV images could help catch suspects

Applying deep learning to motion capture with DeepLabCut

Training artificial intelligence with artificial X-rays

New model for large-scale 3-D facial recognition

Medical Xpress

Tech Xplore

Science X

Researcher tackles some of the biggest bottlenecks holding back the data science industry

GigaBeats and BeatDB

A crowdsourcing solution

New 3D anatomical atlas of the African clawed frog increases understanding of development and metamorphosis processes

Intensive farming could raise risk of new pandemics, researchers warn

Scientists develop new AI method to create material 'fingerprints'

Study shows frogs can quickly increase their tolerance to pesticides

Nature-based solutions to disaster risk from climate change are cost-effective, study confirms

Astronomers discover what may be 21 neutron stars orbiting sun-like stars

Scientists use machine learning to predict diversity of tree species in forests

Physicists pool skills to better describe the unstable sigma meson particle

Telescope tag-team discovers 10 strange and exotic pulsars

NASA transmits hip-hop song to deep space for first time

Relevant PhysicsForums posts

Related Stories

Evolutionary approaches to big-data problems

Simoctocog alfa for haemophilia A: No suitable data

Google DeepMind acquisition researchers working on a Neural Turing Machine

Artificial intelligence helps physicists predict dangerous solar flares

Path-finder computes search strategy to find Waldo

Machines teach astronomers about stars

Recommended for you

Hyphens in paper titles harm citation counts and journal impact factors

A big step toward the practical application of 3-D holography with high-performance computers

Combining multiple CCTV images could help catch suspects

Applying deep learning to motion capture with DeepLabCut

Training artificial intelligence with artificial X-rays

New model for large-scale 3-D facial recognition

Newsletter sign up

Donate and enjoy an ad-free experience