Big-data analysis consists of searching for buried patterns that have some kind of predictive power. But choosing which "features" of the data to analyze usually requires some human intuition. In a database containing, say, the beginning and end dates of various sales promotions and weekly profits, the crucial data may not be the dates themselves but the spans between them, or not the total profits but the averages across those spans.
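The promotions example can be made concrete with a minimal sketch. The records, field names, and the `derive_features` helper below are hypothetical, invented only to illustrate how a span and an average across it might be derived from raw dates and totals.

```python
from datetime import date

# Hypothetical promotion records: start date, end date, and total profit.
promotions = [
    {"start": date(2015, 1, 5), "end": date(2015, 1, 19), "profit": 4200.0},
    {"start": date(2015, 3, 2), "end": date(2015, 3, 9), "profit": 2800.0},
]

def derive_features(promo):
    """Turn raw dates and a total into span-based features."""
    span_days = (promo["end"] - promo["start"]).days
    span_weeks = span_days / 7
    return {
        "span_days": span_days,
        "avg_weekly_profit": promo["profit"] / span_weeks,
    }

features = [derive_features(p) for p in promotions]
```

Neither derived value is stored in the raw records, yet either could turn out to be the more predictive quantity.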
MIT researchers aim to take the human element out of big-data analysis, with a new system that not only searches for patterns but designs the feature set, too. To test the first prototype of their system, they enrolled it in three data science competitions, in which it competed against human teams to find predictive patterns in unfamiliar data sets. Of the 906 teams participating in the three competitions, the researchers' "Data Science Machine" finished ahead of 615.
In two of the three competitions, the predictions made by the Data Science Machine were 94 percent and 96 percent as accurate as the winning submissions. In the third, the figure was a more modest 87 percent. But where the teams of humans typically labored over their prediction algorithms for months, the Data Science Machine took somewhere between two and 12 hours to produce each of its entries.
"We view the Data Science Machine as a natural complement to human intelligence," says Max Kanter, whose MIT master's thesis in computer science is the basis of the Data Science Machine. "There's so much data out there to be analyzed. And right now it's just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving."
Between the lines
Kanter and his thesis advisor, Kalyan Veeramachaneni, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), describe the Data Science Machine in a paper that Kanter will present next week at the IEEE International Conference on Data Science and Advanced Analytics.
Veeramachaneni co-leads the Anyscale Learning for All group at CSAIL, which applies machine-learning techniques to practical problems in big-data analysis, such as determining the power-generation capacity of wind-farm sites or predicting which students are at risk for dropping out of online courses.
"What we observed from our experience solving a number of data science problems for industry is that one of the very critical steps is called feature engineering," Veeramachaneni says. "The first thing you have to do is identify what variables to extract from the database or compose, and for that, you have to come up with a lot of ideas."
In predicting dropout, for instance, two crucial indicators proved to be how long before a deadline a student begins working on a problem set and how much time the student spends on the course website relative to his or her classmates. MIT's online-learning platform MITx doesn't record either of those statistics, but it does collect data from which they can be inferred.
Featured composition
Kanter and Veeramachaneni use a couple of tricks to manufacture candidate features for data analyses. One is to exploit structural relationships inherent in database design. Databases typically store different types of data in different tables, indicating the relationships between them using numerical identifiers. The Data Science Machine tracks these relationships, using them as a cue to feature construction.
For instance, one table might list retail items and their costs; another might list items included in individual customers' purchases. The Data Science Machine would begin by importing costs from the first table into the second. Then, taking its cue from the association of several different items in the second table with the same purchase number, it would execute a suite of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of each other, finding minima of averages, averages of sums, and so on.
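The retail example might be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' implementation: the tables, the customer column, and every name in the code are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tables. item_costs maps item id -> cost;
# purchase_lines lists (customer_id, order_id, item_id) rows.
item_costs = {"pen": 2.0, "notebook": 5.0, "bag": 30.0}
purchase_lines = [
    ("alice", 1, "pen"),
    ("alice", 1, "notebook"),
    ("alice", 2, "bag"),
    ("bob", 3, "pen"),
    ("bob", 3, "bag"),
]

# Step 1: import costs from the first table into the second,
# grouping them by the shared order number.
costs_by_order = defaultdict(list)
customer_of_order = {}
for customer, order, item in purchase_lines:
    costs_by_order[order].append(item_costs[item])
    customer_of_order[order] = customer

# Step 2: run a suite of aggregations over each group.
AGGREGATIONS = {"sum": sum, "mean": mean, "min": min}
order_features = {
    order: {name: fn(costs) for name, fn in AGGREGATIONS.items()}
    for order, costs in costs_by_order.items()
}

# Step 3: layer a second aggregation on top, one level up:
# the average of per-order sums for each customer.
sums_by_customer = defaultdict(list)
for order, feats in order_features.items():
    sums_by_customer[customer_of_order[order]].append(feats["sum"])
mean_order_total = {c: mean(s) for c, s in sums_by_customer.items()}
```

Here `order_features[1]` holds the total, average, and minimum cost of order 1, and `mean_order_total` is an "average of sums" of the kind the article describes.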
It also looks for so-called categorical data, which appear to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing up existing features across categories.
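A minimal sketch of this step, with the caveat that the article does not say how the system decides a column is categorical. The `looks_categorical` heuristic, its `max_distinct` cutoff, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical purchase rows: (order_id, weekday, cost).
rows = [
    (1, "Mon", 7.0),
    (2, "Tue", 30.0),
    (3, "Mon", 32.0),
    (4, "Tue", 4.0),
    (5, "Mon", 9.0),
    (6, "Tue", 11.0),
]

def looks_categorical(values, max_distinct=3):
    """Assumed heuristic: few distinct values, with repeats among them."""
    distinct = set(values)
    return len(distinct) <= max_distinct and len(distinct) < len(values)

def split_by_category(categories, values):
    """Total an existing numeric feature separately per category."""
    totals = defaultdict(float)
    for category, value in zip(categories, values):
        totals[category] += value
    return dict(totals)

order_ids = [r[0] for r in rows]
weekdays = [r[1] for r in rows]
costs = [r[2] for r in rows]

# Only the weekday column passes the heuristic, so only it is used
# to divide the cost feature into per-category candidates.
cost_by_weekday = (
    split_by_category(weekdays, costs) if looks_categorical(weekdays) else {}
)
```

The order identifiers and the costs fail the heuristic because every value is distinct, so they are not used as categories.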
Once it's produced an array of candidates, it reduces their number by identifying those whose values seem to be correlated. Then it starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield.
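The reduction step could look something like the sketch below. The article does not specify the correlation measure, the cutoff, or which member of a correlated pair is dropped, so the Pearson coefficient, the 0.95 threshold, and the greedy keep-the-first rule are all assumptions. The later step of testing and recombining features on sample data is not shown.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)

def prune_correlated(features, threshold=0.95):
    """Greedily keep features not strongly correlated with any kept one."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical candidate features, one value per order.
candidates = {
    "total_cost": [7.0, 30.0, 32.0, 4.0],
    "total_cost_cents": [700.0, 3000.0, 3200.0, 400.0],  # redundant rescaling
    "item_count": [2.0, 1.0, 2.0, 3.0],
}
selected = prune_correlated(candidates)
```

In this toy input, `total_cost_cents` is a rescaled copy of `total_cost` and is discarded, while `item_count` is only moderately correlated with it and survives.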
"The Data Science Machine is one of those unbelievable projects where applying cutting-edge research to solve practical problems opens an entirely new way of looking at the problem," says Margo Seltzer, a professor of computer science at Harvard University who was not involved in the work. "I think what they've done is going to become the standard quickly—very quickly."
More information:
"Deep Feature Synthesis: Towards Automating Data Science Endeavors." groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/uploads/Site/DSAA_DSM_2015.pdf

Vidyaguy
5 / 5 (2) Oct 16, 2015
One of the problems with AI, if it does transcend that special human capability, is that human existence either loses its purpose, or it manages to make the shift from seeker to helper.
EyeNStein
1 / 5 (1) Oct 17, 2015
The fact that fast methodical data crunching computers often outperform the human preference to guess then check solutions on any data set is not at all surprising. The fact that humans still outperformed computers on some data solution tasks just shows we are still more flexible than programming will ever be.
jerromyjon
3 / 5 (2) Oct 17, 2015
The fact is that no human understands how their mind does what it does in a 1+1=2 basis. The only way to level the field is to allow the "machine" to decide what it thinks about and give it goals to achieve to "survive". When it is implemented correctly it could determine it needs to take over the world to prevent humanity from destroying itself, which it needs for survival. That would be inhumane to the machine.
Or we could forget this mostly useless data and concentrate on saving ourselves and when we achieve a sustainable peaceful future the "machines" could mutually coexist with us, complementing our existence.
Dug
not rated yet Oct 17, 2015
As an average human intuit, I find it suspicious that Google can't produce a highly accurate search engine (actually less specific than it used to be) - while MIT researchers can produce a program that sorts less than obvious but significant patterns from data. Of course MIT hasn't yet faced monetizing its algorithms and the human greed factor that invariably reduces their performance efficiency.
jeffensley
3 / 5 (2) Oct 17, 2015
Exactly... science needs guidance from a moral body composed of spiritual leaders, philosophers and researchers to help choose paths to follow and those to avoid. Right now, "I wonder if we can do this?" is the only question that seems to be asked in regards to research. In regards to the subject above, no one seems to play out the consequences of letting machines do everything (including thinking) for us. That scenario needs to be considered for every form of research that has the potential to change the world as we know it... astrophysics, genetics, atomic physics, etc...
jerromyjon
not rated yet Oct 18, 2015
I see the effects on a regular basis. Simple example, how many people still remember important phone numbers? In my experience perhaps 10% feel it is important to maintain unaided mental abilities. Very scary.
ProcrastinationAccountNumber3659
1 / 5 (4) Oct 18, 2015
In the end computers are just tools. We have progressed significantly with tools to augment our physical capabilities and now we are using tools that augment our mental capabilities.
marcush
2.3 / 5 (3) Oct 21, 2015
So you'd rather prejudices narrow our knowledge? e.g. creationists tell us not to study evolution? You've got to be joking. It's up to us how we use the findings of Science - after the fact.
marcush
1 / 5 (1) Oct 21, 2015
You have to be careful when using the term "purpose" as it can imply something supernatural. Perhaps what you mean is a satisfying lifestyle. Again, it's up to us to decide how technology is utilised whether it's AI or H-bombs. Technology has always been a two-edged sword.
jeffensley
5 / 5 (2) Oct 26, 2015
Where did I say let Creationists make the calls? I would intentionally include people from a variety of backgrounds. "Discovery" alone is not a good enough reason to pursue some paths. Why don't we do experiments on human children to see how much pain they can tolerate before losing consciousness?