Cloud computing: For database-driven applications, new software could reduce hardware requirements by 95 percent

Mar 12, 2013 by Larry Hardesty
Making cloud computing more efficient
Credit: CHRISTINE DANILOFF/MIT

For many companies, moving their web-application servers to the cloud is an attractive option, since cloud-computing services can offer economies of scale, extensive technical support and easy accommodation of demand fluctuations.

But for applications that depend heavily on queries, cloud hosting can pose as many problems as it solves. Cloud services often partition their servers into "virtual machines," each of which gets so many operations per second on a server's central processing unit, so much space in memory, and the like. That makes cloud servers easier to manage, but for database-intensive applications, it can result in the allocation of about 20 times as much hardware as should be necessary. And the cost of that overprovisioning gets passed on to customers.
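For context, a back-of-the-envelope illustration (an assumed calculation, not a figure from the researchers) connects that 20-fold overprovisioning to the 95 percent reduction cited in the headline: if only a twentieth of the allocated hardware is actually needed, trimming the excess cuts the requirement by 95 percent.

# Illustrative arithmetic only: how a 20x overprovisioning factor maps to
# the roughly 95 percent hardware reduction cited in the headline.
overprovision_factor = 20               # hardware allocated vs. hardware needed
needed_fraction = 1 / overprovision_factor
reduction = 1 - needed_fraction         # 0.95
print(f"Potential hardware reduction: {reduction:.0%}")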

MIT researchers are developing a new system called DBSeer that should help solve this problem and others, such as the pricing of cloud services and the diagnosis of application slowdowns. At the recent Biennial Conference on Innovative Data Systems Research, the researchers laid out their vision for DBSeer. And in June, at the annual meeting of the Association for Computing Machinery's Special Interest Group on Management of Data (SIGMOD), they will unveil the algorithms at the heart of DBSeer, which use machine-learning techniques to build accurate models of performance and resource demands of database-driven applications.

DBSeer's advantages aren't restricted to cloud computing, either. Teradata, a major database company, has already assigned several of its engineers the task of importing the MIT researchers' new algorithm—which has been released under an open-source license—into its own software.

Virtual limitations

Barzan Mozafari, a postdoc in the lab of professor of electrical engineering and computer science Samuel Madden and lead author on both new papers, explains that, with virtual machines, server resources must be allocated according to an application's peak demand. "You're not going to hit your peak load all the time," Mozafari says. "So that means that these resources are going to be underutilized most of the time."

Moreover, Mozafari says, virtual machines are, by design, isolated from each other: They can't share resources, even when they're running on the same physical server. With databases, that can mean wasteful duplication of a great deal of data.

And even the provisioning for peak demand is largely guesswork. "It's very counterintuitive," Mozafari says, "but you might take on certain types of extra load that might help your overall performance." Increased demand means that a database server will store more of its frequently used data in its high-speed memory, which can help it process requests more quickly.

On the other hand, a slight increase in demand could cause the system to slow down precipitously—if, for instance, too many requests require modification of the same pieces of data, which need to be updated on multiple servers. "It's extremely nonlinear," Mozafari says.

Mozafari, Madden, postdoc Alekh Jindal, and Carlo Curino, a former member of Madden's group who's now at Microsoft, use two different techniques in the SIGMOD paper to predict how a database-driven application will respond to increased load. Mozafari describes the first as a "black box" approach: DBSeer simply monitors fluctuations both in the number and type of user requests and in system performance, and uses machine-learning techniques to correlate the two. This approach is good at predicting the consequences of fluctuations that don't fall too far outside the range of the training data.
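As a rough sketch of the black-box idea (not DBSeer's actual code; the transaction types, training data, and choice of plain linear regression are assumptions for illustration), the approach amounts to fitting a statistical model that maps an observed transaction mix to a measured resource demand:

# Black-box sketch: learn the correlation between the transaction mix
# observed in each monitoring interval and the measured CPU utilization.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monitoring data: counts of three transaction types per
# interval, and the CPU utilization measured over the same interval.
X_train = np.array([
    [120, 30, 5],      # e.g. [reads, writes, reports]
    [200, 45, 8],
    [310, 70, 12],
    [150, 40, 6],
])
y_train = np.array([0.22, 0.35, 0.55, 0.28])   # fraction of CPU in use

model = LinearRegression().fit(X_train, y_train)

# Reasonable for workloads near the training range, unreliable far outside it.
forecast_mix = np.array([[250, 60, 10]])
print("Predicted CPU utilization:", model.predict(forecast_mix)[0])

A model like this extrapolates poorly, which is exactly why the researchers pair it with the gray-box approach described below.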

Gray areas

Often, however, database managers—or prospective cloud-computing customers—will be interested in the consequences of a fourfold, tenfold, or even hundredfold increase in demand. For those types of predictions, Mozafari explains, DBSeer uses a "gray box" model, which takes into account the idiosyncrasies of particular database systems.

For instance, Mozafari explains, updating data stored on a hard drive is time-consuming, so most database servers will try to postpone that operation as long as they can, instead storing data modifications in the much faster—but volatile—main memory. At some point, however, the server has to commit its pending modifications to disk, and the criteria for making that decision can vary from one database system to another.
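A toy model of that behavior (an illustrative assumption, not MySQL's actual flushing policy) shows why such system-specific rules matter for prediction: a small change in update load can push the server from almost never flushing to flushing constantly.

# Gray-box sketch: defer disk writes in a memory buffer and flush only
# when a dirty-page threshold is crossed. The fixed threshold is an
# illustrative assumption; real systems use more elaborate criteria.
def disk_page_writes(updates_per_sec, dirty_page_limit=1000, seconds=60):
    dirty_pages = 0
    page_writes = 0
    for _ in range(seconds):
        dirty_pages += updates_per_sec
        if dirty_pages >= dirty_page_limit:
            page_writes += dirty_pages   # commit pending changes to disk
            dirty_pages = 0
    return page_writes

for load in (10, 16, 17, 40):
    print(f"{load} updates/s -> {disk_page_writes(load)} page writes in 60 s")

The jump between 16 and 17 updates per second in this toy example is the kind of nonlinearity a purely black-box model trained at low load would miss.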

The version of DBSeer presented at SIGMOD includes a gray-box model of MySQL, one of the most widely used database systems. The researchers are currently building a new model for another popular system, PostgreSQL. Although adapting the model isn't a negligible undertaking, models tailored to just a handful of systems would cover the large majority of database-driven Web applications.

The researchers tested their prediction algorithm against both TPC-C, a benchmark data set commonly used in database research, and real-world data on modifications to the Wikipedia database. On average, the model was about 80 percent accurate in predicting CPU use and 99 percent accurate in predicting the bandwidth consumed by disk operations.
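The article does not say how those accuracy figures were computed; one common convention, used here purely as an assumed illustration, is to report accuracy as one minus the mean relative error between predicted and measured values.

# Illustrative only: an assumed accuracy metric (1 - mean relative error);
# the paper's exact evaluation methodology is not described in the article.
def accuracy(predicted, measured):
    rel_errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 1 - sum(rel_errors) / len(rel_errors)

# Hypothetical CPU-utilization predictions versus measurements.
print(f"Accuracy: {accuracy([0.30, 0.52, 0.71], [0.35, 0.50, 0.80]):.0%}")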

"We're really fascinated and thrilled that someone is doing this work," says Doug Brown, a database software architect at Teradata. "We've already taken the code and are prototyping right now." Initially, Brown says, Teradata will use the MIT researchers' prediction algorithm to determine customers' resource requirements. "The really big question for our customers is, 'How are we going to scale?'" Brown says.

Brown hopes, however, that the algorithm will ultimately help allocate server resources on the fly, as database requests come in. If servers can assess the demands imposed by individual requests and budget accordingly, they can ensure that transaction times stay within the bounds set by customers' service agreements. For instance, "if you have two big, big resource consumers, you can calculate ahead of time that we're only going to run two of these in parallel," Brown says. "There's all kinds of games you can play in workload management."
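A minimal sketch of that workload-management idea (an illustration of the concept, not Teradata's implementation; the request names, predicted costs, and capacity budget are hypothetical) is an admission check that starts a queued request only if its predicted resource demand still fits within the server's budget:

# Workload-management sketch: admit queued requests only while their
# predicted resource demand fits the available capacity; defer the rest.
from collections import deque

def admit(queue, predicted_cost, capacity):
    """queue: request ids in arrival order; predicted_cost: id -> predicted
    CPU share; capacity: total CPU share available right now."""
    running, used, pending = [], 0.0, deque(queue)
    while pending and used + predicted_cost[pending[0]] <= capacity:
        req = pending.popleft()
        running.append(req)
        used += predicted_cost[req]
    return running, list(pending)

# Three predicted "big resource consumers": only two run in parallel.
costs = {"big_report_1": 0.45, "big_report_2": 0.45, "big_report_3": 0.45}
now, deferred = admit(["big_report_1", "big_report_2", "big_report_3"], costs, capacity=1.0)
print("run now:", now, "| deferred:", deferred)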

More information: Paper (PDF): Performance and Resource Modeling in Highly-Concurrent OLTP Workloads
