Cloud computing: For database-driven applications, new software could reduce hardware requirements by 95 percent

March 12, 2013 by Larry Hardesty, Massachusetts Institute of Technology

For many companies, moving their web-application servers to the cloud is an attractive option, since cloud-computing services can offer economies of scale, extensive technical support and easy accommodation of demand fluctuations.

But for applications that depend heavily on database queries, cloud hosting can pose as many problems as it solves. Cloud services often partition their servers into "virtual machines," each of which gets so many operations per second on a server's processor, so much space in memory, and the like. That makes cloud servers easier to manage, but for database-intensive applications, it can result in the allocation of about 20 times as much hardware as should be necessary. And the cost of that overprovisioning gets passed on to customers.
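The "95 percent" in the headline follows from the "20 times" figure: if an application is allocated 20 times the hardware it needs, eliminating the excess leaves one-twentieth of the original allocation. A minimal sketch of that arithmetic (the function name is ours, purely for illustration):

```python
def hardware_reduction(overprovision_factor: float) -> float:
    """Fraction of hardware saved if overprovisioning by this factor is eliminated."""
    if overprovision_factor < 1:
        raise ValueError("overprovision_factor must be at least 1")
    # Needed hardware is 1/factor of what was allocated; the rest is saved.
    return 1.0 - 1.0 / overprovision_factor


print(f"{hardware_reduction(20):.0%}")  # prints 95%
```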

MIT researchers are developing a new system called DBSeer that should help solve this problem and others, such as the pricing of cloud services and the diagnosis of application slowdowns. At the recent Biennial Conference on Innovative Data Systems Research, the researchers laid out their vision for DBSeer. And in June, at the annual meeting of the Association for Computing Machinery's Special Interest Group on Management of Data (SIGMOD), they will unveil the algorithms at the heart of DBSeer, which use machine-learning techniques to build accurate models of performance and resource demands of database-driven applications.

DBSeer's advantages aren't restricted to cloud computing, either. Teradata, a major database company, has already assigned several of its engineers the task of importing the MIT researchers' new algorithm—which has been released under an open-source license—into its own software.

Virtual limitations

Barzan Mozafari, a postdoc in the lab of professor of electrical engineering and computer science Samuel Madden and lead author on both new papers, explains that, with virtual machines, server resources must be allocated according to an application's peak demand. "You're not going to hit your peak load all the time," Mozafari says. "So that means that these resources are going to be underutilized most of the time."
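Mozafari's point about peak provisioning can be made concrete with a small calculation. The sketch below assumes a made-up load trace (requests per second sampled at four points in a day) and a server sized exactly to the peak; neither comes from the researchers' data.

```python
def utilization_at_peak_provisioning(load):
    """Average utilization when capacity is sized to the highest observed load."""
    if not load:
        raise ValueError("load trace must not be empty")
    peak = max(load)
    if peak <= 0:
        raise ValueError("peak load must be positive")
    average = sum(load) / len(load)
    return average / peak


# Hypothetical trace: quiet most of the day, with one short burst.
trace = [10, 20, 100, 30]
print(utilization_at_peak_provisioning(trace))  # prints 0.4
```

In this invented trace, hardware sized for the burst sits 60 percent idle on average.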

Moreover, Mozafari says, virtual machines are, by design, isolated from each other: They can't share resources, even when they're running on the same physical server. With databases, that can mean wasteful duplication of a great deal of data.

And even the provisioning for peak demand is largely guesswork. "It's very counterintuitive," Mozafari says, "but you might take on certain types of extra load that might help your overall performance." Increased demand means that a database server will store more of its frequently used data in its high-speed memory, which can help it process requests more quickly.

On the other hand, a slight increase in demand could cause the system to slow down precipitously—if, for instance, too many requests require modification of the same pieces of data, which need to be updated on multiple servers. "It's extremely nonlinear," Mozafari says.

Mozafari, Madden, postdoc Alekh Jindal, and Carlo Curino, a former member of Madden's group who's now at Microsoft, use two different techniques in the SIGMOD paper to predict how a database-driven application will respond to increased load. Mozafari describes the first as a "black box" approach: DBSeer simply monitors fluctuations in both the number and type of user requests and system performance and uses machine-learning techniques to correlate the two. This approach is good at predicting the consequences of fluctuations that don't fall too far outside the range of the training data.
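The article does not say which machine-learning techniques DBSeer uses, so the following is only an illustration of the black-box idea, not the researchers' algorithm: observe how many requests of each type arrive and how much CPU is consumed, then fit a model that links the two. Here the model is an ordinary least-squares fit with two request types and no intercept, and all of the numbers are invented.

```python
def fit_two_type_model(samples):
    """Least-squares fit of cpu ~ w_read * reads + w_write * writes.

    samples: list of (reads_per_sec, writes_per_sec, cpu_percent) tuples.
    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    a11 = sum(r * r for r, _, _ in samples)
    a12 = sum(r * w for r, w, _ in samples)
    a22 = sum(w * w for _, w, _ in samples)
    b1 = sum(r * c for r, _, c in samples)
    b2 = sum(w * c for _, w, c in samples)
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("samples do not separate the two request types")
    w_read = (b1 * a22 - a12 * b2) / det
    w_write = (a11 * b2 - a12 * b1) / det
    return w_read, w_write


def predict_cpu(weights, reads, writes):
    """Predicted CPU percentage for a given mix of requests."""
    w_read, w_write = weights
    return w_read * reads + w_write * writes


# Hypothetical monitoring data: (reads/s, writes/s, CPU %).
observations = [(100, 10, 7), (200, 10, 12), (100, 50, 15), (300, 40, 23)]
weights = fit_two_type_model(observations)
print(predict_cpu(weights, 250, 20))
```

A model like this can only be trusted near the loads it was trained on, which is the limitation the article notes for the black-box approach.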

Gray areas

Often, however, database managers—or prospective cloud-computing customers—will be interested in the consequences of a fourfold, tenfold, or even hundredfold increase in demand. For those types of predictions, Mozafari explains, DBSeer uses a "gray box" model, which takes into account the idiosyncrasies of particular database systems.

For instance, Mozafari explains, updating data stored on a hard drive is time-consuming, so most database servers will try to postpone that operation as long as they can, instead storing data modifications in the much faster—but volatile—main memory. At some point, however, the server has to commit its pending modifications to disk, and the criteria for making that decision can vary from one database system to another.
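The deferred-write behavior Mozafari describes can be sketched as a toy buffer that keeps modified pages in memory and writes them to disk only when a limit is reached. The class, its names, and the "flush after N dirty pages" rule are all assumptions for illustration; real database systems use their own, more elaborate criteria, which is exactly what a gray-box model has to capture.

```python
class WriteBackBuffer:
    """Toy model of a database that postpones disk writes."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # hypothetical flush criterion
        self.dirty = {}       # modified pages held in fast, volatile memory
        self.disk = {}        # pages that have been committed to disk
        self.flush_count = 0  # number of (slow) disk flushes so far

    def write(self, page_id, value):
        # Repeated updates to the same page overwrite each other in memory,
        # so they cost only one disk write when the flush finally happens.
        self.dirty[page_id] = value
        if len(self.dirty) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.disk.update(self.dirty)
        self.dirty.clear()
        self.flush_count += 1


buf = WriteBackBuffer(flush_threshold=3)
for page, value in [(1, "a"), (1, "b"), (2, "c"), (3, "d")]:
    buf.write(page, value)
print(buf.flush_count, buf.disk)  # prints 1 {1: 'b', 2: 'c', 3: 'd'}
```

Four updates trigger a single flush here, and the timing of that flush depends entirely on the chosen threshold, so disk activity does not grow in simple proportion to the request rate.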

The version of DBSeer presented at SIGMOD includes a gray-box model of MySQL, one of the most widely used database systems. The researchers are currently building a new model for another popular system, PostgreSQL. Although adapting the model isn't a negligible undertaking, models tailored to just a handful of systems would cover the large majority of database-driven Web applications.

The researchers tested their prediction algorithm against both a set of benchmark data called TPC-C, which is commonly used in database research, and real-world data on modifications to the Wikipedia database. On average, the model was about 80 percent accurate in predicting CPU use and 99 percent accurate in predicting the bandwidth consumed by disk operations.

"We're really fascinated and thrilled that someone is doing this work," says Doug Brown, a database software architect at Teradata. "We've already taken the code and are prototyping right now." Initially, Brown says, Teradata will use the MIT researchers' prediction algorithm to determine customers' resource requirements. "The really big question for our customers is, 'How are we going to scale?'" Brown says.

Brown hopes, however, that the algorithm will ultimately help allocate server resources on the fly, as database requests come in. If servers can assess the demands imposed by individual requests and budget accordingly, they can ensure that transaction times stay within the bounds set by customers' service agreements. For instance, "if you have two big, big resource consumers, you can calculate ahead of time that we're only going to run two of these in parallel," Brown says. "There's all kinds of games you can play in workload management."
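Brown's example of running only two big resource consumers in parallel amounts to a simple admission rule. The sketch below is one hypothetical way to express it, assuming each request arrives with a predicted cost from some model; the class, the threshold, and the cost values are invented and do not describe Teradata's software.

```python
class AdmissionController:
    """Admit requests, but cap how many heavy ones run at the same time."""

    def __init__(self, heavy_threshold, max_heavy=2):
        self.heavy_threshold = heavy_threshold  # predicted cost that counts as heavy
        self.max_heavy = max_heavy              # heavy requests allowed in parallel
        self.running_heavy = 0

    def _is_heavy(self, predicted_cost):
        return predicted_cost >= self.heavy_threshold

    def try_admit(self, predicted_cost):
        """Return True if the request may start now, False if it must wait."""
        if self._is_heavy(predicted_cost):
            if self.running_heavy >= self.max_heavy:
                return False
            self.running_heavy += 1
        return True

    def release(self, predicted_cost):
        """Call when a previously admitted request finishes."""
        if self._is_heavy(predicted_cost) and self.running_heavy > 0:
            self.running_heavy -= 1


ctl = AdmissionController(heavy_threshold=50)
decisions = [ctl.try_admit(cost) for cost in (90, 95, 99, 5)]
print(decisions)  # prints [True, True, False, True]
ctl.release(90)
print(ctl.try_admit(99))  # prints True
```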


More information: Paper (PDF): Performance and Resource Modeling in Highly-Concurrent OLTP Workloads




