For many companies, moving their web-application servers to the cloud is an attractive option, since cloud-computing services can offer economies of scale, extensive technical support and easy accommodation of demand fluctuations.
But for applications that depend heavily on database queries, cloud hosting can pose as many problems as it solves. Cloud services often partition their servers into "virtual machines," each of which gets so many operations per second on a server's central processing unit, so much space in memory, and the like. That makes cloud servers easier to manage, but for database-intensive applications, it can result in the allocation of about 20 times as much hardware as should be necessary. And the cost of that overprovisioning gets passed on to customers.
MIT researchers are developing a new system called DBSeer that should help solve this problem and others, such as the pricing of cloud services and the diagnosis of application slowdowns. At the recent Biennial Conference on Innovative Data Systems Research, the researchers laid out their vision for DBSeer. And in June, at the annual meeting of the Association for Computing Machinery's Special Interest Group on Management of Data (SIGMOD), they will unveil the algorithms at the heart of DBSeer, which use machine-learning techniques to build accurate models of performance and resource demands of database-driven applications.
DBSeer's advantages aren't restricted to cloud computing, either. Teradata, a major database company, has already assigned several of its engineers the task of importing the MIT researchers' new algorithm—which has been released under an open-source license—into its own software.
Barzan Mozafari, a postdoc in the lab of professor of electrical engineering and computer science Samuel Madden and lead author on both new papers, explains that, with virtual machines, server resources must be allocated according to an application's peak demand. "You're not going to hit your peak load all the time," Mozafari says. "So that means that these resources are going to be underutilized most of the time."
Moreover, Mozafari says, virtual machines are, by design, isolated from each other: They can't share resources, even when they're running on the same physical server. With databases, that can mean wasteful duplication of a great deal of data.
And even the provisioning for peak demand is largely guesswork. "It's very counterintuitive," Mozafari says, "but you might take on certain types of extra load that might help your overall performance." Increased demand means that a database server will store more of its frequently used data in its high-speed memory, which can help it process requests more quickly.
On the other hand, a slight increase in demand could cause the system to slow down precipitously—if, for instance, too many requests require modification of the same pieces of data, which need to be updated on multiple servers. "It's extremely nonlinear," Mozafari says.
Mozafari, Madden, postdoc Alekh Jindal, and Carlo Curino, a former member of Madden's group who's now at Microsoft, use two different techniques in the SIGMOD paper to predict how a database-driven application will respond to increased load. Mozafari describes the first as a "black box" approach: DBSeer simply monitors fluctuations in both the number and type of user requests and system performance and uses machine-learning techniques to correlate the two. This approach is good at predicting the consequences of fluctuations that don't fall too far outside the range of the training data.
Often, however, database managers—or prospective cloud-computing customers—will be interested in the consequences of a fourfold, tenfold, or even hundredfold increase in demand. For those types of predictions, Mozafari explains, DBSeer uses a "gray box" model, which takes into account the idiosyncrasies of particular database systems.
For instance, Mozafari explains, updating data stored on a hard drive is time-consuming, so most database servers will try to postpone that operation as long as they can, instead storing data modifications in the much faster—but volatile—main memory. At some point, however, the server has to commit its pending modifications to disk, and the criteria for making that decision can vary from one database system to another.
The version of DBSeer presented at SIGMOD includes a gray-box model of MySQL, one of the most widely used database systems. The researchers are currently building a new model for another popular system, PostgreSQL. Although adapting the model isn't a negligible undertaking, models tailored to just a handful of systems would cover the large majority of database-driven Web applications.
The researchers tested their prediction algorithm against both a set of benchmark data, called TPC-C, that's commonly used in database research and against real-world data on modifications to the Wikipedia database. On average, the model was about 80 percent accurate in predicting CPU use and 99 percent accurate in predicting the bandwidth consumed by disk operations.
"We're really fascinated and thrilled that someone is doing this work," says Doug Brown, a database software architect at Teradata. "We've already taken the code and are prototyping right now." Initially, Brown says, Teradata will use the MIT researchers' prediction algorithm to determine customers' resource requirements. "The really big question for our customers is, 'How are we going to scale?'" Brown says.
Brown hopes, however, that the algorithm will ultimately help allocate server resources on the fly, as database requests come in. If servers can assess the demands imposed by individual requests and budget accordingly, they can ensure that transaction times stay within the bounds set by customers' service agreements. For instance, "if you have two big, big resource consumers, you can calculate ahead of time that we're only going to run two of these in parallel," Brown says. "There's all kinds of games you can play in workload management."
Explore further: IBM Offers Paid Support for No-Cost Data Server
Paper (PDF): Performance and Resource Modeling in Highly-Concurrent OLTP Workloads