Intelligent algorithms can help reduce disruptions in online services
When the Swedish Tax Agency website went down for nearly two days, thousands of people across Sweden were left frustrated. Users of Gmail, Slack, Facebook, Apple services, and other popular services also regularly experience delays and disruptions of this kind. Olumuyiwa Ibidunmoye has developed automated troubleshooting algorithms to prevent prolonged delays or disruptions in services hosted on cloud computing servers.
Operational issues in online services have become common and harder to manage because of the rapid growth of Internet services and the computing infrastructure on which they run, as well as the unpredictable volume of user traffic. Service delays and outages have both financial and operational implications. For example, a single hour-long episode may cost an e-commerce website like Amazon.com millions of dollars in lost sales, while service vendors spend many work-hours restoring services.
The main challenge is that a problem may have any of several causes. Delays may be caused by coding errors, inadequate server resources, or competition between hundreds of applications running on the same servers.
Hence, to minimize the impact of such problems or prevent their recurrence, the status of systems must be measured continuously in order to address two troubleshooting concerns. The first is how to detect and diagnose symptoms of problems, or 'anomalies', such as unexpected spikes or dips in service status over time. The second is how to intelligently determine and execute corrective actions that fully restore services.
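To make the first concern concrete, here is a minimal sketch of how spikes and dips in a monitored metric could be flagged. It is not the method from the dissertation: it swaps in a simple rolling z-score check, and the function name `find_anomalies`, the parameters `window`, `threshold` and `min_sigma`, and the sample values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev

def find_anomalies(samples, window=5, threshold=3.0, min_sigma=1.0):
    """Flag samples that deviate sharply from the preceding window.

    Illustrative assumption: a sample is anomalous when it lies more than
    `threshold` standard deviations from the mean of the previous
    `window` samples. `min_sigma` avoids a zero spread on flat history.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        deviation = samples[i] - mu
        if abs(deviation) > threshold * sigma:
            kind = "spike" if deviation > 0 else "dip"
            anomalies.append((i, samples[i], kind))
    return anomalies

# Hypothetical measurements of a service metric taken at regular intervals.
samples = [100, 102, 98, 101, 99, 100, 250, 101, 99, 100, 102, 98, 30, 100]
result = find_anomalies(samples)
print(result)
```

A detector this simple has a clear weakness: once a spike enters the window it inflates the spread, so a second anomaly shortly afterwards can go unnoticed. That is one reason techniques that adapt to changes in system state are of interest.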
Olumuyiwa Ibidunmoye's dissertation introduces an automated approach for addressing these two concerns in the context of problems caused by inadequate server resources and by the effects of multiple applications sharing the same servers in cloud computing systems.
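The second concern, corrective action for inadequate server resources, can be pictured as adjusting capacity in small steps in response to demand. The sketch below is an assumption made for illustration, not the controller from the dissertation: the function `adjust_capacity`, the latency target, the step size, and the capacity limits are all hypothetical.

```python
def adjust_capacity(capacity, latency_ms, target_ms=200.0, step=1,
                    min_capacity=1, max_capacity=16, slack=0.5):
    """Return a new capacity one step closer to what the demand needs.

    Illustrative assumption: capacity is a count of resource units
    (for example CPU cores), and measured latency is the only signal.
    """
    if latency_ms > target_ms:
        # Service is too slow: add one step of capacity, up to the limit.
        return min(capacity + step, max_capacity)
    if latency_ms < target_ms * slack:
        # Service is comfortably fast: release one step of capacity.
        return max(capacity - step, min_capacity)
    # Latency is within the acceptable band: leave capacity unchanged.
    return capacity

# Hypothetical latency readings observed after each adjustment.
observed_latencies = [350, 280, 210, 150, 90, 80]
capacity = 2
trace = []
for latency in observed_latencies:
    capacity = adjust_capacity(capacity, latency)
    trace.append(capacity)
print(trace)
```

Changing capacity one step at a time trades speed of recovery for stability: the service takes a few rounds to catch up with a sudden surge, but it avoids large swings in allocated resources.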
"I have developed and investigated techniques that automatically uncover symptoms of problems and intelligently rank them with limited human intervention, adapt to changes in the state of the systems, and resolve service delays by incrementally adjusting capacity of server resources in response to demand," says Olumuyiwa Ibidunmoye.