Case studies:
Data Analytics for Computer Systems Management

Challenge

The ability to predict service problems in computer networks, and to respond to those predictions with corrective actions, brings multiple benefits. First, detecting system failures on a few servers can prevent those failures from spreading across the entire network. For example, slow response time on one server may gradually escalate into technical difficulties on every node that attempts to communicate with that server. Second, prediction can help ensure continuous provision of network services through the automatic implementation of corrective actions. For example, a prediction of high CPU demand on a server can trigger a process that balances the CPU load by re-routing new requests to a back-up server. Despite these advantages, prediction for computer systems poses a major challenge: how do we select the predictive algorithms that match the characteristics of the problem at hand?
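The CPU example above can be made concrete with a minimal sketch. It assumes a simple exponential-smoothing forecast and a fixed utilization threshold. Both are illustrative choices and not the method used in this case study, and the names smoothed_forecast and choose_server are hypothetical.

```python
def smoothed_forecast(history, alpha=0.5):
    """One-step-ahead forecast by simple exponential smoothing.

    Assumption: exponential smoothing stands in for whatever
    predictive algorithm is actually selected.
    """
    level = history[0]
    for value in history[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


def choose_server(cpu_history, threshold=80.0):
    """Route new requests to the back-up server when the predicted
    CPU utilization (percent) exceeds the hypothetical threshold."""
    predicted = smoothed_forecast(cpu_history)
    return "backup" if predicted > threshold else "primary"


# Hypothetical CPU utilization readings, in percent.
rising_load = [50, 70, 90, 100]
steady_load = [30, 40, 35]
```

With these readings, choose_server(rising_load) selects the back-up server because the forecast is above the threshold, while choose_server(steady_load) keeps traffic on the primary server.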




  • To make predictions, we first collect historical information. For example, monitoring systems capture disk utilization on a server by aggregating data over short time intervals. In this case, the historical information consists of values characterizing the status of the computer at successive time steps. The values record the amount of disk utilization, memory utilization, and CPU utilization.
  • We then look for a technique matching the characteristics of the problem. Important factors include whether the data is discrete or continuous, whether observations are taken at equal time intervals, and whether the data is aggregated over time intervals or corresponds to instantaneous values.
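The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions: samples are (timestamp in seconds, utilization in percent) pairs, aggregation means averaging within fixed-width intervals, and the function names and sample values are hypothetical.

```python
from statistics import mean


def aggregate(samples, interval):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by time.
    """
    buckets = {}
    for timestamp, value in samples:
        start = (timestamp // interval) * interval
        buckets.setdefault(start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def equally_spaced(timestamps, tolerance=0):
    """Check whether consecutive observations are taken at equal intervals."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(abs(gap - gaps[0]) <= tolerance for gap in gaps)


# Hypothetical disk-utilization readings: (seconds, percent).
disk_samples = [(0, 40.0), (20, 50.0), (65, 60.0), (70, 80.0), (130, 90.0)]

# Aggregate the irregular raw readings into one-minute averages.
disk_by_minute = aggregate(disk_samples, interval=60)
```

Here the raw readings are not equally spaced, but the one-minute averages are. Characteristics of this kind are what guide the choice of a prediction technique.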



The experiments were conducted on a central database holding performance data from thousands of IBM AS/400 computers. Predictions were generated for six important parameters: response time, maximum response time, CPU utilization, memory utilization, disk utilization, and disk arm utilization. For all six performance variables under study, applying the model based on the learning approach yielded significant gains in accuracy.


Conclusions and Benefits

This case study shows how predictive algorithms play a crucial role in systems management by alerting the user to potential failures. Improving the accuracy of computer performance prediction can save millions of dollars for critical applications where continuous service is required (e.g., banking applications and security systems).