iTrack: Correlating user activity with system data
Human error has been identified one of the major factors behind system outages and network downtime in a number of previous research papers and surveys. Gartner statistics show that almost 40% of unplanned application downtime is caused due to operator errors such as unintentional changes to network configuration resulting in a network outage, patch installations, service restart, etc. Yet, system admin activities on production IT systems are rarely properly logged and monitored. Existing tools to track user activities either produce too much information without any hints of a potential outage scenario or too little information to be useful in a meaningful way.
In this paper, we describe the design and implementation of iTrack - a framework for monitoring user activities and correlating them with system data. iTrack makes use of commonly available native monitoring and diagnostic utilities on operating systems to monitor systems events as well as system admin activity, correlates these two sets of information and categorizes the activity as potentially abnormal or harmful based on its impact on the system in terms of file system, network and process activities. We demonstrate the usefulness of iTrack through several use cases and real world examples such as detecting and diagnosing system outages in real time, conducting post mortem analysis of outages, and maintaining audit logs. Our experimental evaluation of iTrack confirms that its monitoring overhead in terms of CPU time, activity completion time and data generated is within the tolerance range of most production systems. In cases, where the overhead was found to be unacceptable, we detect the underlying cause and provide solutions. These solutions improve performance by up to 20% to 90%, in terms of managed server and iTrack server CPU utilization, respectively and by up to 2 times in terms of completion time of certain system admin activities on the managed server.