I wrote previously about performance monitoring in .NET, but haven’t touched on the monitoring of Linux machines. There are a huge number of monitoring systems for Linux, but as with .NET performance monitors there isn’t much discussion of how they operate. This article points to a few of the bigger monitoring systems and looks at how they obtain their results.
The paper A Taxonomy of Grid Monitoring Systems presents a good overview of existing systems and their architectures – most of which are designed to be run over large clusters, rather than individual machines.
If you’re looking to use a monitoring system I’d advise reading this paper. Most systems look to monitor the same aspects of system performance, but differ in their methods of storing data and handling distribution. Consequently there isn’t any one system which provides an all-encompassing solution, but there should be one to fit your needs.
Another system which isn’t mentioned in the survey paper is Collectd, which is perhaps a more basic tool, but still contains enough functionality for most people. The best thing about Collectd (from my perspective) is its documentation. Their website lists the aspects of the system that can be monitored, and goes as far as stating what Linux functions (or files) are used to obtain the measurements. If, like me, you’re looking to do some basic low-level measurements on your own this is incredibly useful.
If you’ve looked at the links above and concluded that the frameworks are too complex for your needs then I’d recommend looking at how these systems obtain their measurements – they generally all do the same thing. The source code for Collectd is the most readable, with each aspect (Disk, CPU, etc.) being split into separate files, largely free from verbose middleware code.
In Linux most stats can be found in the /proc folder – a pseudo-file system residing in virtual memory. Because it acts as a file system you can use ls to browse the available information.
This Red Hat page goes through many of the interesting files in /proc. Of these the following are of most interest:
- /proc/cpuinfo – Identifies the type and speed of the processors in this machine.
- /proc/loadavg – Provides load averages for 1, 5, and 15 minute periods. For many applications this may be all you need to monitor, as it gives a good idea of the system’s load without using any intrusive performance metrics.
- /proc/meminfo – Provides information on the system’s memory usage, including the total amount of memory available and the total amount currently free.
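Since these are just text files, reading them takes only a few lines of code. The sketch below (assuming a standard Linux /proc layout; the function names are my own) pulls the load averages out of /proc/loadavg and the total and free memory out of /proc/meminfo:

```python
# Read the 1, 5, and 15 minute load averages from /proc/loadavg.
# The first three whitespace-separated fields are the averages.
def read_loadavg():
    with open("/proc/loadavg") as f:
        fields = f.read().split()
    return tuple(float(x) for x in fields[:3])

# Read total and free memory (in kB) from /proc/meminfo.
# Each line looks like "MemTotal:     16367028 kB".
def read_meminfo():
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, _, rest = line.partition(":")
            info[key] = int(rest.split()[0])
    return info["MemTotal"], info["MemFree"]
```

Both functions return plain numbers, so they drop straight into whatever logging or alerting you already have.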
CALCULATING CPU UTILIZATION
Sadly none of the files in /proc reports CPU utilization directly – /proc/stat reports the cumulative time the processor has spent in user time, nice time, kernel time, and idle time respectively, and is easy to parse. If these raw counters aren’t sufficient you can implement a function to calculate the percentage CPU utilization over a given period, or use a program which already does this.
This CyberCiti post provides a good overview of programs that already do this. I found these to be slightly excessive for my own needs, so wrote something myself. I’ll try to post that at a later date.
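To give an idea of the calculation, here is a minimal sketch (my own illustration, not the code mentioned above): take two samples of the counters on the first line of /proc/stat, and report the share of elapsed time that was not idle.

```python
import time

def read_cpu_times():
    # The first line of /proc/stat aggregates all CPUs:
    # "cpu  user nice system idle iowait irq softirq ..."
    with open("/proc/stat") as f:
        fields = f.readline().split()
    return [int(x) for x in fields[1:]]

def cpu_utilization(interval=0.5):
    # Sample twice; the counters only ever increase, so the
    # difference gives the time spent in each mode over the interval.
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    idle = deltas[3]  # fourth column is idle time
    return 100.0 * (total - idle) / total
```

The longer the interval between samples, the smoother (and less responsive) the resulting figure – the same trade-off the 1/5/15 minute load averages make.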