System and Network Monitoring
INTRODUCTION
As computers have gotten smaller and networks have gotten bigger, most of us have found ourselves worrying about more and more machines and network devices. In the old days, the typical installation of a small number of central servers, a larger number of ASCII terminals, and a few point-to-point serial or network links meant that a lot of system monitoring could be handled by periodic manual inspection or a few shell scripts, cron jobs, and mail messages.
These days, when there seems to be at least one server for each possible function; when everyone has a machine with what used to be thought of as major processing power on their desk; when networks are bigger, more complicated, and smarter; and when everything from modems to printers to soda machines is network-connected, local monitoring just isn't enough. Virtually every site needs some form of distributed or network-based monitoring mechanism, if only to have some ability to keep track of the worst problems.
In this article I'll discuss what monitoring is, along with where and why you might want to use it, typical components of a monitoring system, and some criteria against which to measure different monitoring systems and tools.
This is the first article in a planned series on system and network monitoring. Future articles will examine a variety of monitoring software packages, measure them against the evaluation criteria, and attempt to discuss the pros and cons of each and identify where the software might be most appropriately used. I'll primarily be examining open source and freely available software, but I'll also try to cover some commercial packages.
And, for the benefit of those who are unfamiliar with professional concert sound-reinforcement systems, I promise that I'll try to avoid attempting weak puns about asking for more SNMP in the monitors.
What Is Monitoring?
Monitoring is primarily intended to identify what has gone wrong or is about to go wrong. In general, monitoring systems can be thought of as having four components:
Data collection and/or generation
Data logging or storage
Analysis, comparison, or evaluation
Reporting and exception alerting
Basically, you collect some data points, stash them somewhere, compare them against established limits or failure indicators, and raise a flag if something's wrong.
That is a bit of a simplification, but the basic truth is there. Fortunately, most monitoring systems are a little more sophisticated than the bare-bones description above.
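As a rough illustration of how those four pieces fit together, here is a minimal sketch in Python; the log file location, the 90% threshold, and the list of filesystems are all invented for the example, and a real system would page or mail someone rather than print:

    #!/usr/bin/env python3
    # A rough sketch of the four components: collect, log, evaluate, report.

    import shutil
    import time

    LOGFILE = "/var/tmp/monitor.log"        # assumed location for the flat-file log
    THRESHOLD = 90                          # complain when a filesystem is over 90% full
    FILESYSTEMS = ["/", "/var", "/home"]    # invented list of filesystems to watch

    def collect(path):
        """Data collection: percentage of the filesystem in use."""
        usage = shutil.disk_usage(path)
        return 100.0 * usage.used / usage.total

    def store(record):
        """Data logging: append one data point to a flat file."""
        with open(LOGFILE, "a") as log:
            log.write(record + "\n")

    def evaluate(percent):
        """Analysis: a simple good/no-good decision against a fixed limit."""
        return percent < THRESHOLD

    def report(path, percent):
        """Reporting: print an alert; a real system would page or mail someone."""
        print(f"ALERT: {path} is {percent:.0f}% full")

    for fs in FILESYSTEMS:
        pct = collect(fs)
        store(f"{int(time.time())} {fs} {pct:.1f}")
        if not evaluate(pct):
            report(fs, pct)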
Data Collection
Data collection usually takes a few different forms, most of which can be classified as some sort of probe. Some examples of common probes are:
ICMP pings to indicate network connectivity
Simple port probes (e.g., does a TCP/IP connection to port 80 succeed?)
SNMP queries to determine specific states or activity levels
(SNMP is, of course, the Simple Network Management Protocol -- for an introduction to SNMP, see Elizabeth Zwicky's articles in recent issues of ;login:.)
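To make the first two concrete, here's what a simple port probe and a reachability check might look like in Python. The hostnames are placeholders, and the ping check just shells out to the system ping command, since building ICMP packets directly requires raw sockets and root privileges:

    import socket
    import subprocess

    def port_probe(host, port, timeout=5):
        """Simple port probe: does a TCP connection to the port succeed?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def ping(host):
        """Reachability via the system ping command (one packet).
        The exact flags vary a little between ping implementations."""
        result = subprocess.run(["ping", "-c", "1", host],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        return result.returncode == 0

    # Placeholder hosts, for illustration only.
    print("web server port 80:", "up" if port_probe("www.example.com", 80) else "down")
    print("router reachable:", "yes" if ping("gw.example.com") else "no")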
Data points can also be generated and submitted by a system or network element to the monitoring system, through SNMP traps, mail, or some other form of network connection. An obvious example of this is the use of a centralized syslog host that receives syslog messages from various hosts on the network.
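Python's standard logging module can forward messages to a central syslog host, so a sketch of the "submit to a central collector" approach might look like this; the host name is a placeholder, and UDP port 514 is the traditional syslog port:

    import logging
    import logging.handlers

    # Send log records to a central syslog host over UDP (the default transport).
    handler = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
    logger = logging.getLogger("monitor")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.warning("disk /var on %s is 95%% full", "host42.example.com")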
Data Logging
In many cases, the collected data is logged in a fairly basic way, often through syslog or some flat file. Some systems log every data point they receive or generate; some log only the "interesting" ones.
More sophisticated logging mechanisms can make it easier to identify trends or multiple failures that are due to a single cause (such as a loss of connectivity that's due to a router or communications link failure). Logging mechanisms such as relational databases can add some complexity, but can also make certain kinds of reporting easier and more effective.
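As a sketch of the database approach, the standard sqlite3 module is enough to show the idea; the schema, file name, and sample data point here are invented for the example:

    import sqlite3
    import time

    db = sqlite3.connect("/var/tmp/monitor.db")
    db.execute("""CREATE TABLE IF NOT EXISTS datapoints (
                      ts      INTEGER,   -- Unix timestamp of the sample
                      element TEXT,      -- host or device being monitored
                      metric  TEXT,      -- what was measured
                      value   REAL)""")

    def record(element, metric, value):
        """Log one data point."""
        db.execute("INSERT INTO datapoints VALUES (?, ?, ?, ?)",
                   (int(time.time()), element, metric, value))
        db.commit()

    record("host42", "disk_used_pct:/news", 87.5)

    # Pulling back an ordered history like this is what makes trend
    # and outage reporting easier than grepping a flat file.
    for ts, value in db.execute("""SELECT ts, value FROM datapoints
                                   WHERE element = 'host42'
                                     AND metric = 'disk_used_pct:/news'
                                   ORDER BY ts"""):
        print(ts, value)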
Analysis
In most cases, monitoring analysis takes the form of an immediate, real-time, good/no-good decision. For example, failed pings would normally be assumed to indicate a machine or network connection that's down, and a disk that reports 99% full may call for some attention.
Some systems can correlate multiple failures that are due to a single cause (that communications link mentioned above), and some can react or escalate after multiple consecutive failures (e.g., call your boss if the Web server doesn't respond for the third time).
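A sketch of the "react after multiple consecutive failures" idea, with the limit and the escalation action both invented for the example:

    FAILURE_LIMIT = 3      # escalate on the third consecutive failure
    failures = {}          # consecutive-failure count for each monitored element

    def handle_result(element, ok):
        """Track consecutive failures and escalate when the limit is reached."""
        if ok:
            failures[element] = 0
            return
        failures[element] = failures.get(element, 0) + 1
        if failures[element] == FAILURE_LIMIT:
            escalate(element)

    def escalate(element):
        # In real life this might page the on-call person (or your boss).
        print(f"ESCALATE: {element} has failed {FAILURE_LIMIT} checks in a row")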
The other type of analysis that is sometimes overlooked is trend reporting -- if you can notice trends while they are happening, you'll have a better chance of adding more disk space to the /news partition before you run into major problems. I haven't (yet) seen this myself, but it would be nice, in a twisted sort of way, to get an automated message noting that disk use has been increasing, and that if current trends continue, the disk will be full in 42 hours.
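The arithmetic behind such a message is simple enough; here's a naive linear extrapolation from two samples (a real trend report would want more samples and a better fit):

    def hours_until_full(t0, pct0, t1, pct1):
        """Given two (timestamp, percent-used) samples, estimate hours until 100%.
        Timestamps are in seconds; returns None if usage isn't growing."""
        rate = (pct1 - pct0) / ((t1 - t0) / 3600.0)   # percentage points per hour
        if rate <= 0:
            return None
        return (100.0 - pct1) / rate

    # 82% full two days ago, 91% full now: full in about 48 hours.
    print(hours_until_full(0, 82.0, 48 * 3600, 91.0))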
In general, better analysis gets you better results, but you'll likely pay for it in increased complexity and increased cost.
Reporting
Probably the first type of reporting that people think of in relation to monitoring systems is alpha or numeric pager messages (which of course always come at the worst possible time). But there is far more to a proper reporting system.
Reporting is generally concerned with three types of information:
Exceptions -- problems that should be reported in some form of "alert" for investigation, action, and resolution.
History -- specific data for specific time periods, for such uses as traffic-level and outage reporting, as well as usage or capacity-based billing.
Trends -- (typically aggregated) data used for trend analysis and capacity planning.
In general, two styles of exception reporting are used with monitoring systems: report everything, or report only the problems. And, furthering that distinction, do you report events only as they happen, or do you report on the current state, identifying all "unresolved" issues?
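The difference between the two styles can be sketched in a few lines of Python; the element names and the reporting functions here are just for illustration:

    current_state = {}      # element -> "OK" or "FAILED"

    def update(element, ok):
        """Event-style: announce a problem once, when the state changes."""
        previous = current_state.get(element, "OK")
        current_state[element] = "OK" if ok else "FAILED"
        if previous == "OK" and not ok:
            print(f"EVENT: {element} just went down")

    def state_report():
        """State-style: list everything that is currently unresolved."""
        broken = sorted(e for e, s in current_state.items() if s == "FAILED")
        if broken:
            print("Unresolved problems:", ", ".join(broken))
        else:
            print("All monitored elements OK")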
...