In this article I would like to describe my experience with, and my approach to, monitoring a server room. To start with, let us estimate the scale of the area I am responsible for. I have a small server room (two square meters) with only 2 racks: one for servers and one for network equipment. The climate is maintained by a split system with two air conditioners (a main one and a backup one), and there is a UPS. In general, it is a typical server room of a small company. Hardware: DELL and NetApp, a Cisco network and an APC UPS. The servers run Windows Server (AD, DNS, DHCP, an application server and a DB server), and PRTG Network Monitor is responsible for monitoring.
One day I received a letter notifying me that my servers had reached End-Of-Support status (5 years had passed since the purchase). Management decided not to pay for a warranty extension but to buy new hardware instead. The old hardware was to become a test environment for developers and testers. At the same time, management warned that tough times were ahead and that there would be no money for further development unless absolutely necessary. Being concerned about the durability and reliability of my servers, I brought up the issue of monitoring, not only for the equipment but for the server room as well.
I used the following arguments to persuade management to fork out a little more:
- CH 512-78, Technical requirements for buildings and rooms for installing computer equipment, particularly paragraph 3: Requirements for microclimate and noise.
- A list of services with their downtime, as well as the projects that would be affected if a production application server failed.
As a result, the options on the table were to purchase additional equipment to build a fault-tolerant solution or to buy a monitoring system for the server room. Funds were allocated, and it was agreed that critical services would be duplicated. Here, however, let us focus on monitoring the room.
What we will monitor and how
I monitor the hardware through an IPMI console: from there I get data on the status of the motherboard, processor, hard disks and so on, including the temperature inside the chassis. As for the room, I need to monitor its temperature and humidity.
To start with, let us answer the question: why do we need this?
- Temperature: the temperature reported by the hardware alone tells us little about the server room microclimate, because the temperature inside a server chassis can easily exceed 50 degrees. Even if we rely on the temperature sensors inside the hardware, an alert from the server comes at the point of no return, because a server does not heat up immediately after the air conditioner stops, so by the time it complains the room has been overheating for a while.
- Humidity: the ideal humidity for a server room is 40-60%. If humidity is below this range, electrostatic charge accumulates; if it is above this range, moisture condenses and causes oxidation, which shortens the service life of the hardware.
As for the monitoring tool, I will be using NetPing Monitoring Solution GSM3G R61.
The wiring diagram below illustrates its connections:
I do not need all of the sensors shown in the diagram; I use only the temperature and humidity sensors.
The NetPing unit itself is installed in one of the racks. Each rack has 3 temperature sensors (at the bottom front, in the middle and at the top, to avoid the tyranny of averages), and there is 1 humidity sensor between the racks.
It looks like this:
The other rack has its temperature sensors in the same positions. I configured the NetPing device to send SMS messages and to trigger an alarm. The data I receive from NetPing is processed by PRTG (how to make PRTG work properly with NetPing is a topic for a separate article). The task seems to be solved, but is it really?
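The remark about the tyranny of averages in the sensor placement can be illustrated with a small sketch. The readings below are invented for illustration and are not measurements from my racks. The point is only that the mean of three sensors can stay under a limit while the hottest one is already over it.

```python
from statistics import mean

# Hypothetical readings in degrees Celsius for one rack; the values are
# invented for illustration and are not measurements from the article.
rack_readings = {"bottom": 21.0, "middle": 24.0, "top": 31.0}

LIMIT_C = 28.0  # upper room-temperature limit used later in the article


def hottest_sensor(readings):
    """Return the (position, temperature) pair with the highest temperature."""
    return max(readings.items(), key=lambda item: item[1])


average = mean(rack_readings.values())
position, hottest = hottest_sensor(rack_readings)

print(f"average: {average:.1f} C, over limit: {average > LIMIT_C}")
print(f"hottest: {position} at {hottest:.1f} C, over limit: {hottest > LIMIT_C}")
```

With these sample values the average stays below the limit while the top sensor exceeds it, which is why each sensor is checked on its own.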
Monitoring is not only about gathering information
Properly configured alerts and graphs on a big screen are only half of the task, because the reaction to an incident is even more important. Incident management is covered in depth by ITIL and ITSM courses, so here I will only touch on it as it applies to this particular monitoring setup.
To begin with, let us define the metrics. What range is acceptable? Past which threshold, and after how long, should each notification fire?
I set the upper limit for the room temperature to 28 degrees Celsius (82 degrees Fahrenheit). When this temperature is reached, PRTG sends me an email 5 minutes after detecting it. I chose this value because 28 degrees in the room leads, within 15-20 minutes, to 51 degrees Celsius (124 degrees Fahrenheit) inside the server chassis. This is not a critical temperature for the hardware, but it shortens its service life. If the temperature goes on to reach 30 degrees Celsius (86 degrees Fahrenheit) 5 minutes later, an alarm sounds in the room and SMS messages are sent to me, my colleague and my boss.
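The temperature escalation can be sketched roughly as follows. This is one possible reading of the rules, not the actual PRTG or NetPing configuration. It assumes that the email fires once the temperature has stayed at or above 28 degrees for 5 minutes and that the alarm and SMS fire as soon as 30 degrees is seen. The class and action names are hypothetical.

```python
from datetime import datetime, timedelta

WARN_C = 28.0                      # email threshold from the article
CRIT_C = 30.0                      # alarm and SMS threshold from the article
WARN_DELAY = timedelta(minutes=5)  # delay before the email is sent


class TemperatureEscalation:
    """Tracks how long the temperature has been at or above WARN_C."""

    def __init__(self):
        self.warn_since = None  # time the warning threshold was first crossed

    def update(self, temp_c, now):
        """Return the list of actions due for this reading.

        Actions are repeated on every reading while the condition holds;
        de-duplicating them is left to the caller.
        """
        if temp_c < WARN_C:
            self.warn_since = None  # back to normal, reset the timer
            return []
        if self.warn_since is None:
            self.warn_since = now
        actions = []
        if now - self.warn_since >= WARN_DELAY:
            actions.append("email")
        if temp_c >= CRIT_C:
            actions.extend(["alarm", "sms"])
        return actions


# Usage with an arbitrary start time chosen only for the example.
start = datetime(2024, 1, 1, 12, 0)
escalation = TemperatureEscalation()
first = escalation.update(28.5, start)
second = escalation.update(29.0, start + timedelta(minutes=5))
third = escalation.update(30.2, start + timedelta(minutes=10))
print(first, second, third)
```

Keeping the timer in one small object makes it easy to run one instance per sensor, so a single hot spot is not hidden by the others.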
Humidity is a little more complicated. We have two thresholds here: below 35% and above 65%. No alarm is sounded in this case; only an SMS is sent and a warning appears in the monitoring system.
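As a minimal sketch, the two humidity bands can be written as a single classification function. The function name and the "drifting" label for the zone between the ideal range and the alert thresholds are my own illustrative choices and do not correspond to any NetPing or PRTG setting.

```python
HUMIDITY_IDEAL = (40.0, 60.0)  # ideal range quoted earlier in the article
HUMIDITY_ALERT = (35.0, 65.0)  # SMS thresholds from the paragraph above


def classify_humidity(rh_percent):
    """Return 'ok', 'drifting' or 'alert' for a relative humidity reading."""
    low_alert, high_alert = HUMIDITY_ALERT
    low_ideal, high_ideal = HUMIDITY_IDEAL
    if rh_percent < low_alert or rh_percent > high_alert:
        return "alert"     # SMS plus a warning in monitoring, no alarm
    if rh_percent < low_ideal or rh_percent > high_ideal:
        return "drifting"  # outside the ideal range, not yet alert-worthy
    return "ok"


for reading in (50.0, 37.0, 34.0, 66.0):
    print(reading, classify_humidity(reading))
```

The gap between the ideal range and the alert thresholds gives a margin, so small fluctuations around 40% or 60% do not produce a stream of SMS messages.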
The threshold settings in NetPing are shown below:
The graphs from the sensors after connecting them to PRTG are shown below:
Configuring a report for the humidity sensors:
A reaction to incidents
Unfortunately, we soon had an occasion to put the monitoring to use: one of the air conditioners failed and the other one did not turn on. We discovered that there was no power to the split system and called building maintenance. The issue was solved. Luckily for us, it all happened on a working day. But what would we have done if an air conditioner had failed on a weekend?
After an internal discussion and negotiations with the business centre administration, we agreed on the following procedure. Whenever critical values of humidity or temperature are reached, the infrastructure engineers and building maintenance receive an email. The engineers and the IT leaders receive an SMS message. In addition, the same message goes to the on-duty phone of the technical service. The IT engineers then immediately contact the technical service to solve the issue. The IT engineers also have to decide in advance who will be the first to react to an incident outside office hours (that is, who will go to the office if necessary).
We stick to this procedure because our system administrators work regular office hours and we have no duty shifts. If we had a duty shift, responsibility for the reaction would lie with it.
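The notification procedure above can be written down as a small routing table, which makes it easier to review who gets what. The group keys and channel names here are illustrative labels, not identifiers from PRTG or NetPing.

```python
# Recipient groups and channels as described in the procedure above.
# The key and channel names are illustrative labels only.
ROUTING = {
    "infrastructure_engineers": ["email", "sms"],
    "building_maintenance": ["email"],
    "it_leaders": ["sms"],
    "technical_service_duty_phone": ["sms"],
}


def notifications_for_critical_event(message):
    """Expand one critical event into (group, channel, message) tuples."""
    return [
        (group, channel, message)
        for group, channels in ROUTING.items()
        for channel in channels
    ]


deliveries = notifications_for_critical_event("Server room: temperature critical")
for group, channel, text in deliveries:
    print(f"{channel:5} -> {group}: {text}")
```

Who reacts first outside office hours is deliberately left out of the table, since that is agreed among the engineers in advance rather than derived from the alert itself.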
In addition, we collect room temperature data to identify trends (for example, whether the cooling capacity of the air conditioners has degraded, or whether humidity is gradually increasing).
Such reports can later be used to go to our administration and request a new air conditioner or a ventilation system.
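One simple way to spot such a trend is to fit a straight line to daily averages and look at its slope. The sketch below uses an ordinary least-squares fit. The sample values are invented for illustration, and in practice the data would come from whatever history the monitoring system exports.

```python
def slope_per_day(samples):
    """Least-squares slope of (day, value) samples, in value units per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("all samples share the same day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


# Hypothetical daily average room temperatures in degrees Celsius.
daily_avg = [(0, 22.0), (1, 22.1), (2, 22.2), (3, 22.3), (4, 22.4)]

trend = slope_per_day(daily_avg)
print(f"trend: {trend:+.2f} degrees per day")
```

A small but steady positive slope over several weeks, with an unchanged server load, is the kind of evidence that supports a request for air conditioner maintenance or replacement.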