“They lost a horseshoe because of a missing nail, they lost their horse because of the lost horseshoe, a messenger did not deliver a message because of the lost horse, they lost the war because of the undelivered message…”
taken from a Japanese parable
Infrastructure monitoring is a critical task for any system administrator. Universal monitoring systems such as Zabbix are most often used to collect information from various devices. The data can be visualized with the built-in tools, but these are rather difficult to customize, so a third-party tool like Grafana is often used instead. Zabbix notifications, in turn, are usually delivered by email or through integration with a corporate messenger such as Slack.
In any system, even a well-established one, there are critical nodes whose failure can lead to disastrous consequences. Monitoring these nodes lets you predict failures or make the right decisions when something goes wrong. The paradox is that the monitoring system itself is a critical node.
It is clear that making such a node redundant is the first thing to think about. Fortunately, Zabbix supports proxies that collect data on behalf of a centralized monitoring server. However, the resiliency work does not end there. Under certain conditions, a communication channel can become the weak link, even if it has a backup.
Most companies abandoned the concept of a “server room in the office” long ago and now place their equipment in data center racks. This eliminates many problems with maintaining a suitable climate for the equipment and with loss of connectivity or power, but not everything is as perfect as it might seem. Today we will talk about three cases in which the monitoring system cannot notify the system administrator.
The IT infrastructure considered here is a server rack with equipment located in a Tier III data center, that is, all systems are redundant at the provider level.
Possible problems
Lack of connectivity
Picture 1 — Fault at the edge router
Most often, an edge router is installed to provide a perimeter between the WAN and the local area networks. Its task is simple: to provide the necessary devices with Internet access and, if the provider’s main communication channel is unavailable, to switch to the backup channel.
What happens if the monitoring server is inside the perimeter and the edge router fails? In this case, monitoring messages will not reach the system administrator. From the outside, the failure will be visible as a loss of connectivity with the rack, but finding the true cause will require involving the provider’s technical support, whose engineers can inspect the physical condition of the equipment. No action can be taken until it is clear what happened.
DDoS attack
Picture 2 — sending notifications bypassing the main communication channel
As with a failure of the edge router, a DDoS attack aimed at saturating the communication channel leads to complete or partial unavailability of the service. Under such a “massive bombardment” with packets, notifications from the monitoring system may arrive with a delay or not arrive at all. All this time the service will de facto be down, and the system administrator will need to take control of the situation.
The solution can be either to temporarily route all traffic to a black hole or to switch on attack protection. The faster this is done, the less downtime there will be. Correct monitoring configuration plus timely notifications will allow you to minimize the damage from the attack and take the right action.
Unavailability of the mail server
Picture 3 — sending notifications when the main mail server is not available
In the majority of cases, notifications from monitoring systems are delivered by e-mail. This is convenient because notifications go straight to the responsible people, and events can be reconstructed from the mailbox history to within a second. Under what conditions will Zabbix fail to deliver notifications? This can happen if the mail server stops working for some reason.
In spite of the popularity of third-party mail services, many system administrators prefer to run their own mail servers within the organization, and these are what notification delivery is most often tied to.
What is missing here?
In all three scenarios described, the problems arise from the absence of a so-called witness node that does not depend on the communication channels of the data center provider. One interesting option for such a node is the NetPing Monitoring Solution GSM3G R61.
Picture 4 — the external view of the device
The device has a 1U form factor and is designed for installation in a standard 19" rack. Its main task is to monitor the status of various sensors.
This equipment is ideal for monitoring the status of a server rack, and its GSM modem lets it deliver SMS notifications without relying on an Internet channel, but its capabilities are not limited to this. Beyond the standard functions, one of the device’s undocumented features is sending e-mail notifications over a backup GPRS channel provided by the built-in GSM modem.
Customization and features
First of all, let us define the logic of interaction. When a certain event occurs, a trigger in Zabbix should call a custom notification script. This script accesses the /at.html page of the device web interface and sequentially executes commands to establish a connection, send a message containing the notification, and close the connection.
A NetPing device interacts with a GSM modem using AT commands, which makes it easy to create a bash script that sends commands using curl.
Let us look at the form on the /at.html page:
<form name="frm">
<input name="cmd"/>
<input type="button" value="Send" onclick="send()"/>
</form>
Everything is simple here: to send a command, we need to run:
curl --user <login>:<password> -d "cmd=<AT-command>" http://<address of the device>/at.html
The response to the executed command is returned by a CGI script:
curl --user <login>:<password> http://<address of the device>/atget.cgi
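The two curl calls above can be wrapped in small shell helpers so that a script does not have to repeat the credentials and the address. This is only a minimal sketch: the function names at_send and at_read and the NETPING_* variables are made up for illustration, and the default values are placeholders.

```shell
# Hypothetical helpers around the two requests described in the article.
# NETPING_HOST, NETPING_USER and NETPING_PASS are assumed names with placeholder values.
NETPING_HOST="${NETPING_HOST:-192.0.2.10}"
NETPING_USER="${NETPING_USER:-login}"
NETPING_PASS="${NETPING_PASS:-password}"

# Submit one AT command to the modem through the /at.html form.
# The command is passed with -d/--data exactly as in the article; whether the
# device expects URL-encoded input is not stated there.
at_send() {
    curl --silent --show-error --max-time 10 \
         --user "${NETPING_USER}:${NETPING_PASS}" \
         --data "cmd=$1" \
         "http://${NETPING_HOST}/at.html"
}

# Fetch the modem's reply to the last command from the CGI script.
at_read() {
    curl --silent --show-error --max-time 10 \
         --user "${NETPING_USER}:${NETPING_PASS}" \
         "http://${NETPING_HOST}/atget.cgi"
}
```

With these in place, a single exchange might look like at_send 'AT+SAPBR=1,1' followed by at_read. Whether the device needs a short pause between the two calls is not stated here, so the timing is something to verify on real hardware.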
Now we only need to work out which AT commands to use. Let us divide them into two groups: connecting to the Internet and sending a message.
Connecting to the Internet
Before establishing the connection we need to set a few variables:
- set the connection type: AT+SAPBR=3,1,"CONTYPE","GPRS"
- set the access point name: AT+SAPBR=3,1,"APN","<APN of the mobile operator>"
- set the user name: AT+SAPBR=3,1,"USER","<user name>"
- set the password: AT+SAPBR=3,1,"PWD","<password>"
Now we can open the connection with the command:
AT+SAPBR=1,1
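Because most of these commands contain double quotes, it is easy to get the shell quoting wrong. One possible approach, shown here as a sketch, is a function that prints the commands in the order they should be sent. The function name gprs_open_commands and the example APN and credentials are hypothetical.

```shell
# Print the AT commands that configure and then open GPRS bearer profile 1.
# Arguments: APN, user name, password (all three depend on the mobile operator).
gprs_open_commands() {
    printf 'AT+SAPBR=3,1,"CONTYPE","GPRS"\n'
    printf 'AT+SAPBR=3,1,"APN","%s"\n'  "$1"
    printf 'AT+SAPBR=3,1,"USER","%s"\n' "$2"
    printf 'AT+SAPBR=3,1,"PWD","%s"\n'  "$3"
    printf 'AT+SAPBR=1,1\n'
}

# Example with made-up values: list the commands for one operator profile.
gprs_open_commands "internet.example" "gprsuser" "gprspass"
```

Each printed line can then be passed as the cmd value to /at.html with curl, as shown earlier.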
Sending Email
As with setting up a connection, we define the variables first:
- set the CID parameter for the e-mail session: AT+EMAILCID=1
- set the timeout for the SMTP server: AT+EMAILTO=30
- specify the address and port of the mail server: AT+SMTPSRV="<mail server address>",<port>
- provide the credentials for the mail server: AT+SMTPAUTH=1,"<user name>","<password>"
- specify the sender's e-mail address, with the device name as the sender name: AT+SMTPFROM="<e-mail address>","<device name>"
- specify the recipient's e-mail address and name: AT+SMTPRCPT=0,0,"<e-mail address>","<recipient name>"
- set the subject of the message: AT+SMTPSUB="<subject>"
- set the text of the notification: AT+SMTPBODY="<text>"
Send the message:
AT+SMTPSEND
Close the connection:
AT+SAPBR=0,1
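Putting the whole sequence together, a notification script could look roughly like the sketch below. It assumes the script receives the recipient, subject and message as its three arguments, which is a common way to configure a Zabbix script media type, and that the device accepts each command as a POST to /at.html, as described above. Every variable name, every default value and the pause between commands are illustrative assumptions. The sketch only checks that the HTTP request succeeded, not what the modem answered.

```shell
# Illustrative settings; all names and defaults here are assumptions.
NETPING_URL="${NETPING_URL:-http://192.0.2.10}"   # device address (placeholder)
NETPING_AUTH="${NETPING_AUTH:-login:password}"    # web interface credentials
GPRS_APN="${GPRS_APN:-internet.example}"
GPRS_USER="${GPRS_USER:-user}"
GPRS_PASS="${GPRS_PASS:-password}"
SMTP_HOST="${SMTP_HOST:-mail.example.com}"
SMTP_PORT="${SMTP_PORT:-25}"
SMTP_USER="${SMTP_USER:-alerts@example.com}"
SMTP_PASS="${SMTP_PASS:-secret}"
AT_DELAY="${AT_DELAY:-2}"                         # assumed pause between commands, seconds

# Send one AT command through /at.html, then give the modem time to react.
at_cmd() {
    curl --silent --max-time 15 --user "$NETPING_AUTH" \
         --data "cmd=$1" "$NETPING_URL/at.html" </dev/null || return 1
    sleep "$AT_DELAY"
}

# Make a value safe for a quoted, single-line AT parameter:
# drop double quotes and turn line breaks into spaces.
at_clean() {
    printf '%s' "$1" | tr -d '"' | tr '\r\n' '  '
}

# Arguments: recipient address, subject, message text.
send_alert_mail() {
    to=$(at_clean "$1"); subject=$(at_clean "$2"); body=$(at_clean "$3")
    status=0
    # Commands are sent in the order given in the article; stop on the first HTTP error.
    while IFS= read -r cmd; do
        at_cmd "$cmd" || { status=1; break; }
    done <<EOF
AT+SAPBR=3,1,"CONTYPE","GPRS"
AT+SAPBR=3,1,"APN","$GPRS_APN"
AT+SAPBR=3,1,"USER","$GPRS_USER"
AT+SAPBR=3,1,"PWD","$GPRS_PASS"
AT+SAPBR=1,1
AT+EMAILCID=1
AT+EMAILTO=30
AT+SMTPSRV="$SMTP_HOST",$SMTP_PORT
AT+SMTPAUTH=1,"$SMTP_USER","$SMTP_PASS"
AT+SMTPFROM="$SMTP_USER","NetPing"
AT+SMTPRCPT=0,0,"$to","$to"
AT+SMTPSUB="$subject"
AT+SMTPBODY="$body"
AT+SMTPSEND
EOF
    at_cmd 'AT+SAPBR=0,1'   # close the connection even if an earlier step failed
    return "$status"
}
```

A script file built this way would end with send_alert_mail "$1" "$2" "$3". On real hardware the reply from /atget.cgi should be checked after each step, and a longer pause after AT+SMTPSEND may be needed before the connection is closed.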
The notification reaches the system administrator even if both the main and the backup communication channels in the data center are unavailable.
Summary
We have considered three problems that can prevent messages from the monitoring system from reaching the system administrator. Because the NetPing Monitoring Solution GSM3G R61 operates independently of the infrastructure provider’s communication channels, it handles all three and can report a failure by e-mail immediately.
Placed in a server rack, this device gives more substantial control over the situation and provides the following benefits:
- determining the source of the problem (whether the provider is to blame for the failure or the problem is local);
- more precise planning of the RTO (the period of time during which the system will be unavailable);
- a clear understanding of the condition of the equipment, without involving the technical support of the infrastructure provider.
The sooner a system administrator receives all of the information presented above, the sooner the problem will be resolved, and the business will receive objective forecasts of service recovery times.