Effective Application Alerting

Tim Costa

30 Aug 2016

Application downtime is an inevitable reality; even companies like Facebook, Apple, and Google suffer from instability at times. Alerts are the first line of defense against downtime, and effective alerts can be the difference between a full outage and catching an issue before it spirals out of control.

Alerts are typically managed with a service and consist of two parts: the trigger and the response. Let's dive into common triggers, responses, and services we can use to manage alerts.

Triggers

Here are some common triggers and scenarios in which you would use them.

Response Time

This is a common trigger for marketing sites, where fractions of a second can make the difference between a user converting to a lead or becoming just another bounce. Services like NewRelic let you specify a threshold that your response times should stay under, and can trigger warnings or alerts when response times exceed it.
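As a rough sketch of the idea (not how NewRelic does it), a threshold check can be approximated with curl's `%{time_total}` write-out variable. The function names and the 0.5 second threshold below are my own placeholders.

```bash
# Hypothetical helper: print the total request time in seconds for a URL.
# %{time_total} is a real curl --write-out variable.
measure_response_time() {
    curl -o /dev/null -s -w '%{time_total}' "$1"
}

# Hypothetical helper: succeed (exit 0) when the measured time is above
# the threshold. awk does the comparison because sh has no float math.
exceeds_threshold() {
    awk -v t="$1" -v max="$2" 'BEGIN { exit !(t + 0 > max + 0) }'
}

# Example usage (URL and threshold are illustrative only):
#   t=$(measure_response_time localhost:8000/api/health)
#   if exceeds_threshold "$t" 0.5; then echo "response time warning: ${t}s"; fi
```

A single sample is noisy, so a real check would likely average several measurements before alerting.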

Status Code

Every request to a web server is answered with a status code that tells the client some basic information about the response before it parses the actual data. Generally, a 2XX response code means success; 3XX means it's a redirect and you can't necessarily determine success; 4XX means you sent the server some bad data, whether in the body of the request or in your authentication headers; and 5XX means something went wrong server-side.
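The status code classes above can be sketched as a small shell function. The function name and the labels it prints are my own, chosen only for illustration.

```bash
# Hypothetical helper: map an HTTP status code to the classes described above.
status_class() {
    case "$1" in
        2[0-9][0-9]) echo "success" ;;
        3[0-9][0-9]) echo "redirect" ;;
        4[0-9][0-9]) echo "client_error" ;;
        5[0-9][0-9]) echo "server_error" ;;
        *)           echo "other" ;;
    esac
}

# Example usage; %{http_code} is a real curl --write-out variable:
#   status_class "$(curl -s -o /dev/null -w '%{http_code}' localhost:8000/api/health)"
```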

This is among the most basic of triggers. You generally want to monitor a dedicated health route that checks connections to the databases and external services you rely on and returns an appropriate status code. Here's an example of that for a Hapi.js-based Node server, written in CoffeeScript:

os = require "os"
# Assumption: Request is the `request` HTTP client; cache, Information, and
# server are defined elsewhere in the application.
Request = require "request"

server.route
	method: "GET"
	path: "/api/health"
	config:
		auth: false # The health route must be reachable without credentials
		handler: (req, reply) ->
			health =
				cache: cache.isReady() # Redis connection check
				loadAvg: os.loadavg()
				hostUptime: "#{os.uptime()} seconds"
				server: server.load # Basic node server stats
			# This select checks to see if any database connections are alive
			Information.knex.raw("select 1 as dbIsUp").asCallback (err, result) ->
				health.database = not err?
				# External service check
				Request {
					method: "GET"
					url: "#{process.env.EXT_ROOT}/api/health"
				}, (err, resp, body) ->
					# A bare try leaves a non-JSON body (e.g. an HTML error page) as-is
					if typeof body is 'string'
						try
							body = JSON.parse body
					health.EXT_PING = body
					health.EXT_ERR = err or body if err? or resp.statusCode >= 300
					response = reply health
					# Any failed dependency turns the whole check into a 500
					if not (health.database and health.cache and not health.EXT_ERR)
						return response.code 500
					return response.code 200

End to End Tests

E2E tests are also effective triggers. You can schedule your tests to run on an interval and trigger an alert if any of your workflows do not pass their test suite.
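A minimal sketch of running a suite on an interval, assuming cron as the scheduler and a test command that exits non-zero on failure. The `send_alert` hook, the script path, and the `npm test` command are all hypothetical placeholders.

```bash
# Hypothetical alert hook; here it only writes to stderr. A real one would
# post to whatever alerting service you use.
send_alert() {
    echo "ALERT: $*" >&2
}

# Run the given test command; alert and return 2 if it fails.
run_e2e_check() {
    if "$@" >/dev/null 2>&1; then
        echo "e2e ok"
    else
        send_alert "end-to-end suite failed: $*"
        return 2
    fi
}

# Example crontab entry (every 15 minutes), assuming this file is saved as
# /usr/local/bin/e2e-check.sh and ends with `run_e2e_check npm test`:
#   */15 * * * * /usr/local/bin/e2e-check.sh
```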

Responses

There are really only two categories of response: warning and critical.

Warning

These responses are usually posted to Slack or emailed to the team responsible for the application. Warnings are useful when a value (such as response time) exceeds a threshold, but is not yet indicative of a problem. This is a good early warning system, but does not necessarily need to be responded to immediately.

Critical

Critical alerts should mean that an application has downtime and that the alert needs to be responded to immediately. Use these alerts when your health check starts returning a 5XX status code or your average response time jumps an order of magnitude. These alerts generally trigger a service like PagerDuty to call the on-duty engineers and product managers.
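The warning and critical rules described above can be sketched as one decision function. The name `severity`, its argument order, and the sample numbers are assumptions of mine; "an order of magnitude" is read here as ten times a baseline you supply.

```bash
# Hypothetical helper: decide a severity from a health check status code and
# response times in seconds.
# Usage: severity STATUS AVG_TIME BASELINE_TIME WARN_THRESHOLD
severity() {
    status="$1"; avg="$2"; baseline="$3"; warn="$4"
    case "$status" in
        5[0-9][0-9]) echo "critical"; return ;;   # health check is failing
    esac
    awk -v a="$avg" -v b="$baseline" -v w="$warn" 'BEGIN {
        if (a + 0 >= 10 * b) { print "critical" }     # order-of-magnitude jump
        else if (a + 0 > w + 0) { print "warning" }   # over threshold, not yet an outage
        else { print "ok" }
    }'
}

# Example usage (numbers are illustrative only):
#   severity 200 0.7 0.2 0.5   # prints "warning"
```

From there, a warning would go to Slack or email and a critical would go to the paging service.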

Common Application Checking Services

These websites and services let you create triggers for different types of alerts and define the responses. They differ mostly in the granularity of the checks they support.

Pingdom

Super simple and straight to the point, Pingdom is great for simple HTTP(S) checks against a server and can look for a string of text in the response.
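To illustrate what this kind of check does (this is not Pingdom's implementation), the same idea can be approximated with curl and grep. The function names, URL, and expected string are placeholders.

```bash
# Succeeds when the text read from stdin contains the expected string.
# -F matches the string literally rather than as a regular expression.
contains_text() {
    grep -qF -- "$1"
}

# Hypothetical check: fetch a URL and look for a string in the body.
# A failed request produces no body, so the check fails in that case too.
check_page() {
    curl -fsS --max-time 10 "$1" | contains_text "$2"
}

# Example usage (URL and string are illustrative only):
#   check_page https://example.com/ "Welcome" || echo "page check failed"
```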

NewRelic

NewRelic allows you to configure alerts based on the response code from a specific URL as well as monitor the response times of individual or aggregated routes. This is great when you have routes that may take a long time (like document uploads) that you don't want to aggregate into your average application response time.
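The per-route idea can be sketched independently of any vendor: keep a separate threshold for known slow routes rather than one global number. The route paths and values below are made up for illustration and have nothing to do with NewRelic's configuration format.

```bash
# Hypothetical helper: print the response time threshold (in seconds)
# that applies to a given route.
threshold_for_route() {
    case "$1" in
        /api/documents/upload*) echo "30" ;;    # hypothetical slow upload route
        *)                      echo "0.5" ;;   # default for everything else
    esac
}

# Example usage:
#   threshold_for_route /api/health   # prints "0.5"
```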

Sensu

Sensu is great if you have the bandwidth to maintain an additional application. It allows you to create process checks to ensure that your supervisor processes are running and can run scripts that check the health of the application for you. Here's a simple bash script I wrote to exit with code 0 on a 2XX response and code 2 (which Sensu treats as critical) on anything else:

#!/bin/bash

curl -fs localhost:8000/api/health

if [ "$?" != "0" ];
then
    curl -v localhost:8000/api/health 2>&1
    exit 2
else
    exit 0
fi

This script treats anything other than a 2XX as an error, because this route never redirects and only returns 2XX or 5XX. The 2>&1 at the end of line 7 redirects stderr to stdout so Sensu is able to include it in the email alerts sent to us.

My Recommendations

I personally use a combination of all of these systems to monitor applications I write both personally and at 2U.

Pingdom is great for basic checks such as whether or not a user can connect to your application from the outside world, but it is not capable of isolating where a problem lies.

I use NewRelic extensively to monitor application response time and application load. NewRelic is also configured to hit the health check URL of the sites I maintain and trigger a critical response if the response is not a 200. NewRelic can tell me which routes are exceeding their configured thresholds and even provide insights into what part of each route is taking longer, whether it be calls to a database, cache, or external service.

Sensu is great for checking the status of applications on an individual server level, but isn't great for measuring true uptime. I say this because a lot more goes into uptime than whether your site is running on localhost. Sensu is not as good at telling you whether users around the world can connect to your server. Sensu is a great choice for monitoring the health of your physical hardware, not so much the health of your application.