Tips for Resilient API Design [2016]

Published: May 31, 2016
Brady Catherman
Sultan of Scale

I keep getting asked about my thoughts on how to make a responsive, useful API for a service. I have jotted down my notes here in an attempt to highlight the reasons behind each argument I make. Hopefully others will find this interesting (or come across it via Google).

Use headers to convey upcoming events / information

In the early days of computing, administrators displayed a message to users logging into their servers. This ‘Message Of The Day’ (MOTD) was useful for letting users know about upcoming service interruptions or maintenance. You can accomplish the same task by adding an HTTP response header with a well-defined format (plain-text UTF-8). Something like Warning: "Site will be down while we reticulate splines from 17:00 UTC to 18:00 UTC" gives callers the ability to log a warning, display something to the user, or ignore it if they are so inclined.

Another aspect of this is using a well-defined header to indicate that a call is being, or has been, removed. This allows a client to track whether they are calling deprecated or dead endpoints, which in turn allows them to update and fix things, reducing the support burden on both you and them. Define a header in your API like Deprecation: "<timestamp>", where the timestamp is seconds since the epoch, or even a fixed date. Any call can set this header, and ideally your example SDK tracks it via a counter and logs a warning where appropriate.
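Here is a minimal sketch of what that might look like as middleware in Go. The header names follow the conventions above, and the MOTD text and deprecation table are obviously placeholders, not anything standardized.

```go
// Sketch: attach MOTD-style and deprecation headers to every response.
package main

import (
	"fmt"
	"net/http"
)

// withNotices wraps a handler and adds the informational headers.
func withNotices(motd string, deprecated map[string]int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if motd != "" {
			w.Header().Set("Warning", motd)
		}
		// If this path is slated for removal, tell the caller when.
		if ts, ok := deprecated[r.URL.Path]; ok {
			w.Header().Set("Deprecation", fmt.Sprintf("%d", ts))
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, `{"ok": true}`)
	})

	motd := "Site will be down while we reticulate splines from 17:00 UTC to 18:00 UTC"
	deprecated := map[string]int64{"/v1/ping": 1464652800} // seconds since epoch

	http.ListenAndServe(":8080", withNotices(motd, deprecated, mux))
}
```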

Always return information where possible.

The HEAD method in REST is useful when the only response desired is an HTTP status code indicating success or failure. Neither the request nor the response has a body. The problem is that many HTTP clients handle HEAD requests badly. Even the venerable debugging tool curl requires the caller to use --head; otherwise curl will hang waiting for a response body that the server will never send.

One of the most common uses of a HEAD resource in REST is basic ping functionality. These get added so a client can check that its authentication credentials are correct and that the path to the API works. This is the first resource people will use, and when they inevitably forget to add --head they will get frustrated and start filing support tickets.

This might get the REST people totally fired up, but using HEAD in HTTP is kind of like writing a function that returns void in C. If you can return information you should, if for no other reason than gaining the ability to give users information later. If the ping function above returned a body then it could return information like "The account has been suspended." or "Valid credentials, wrong endpoint." These messages are useful to somebody debugging with curl or Developer Tools, since they can see the response clearly states "Your authentication has expired" rather than having to interpret some 4xx status code that may or may not make sense, or worse, no output at all in curl's default mode that hides headers. HEAD, like a void function, is technically correct, but it can be valuable to instead steer users to something that will reduce your support burden whenever possible.
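As a rough sketch of the alternative, here is a ping endpoint served over plain GET that always returns a body. The credential and suspension checks are stand-in stubs, and the messages are just the examples from above.

```go
// Sketch: a ping endpoint that always has something to say.
package main

import (
	"fmt"
	"net/http"
)

// These stubs stand in for real credential and account checks.
func validCredentials(r *http.Request) bool { return r.Header.Get("Authorization") != "" }
func accountSuspended(r *http.Request) bool { return false }

func pingHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	switch {
	case !validCredentials(r):
		w.WriteHeader(http.StatusUnauthorized)
		fmt.Fprintln(w, `{"ok": false, "message": "Your authentication has expired."}`)
	case accountSuspended(r):
		w.WriteHeader(http.StatusForbidden)
		fmt.Fprintln(w, `{"ok": false, "message": "The account has been suspended."}`)
	default:
		fmt.Fprintln(w, `{"ok": true, "message": "Valid credentials."}`)
	}
}

func main() {
	http.HandleFunc("/v1/ping", pingHandler)
	http.ListenAndServe(":8080", nil)
}
```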

Put meaningful data in API tokens to reduce DDoS load

I see companies give out authentication tokens all the time that are basically UUIDs or random blobs of text. This is unfortunate, as it removes the ability to do truly useful things with the token. Using AES-128 you can encrypt 16 bytes of data in a single block. Before going into the advantages of doing this, let me first define the caveats:

  1. Since you, like me, are not a crypto expert, you could easily make a small mistake that allows clients to break the key you are using. Keeping this as your mindset will ensure that you don’t accidentally give a caller the ability to gain privileges that they shouldn’t have.
  2. Because of the above, assume that the user already has your AES key. With this assumption in place you can prevent a privilege escalation by ensuring that every decision you make is something that removes the user’s access and never grants access.

So given the caveats above, why would you do this? Well, you can easily reduce the overhead on your authentication system if you use this data as a fence directly on your front end. While this doesn’t eliminate the need to validate the token or authenticate the caller, it does allow you to reject requests much faster and more cheaply than a typical round trip through a data storage layer; one such fence is sketched below.
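A minimal sketch of the idea, under the caveats above: pack an account ID and an expiry timestamp (the choice of fields is my own assumption, not a prescription) into the 16-byte block, and let the front end reject expired tokens before any storage work. Full validation still happens behind the fence.

```go
// Sketch: AES-128 tokens carrying data the front end can use as a cheap fence.
package main

import (
	"crypto/aes"
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"time"
)

var key = []byte("0123456789abcdef") // 16 bytes -> AES-128; keep the real key secret

// mintToken encrypts one 16-byte block: 8 bytes of account ID, 8 bytes of expiry.
func mintToken(accountID uint64, expiry time.Time) (string, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return "", err
	}
	plain := make([]byte, 16)
	binary.BigEndian.PutUint64(plain[0:8], accountID)
	binary.BigEndian.PutUint64(plain[8:16], uint64(expiry.Unix()))
	out := make([]byte, 16)
	block.Encrypt(out, plain)
	return base64.URLEncoding.EncodeToString(out), nil
}

// fence decrypts the token at the front end and rejects obviously-expired
// tokens before any database work. It only ever removes access, never grants it.
func fence(token string) (accountID uint64, ok bool) {
	raw, err := base64.URLEncoding.DecodeString(token)
	if err != nil || len(raw) != 16 {
		return 0, false
	}
	block, _ := aes.NewCipher(key)
	plain := make([]byte, 16)
	block.Decrypt(plain, raw)
	accountID = binary.BigEndian.Uint64(plain[0:8])
	expiry := int64(binary.BigEndian.Uint64(plain[8:16]))
	if time.Now().Unix() > expiry {
		return 0, false // expired: reject cheaply, no storage round trip
	}
	return accountID, true
}

func main() {
	tok, _ := mintToken(42, time.Now().Add(24*time.Hour))
	fmt.Println(tok)
	fmt.Println(fence(tok))
}
```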

Tell the user when to retry

In any error reply it is very important to tell the user when they should retry. Don’t assume that they will back off properly. Obviously the client needs its own logic for network outages and the like, but at the server level you can help them out greatly. For example, if you know that your service will be in a maintenance window for the next 30 minutes, you might tell the client to back off by a full minute rather than just 1 second or some other default. This lets the client know that it might be a while, and also reduces load on your services. It also becomes far easier to block somebody initiating a DDoS by simply banning clients that excessively violate the back-off given to them.

Along these lines you can make sure that you include a “maintenance mode” or “expected outage mode” in your error responses. This lets the user know that the action they are taking won’t work for a while but that it’s not an issue that requires notifying support. Effectively, build your status page into the API response.

Ideally you can also use existing load on the system to establish the retry threshold. If you have 10 connections to Nginx then retry right away, 50 means retry in 10ms, 500 means 100ms, and 10,000 means 1,000ms. This helps shed incoming load so you degrade gracefully rather than simply timing out and failing badly.
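A sketch of that load-based hint, using the rough thresholds above; how you count in-flight requests (here, a simple atomic counter in the handler) is an assumption on my part.

```go
// Sketch: scale the retry hint in error bodies with current load.
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var inFlight int64

// retryMillis maps current load to a suggested client back-off.
func retryMillis(load int64) int64 {
	switch {
	case load < 10:
		return 0
	case load < 50:
		return 10
	case load < 500:
		return 100
	default:
		return 1000
	}
}

// writeError attaches the load-based retry hint to every error body.
func writeError(w http.ResponseWriter, status int, msg string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	fmt.Fprintf(w, `{"message": %q, "retry_after_ms": %d}`+"\n",
		msg, retryMillis(atomic.LoadInt64(&inFlight)))
}

func handler(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&inFlight, 1)
	defer atomic.AddInt64(&inFlight, -1)

	// Pretend the backing store is unavailable for the purpose of the sketch.
	writeError(w, http.StatusServiceUnavailable, "Service is in a maintenance window.")
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```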

Tell the client what to do in the error reply

All too often I see documentation for error codes that includes levels upon levels of different types of codes. Not only is a status returned, but a code is also shoved into the JSON error message: a status of 502 is an internal error, a code of 20 is a permission problem, versus 40 which is an error talking to a back-end server, and so on. This leads to client code with endless switches, custom error types, and the like.

The question I have for somebody designing an API is this: do clients really care? The client rarely cares what obscure error happened on the back end. They want to know what to do next. In your error response, return a human-readable string describing the error, as well as some simple boolean fields that tell the client what to do. Basically, the goal is to avoid something like Facebook’s error codes: http://fbdevwiki.com/wiki/Error_codes

You can tell clients: “The resulting JSON error message will include a field letting you know how many milliseconds to sleep before retrying and a message describing the error; if the HTTP status is 400–499 then the query won’t work if retried, and if it’s 500–599 then it can be retried.” There may be API-specific items to include, but basically think through what the code path after the error looks like and make the error include only the information clients care about.
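For example, the client-side error path can collapse to something like the sketch below; the field names and retry policy here are illustrative, not a spec.

```go
// Sketch: client error handling when the body carries only what callers need.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type apiError struct {
	Message      string `json:"message"`
	RetryAfterMs int64  `json:"retry_after_ms"`
}

// doWithRetry calls the endpoint and retries only when the status class says
// a retry can help (5xx), sleeping however long the server asked for.
func doWithRetry(url string, attempts int) (*http.Response, error) {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err // network problem: the caller's outage logic applies
		}
		if resp.StatusCode < 400 {
			return resp, nil
		}
		var e apiError
		json.NewDecoder(resp.Body).Decode(&e)
		resp.Body.Close()
		if resp.StatusCode < 500 {
			// 4xx: retrying the same request will not work.
			return nil, fmt.Errorf("request rejected: %s", e.Message)
		}
		// 5xx: retryable; honor the server's hint.
		time.Sleep(time.Duration(e.RetryAfterMs) * time.Millisecond)
	}
	return nil, fmt.Errorf("gave up after %d attempts", attempts)
}

func main() {
	resp, err := doWithRetry("https://api.example.com/v1/ping", 3)
	fmt.Println(resp, err)
}
```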

Mark every request with a specific request ID

Every response should include a header like Request-ID: <some identifier>. Ideally this identifier gets generated on all inbound requests and is then attached to everything flowing through the system. This allows you to see that the request came into server x, hit service y, and replied with a 201. This gets better if you think things through a bit and attach the request ID as a comment in MySQL queries and such, so you can see them and track down what initiated the "delete everything" command.

Even better is returning this to the user. This allows the user to tell you about a request failure, including the identifier, so that you can track down the issue quickly rather than having to figure out which line in your endless HTTP logs represents their request. Combine this with Kibana for very quick debugging cycles.

And if you are a developer tool you can even accept a Request-ID from the client. This allows them to extend their identification system all the way into your service. Always use this beside your own, internally generated ID; otherwise you might find that a client is using the same ID for every request. Note, though, that you should sanitize this input, otherwise it might be the source of your next XSS. Force it into a very simple format, like alphanumeric only and at most 32 characters long.
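A minimal sketch of that middleware; echoing the caller's ID back in a separate Client-Request-ID response header is my own convention here, not a standard.

```go
// Sketch: always generate an internal request ID, optionally echo a sanitized
// client-supplied one alongside it.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"regexp"
)

var clientIDPattern = regexp.MustCompile(`^[A-Za-z0-9]{1,32}$`)

func newRequestID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := newRequestID()
		w.Header().Set("Request-ID", id)

		// Accept the caller's ID only if it is plain alphanumeric and short;
		// never let it replace the internally generated one.
		if cid := r.Header.Get("Request-ID"); clientIDPattern.MatchString(cid) {
			w.Header().Set("Client-Request-ID", cid)
		}
		// In a real system, attach `id` to logs and downstream calls here.
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	http.ListenAndServe(":8080", withRequestID(mux))
}
```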

Set your User-Agent

If you have control over the client libraries then it’s very useful to set the User-Agent header to include more than the default values. For example, include the version of the client library, the build of the software, and even the version of the runtime (if it’s dynamic). This gives you an idea of who is using your API and how out of date they are. This seems simple, but it’s amazing how often I see the user agent be nothing more than the same default iPhone agent repeated over and over. It’s also worth noting that you can vastly reduce the length of the agent as well, since the typical browser agent is spammy and long.
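Something along these lines in a client library; the version constants and the exact format are placeholders you would pick for yourself.

```go
// Sketch: a descriptive User-Agent with library, build, and runtime versions.
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

const (
	libraryVersion = "1.4.2"        // version of your SDK
	appBuild       = "2016-05-31.3" // build of the calling software
)

func newRequest(method, url string) (*http.Request, error) {
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("User-Agent",
		fmt.Sprintf("example-sdk/%s (app-build %s; %s)", libraryVersion, appBuild, runtime.Version()))
	return req, nil
}

func main() {
	req, _ := newRequest("GET", "https://api.example.com/v1/ping")
	fmt.Println(req.Header.Get("User-Agent"))
}
```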

Add request headers to help yourself troubleshoot.

Assuming that you have rough control over the clients that are accessing your API, you can easily include valuable information to help debug issues with your site. For example, when a client fails to connect to your site you have no information about why it failed or for how long. Why not have the client let you know once it is able to communicate again? After all, the best monitoring in the world won’t tell you when a single ISP broke routing to your servers for a few hours.

Adding a header like Last-Failure: "[Error -2]: Name of service not known" will give you valuable insight into why your service suddenly saw a drop in traffic. It will also let you know if there is a persistent issue that isn’t being detected by monitoring. I debugged an issue recently where the servers would return an error and leave the client stranded. This would happen several times a day for periods stretching into hours, yet it all appeared to be completely random. The problem was that we couldn’t even really see where the errors were, because it was all external to our network! If our client libraries had exposed this error via the next successful request, we would have been in a much better position to debug it.

Along the same lines, a Retry: 10 header that reports how many times your client has retried the request lets you know how bad a failure your users are experiencing. Seeing lots of 1s would tell you that you served a wide swath of 500s, but each user was minimally impacted. Seeing a handful of 50s would tell you that the issue was far more impactful to a single, specific set of users.
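A sketch of the client side of this: remember the last network failure and the retry count, and report them on the next request that gets through. The header names follow the ones used above.

```go
// Sketch: a client that reports its last failure and retry count.
package main

import (
	"fmt"
	"net/http"
)

type reportingClient struct {
	http.Client
	lastFailure string
	retries     int
}

func (c *reportingClient) Do(req *http.Request) (*http.Response, error) {
	if c.lastFailure != "" {
		req.Header.Set("Last-Failure", c.lastFailure)
	}
	if c.retries > 0 {
		req.Header.Set("Retry", fmt.Sprintf("%d", c.retries))
	}
	resp, err := c.Client.Do(req)
	if err != nil {
		// Remember the failure so the server hears about it next time.
		c.lastFailure = err.Error()
		c.retries++
		return nil, err
	}
	// Success: clear the breadcrumbs.
	c.lastFailure, c.retries = "", 0
	return resp, nil
}

func main() {
	c := &reportingClient{}
	req, _ := http.NewRequest("GET", "https://api.example.com/v1/ping", nil)
	resp, err := c.Do(req)
	fmt.Println(resp, err)
}
```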

WARNING: Since these headers are user provided, it’s important to treat them as non-authoritative. The user might have shoved "500000" into the Retry header thinking that it will get them priority (it shouldn’t), or that it will be funny to watch people get paged (it shouldn’t), or because they increment it with every request. The same goes for the Last-Failure header: they can cram something in there to make you think there is an ongoing issue, so take the hints with a grain of salt. Hints from otherwise good clients with trustworthy track records are more reliable than those from random IPs hitting you from all over the world.

Stop trusting DNS so much!

In most modern production environments the DNS records are configured with a fairly short TTL (Time-To-Live), which allows the destination of the records to be changed quickly. This lets you simply direct clients away from a dead data-center or a broken load-balancer. This is a good setup, and I highly recommend using it, but setting a TTL of 5 minutes does not mean that traffic to the old destination will stop after the 5-minute mark.

You see, lots of ISPs, some older operating systems, and even some popular programming languages ignore TTL settings, either caching far too long or, worse, caching forever. This means that a dead load-balancer or a failed server can cause your site to be down for hours (or days in some cases) with nothing you can do about it. Even worse, it will be really hard to debug (and monitor): it will be down on Verizon but not Comcast, or failing for Java but not for dig.

So, how do you get around this issue? You can’t really set your TTLs lower, and you can’t contact all the providers that are broken because you might not even know who they are! One of the easiest ways I have found is to do cache busting at the DNS level. In order to do this you will need a wildcard DNS entry, along with a wildcard SSL certificate which is signed for both api.example.com and *.api.example.com.

Once a client has a request fail for network reasons, you switch over to using a cache-busting domain name until a request succeeds. Once the request succeeds you can fall back to the normal name, or continue using the cache-buster name until it fails, then generate a new name. Typically a V4 UUID works great for this. Basically you query api.example.com right up until it fails, then you switch to 049b4a2e-e4ba-46e1-9aa2-10061233d902.api.example.com. Note that you should also switch out the cache-busting name if it continues to fail for more than a short period, say 30 seconds or so.
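A rough client-side sketch of this, with placeholder domain names; here the random label is just hex from crypto/rand rather than a full UUID, since the exact format doesn't matter.

```go
// Sketch: switch to a random wildcard subdomain after a network failure.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)

const baseHost = "api.example.com"

type bustingClient struct {
	http.Client
	host       string
	switchedAt time.Time
}

func randomHost() string {
	b := make([]byte, 8)
	rand.Read(b)
	return hex.EncodeToString(b) + "." + baseHost
}

func (c *bustingClient) get(path string) (*http.Response, error) {
	if c.host == "" {
		c.host = baseHost
	}
	resp, err := c.Client.Get("https://" + c.host + path)
	if err == nil {
		return resp, nil
	}
	// Network failure: move to a cache-busting name, and rotate it again if
	// the current one has been failing for more than ~30 seconds.
	if c.host == baseHost || time.Since(c.switchedAt) > 30*time.Second {
		c.host = randomHost()
		c.switchedAt = time.Now()
	}
	return nil, err
}

func main() {
	c := &bustingClient{}
	resp, err := c.get("/v1/ping")
	fmt.Println(resp, err)
}
```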

This gives you the ability to weather a situation where Verizon Wireless has cached the IP address of your API server and refuses to release it, even after your ELB changed IPs a week earlier.

Not in the APEX!

It is the trend right now to put the forward-facing service on the APEX of the domain (for example: Twitter, Slack, or Stripe). This can be okay, but it limits your ability to load-balance your main page. This gets even worse if you are using it for an API endpoint. The reason is that an APEX name cannot be a CNAME (a pointer to another host); it must point directly to an IP address.

The problem is that most services require you to CNAME to them in order to use them: Fast.ly, Amazon’s Elastic Load-Balancers (ELB), Heroku, and so on. If you hardcode the IP address then you risk your site going offline if your provider changes the IP while you weren’t looking.

Now, there are ways around this. For example, Amazon offers Route53, which can put an ELB at the APEX, and CloudFlare will flatten out a CNAME if you are willing to put them in your critical path. Both of these solutions can greatly benefit the web page side of the coin, but for your API try to avoid putting it directly under the APEX. Use an alternate name like api.example.com, or some such. This will vastly simplify your life down the road. Phil Pennock also pointed out another advantage of this: the ability to set up TLS differently for your web site, which has to stay compatible with old, less secure browsers. Your API can run up-to-date crypto settings that might fail on a five-year-old Internet Explorer.

Another suggestion that Phil Pennock made was to put your API version number in the hostname rather than in the path. This allows you to split traffic far more easily, and to wind down old services as traffic switches to the new version. It also means that you can keep an old, crufty TLS configuration for the old version and updated settings on the new one. Most of all, though, it allows you to simply remove the DNS name when you are done supporting a very, very old version of your API.

