OverDrive Developers

SERVICES DEGRADED - Patron Authentication - Tuesday 24 June 2014 - RESOLVED

OverDrive API Partners

Over the past three weeks, we have been analyzing the traffic to our Patron Authentication API in an attempt to diagnose intermittent outages. We have found that the outages were caused by burst traffic to ILS servers (Integrated Library Systems, used for user authentication) which were not responding to requests or were terminating connections prematurely. Because of the structure of the Patron Authentication service, an unresponsive ILS may impact more than its own traffic through the Patron Authentication API. Furthermore, while an ILS is unresponsive, some incoming traffic will queue up behind the failing requests.

To help protect any downstream ILS servers from being overwhelmed with failing traffic or a sudden burst of queued requests, we have implemented a "circuit breaker." The circuit breaker watches for sequential failures from an individual ILS. If there are more than a predefined number of sequential failures (currently 5), the breaker is "tripped": it temporarily times out the ILS and denies all further authentication requests to that ILS.

When the time-out expires (currently 2 minutes), the circuit breaker allows 1 request through as a trial. If that request succeeds, the circuit breaker resets itself. If it fails, the time-out starts over for another period. This cycle repeats until a trial request succeeds and the circuit breaker can reset.
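For partners who want to reason about this behavior, the logic described above can be sketched roughly as follows. This is an illustration only and not the code running in the Patron Authentication service: the class name, method names, and injectable clock are hypothetical, and it assumes the breaker trips once the count of sequential failures exceeds the threshold.

```python
import time


class CircuitBreaker:
    """Illustrative per-ILS circuit breaker (hypothetical, not OverDrive's code)."""

    def __init__(self, max_failures=5, timeout_seconds=120.0, clock=time.monotonic):
        self.max_failures = max_failures        # "currently 5" in the announcement
        self.timeout_seconds = timeout_seconds  # "currently 2 minutes"
        self.clock = clock                      # injectable so the sketch can be tested
        self.failures = 0                       # sequential failures while closed
        self.tripped_at = None                  # None means the breaker is not tripped
        self.trial_in_flight = False            # True while the single trial is pending

    def allow_request(self):
        """Return True if a request may be sent to the ILS right now."""
        if self.tripped_at is None:
            return True
        if self.clock() - self.tripped_at < self.timeout_seconds:
            return False  # still inside the time-out: deny
        if self.trial_in_flight:
            return False  # only 1 trial request is allowed at a time
        self.trial_in_flight = True
        return True

    def record_success(self):
        """Any success resets the breaker completely."""
        self.failures = 0
        self.tripped_at = None
        self.trial_in_flight = False

    def record_failure(self):
        if self.trial_in_flight:
            # The trial request failed: start another time-out period.
            self.trial_in_flight = False
            self.tripped_at = self.clock()
        elif self.tripped_at is None:
            self.failures += 1
            # Assumption: trip only when failures exceed the threshold.
            if self.failures > self.max_failures:
                self.tripped_at = self.clock()
```

A caller would check `allow_request()` before contacting the ILS and report the outcome with `record_success()` or `record_failure()`; one breaker instance would be kept per ILS so that a single struggling ILS does not affect the others.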

Although the circuit breaker helps keep bursts of communication errors under control, the primary objective of the feature is to protect our downstream clients. It is designed to control traffic to a struggling ILS server so that it has time to recover before we send further traffic.

Additionally, the Patron Authentication API has been updated to properly report this activity back to our API clients. When a request to the Patron Authentication API encounters a "tripped" circuit breaker, the response will be an HTTP 504 with the following message in the JSON body: “Too many errors from the underlying ILS, throttling communication to it temporarily. Please try again shortly.”
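On the client side, one possible way to handle this response is to wait and retry a limited number of times. The sketch below is an assumption about how a partner might do this, not an official recommendation: the function name, the `send_request` callable (which returns a status code and body), and the default wait and attempt counts are all hypothetical. It checks only the status code, since the name of the JSON field carrying the message is not specified here.

```python
import time


def authenticate_with_retry(send_request, max_attempts=3, wait_seconds=30.0,
                            sleep=time.sleep):
    """Call send_request() and retry after a pause while it returns HTTP 504.

    send_request is a hypothetical callable returning (status_code, body).
    Returns the last (status_code, body) pair received.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send_request()
        if status != 504:
            return status, body  # success or a different error: do not retry
        if attempt < max_attempts:
            # The breaker may be tripped; pause before trying again.
            sleep(wait_seconds)
    return status, body
```

Because the time-out is currently 2 minutes, retrying immediately is unlikely to succeed; spacing out attempts, and surfacing a "try again shortly" message to the patron once attempts are exhausted, avoids adding further load.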

Thanks for your patience as we diagnosed this issue and identified the best possible solution for it.

-The OverDrive API Team