Postmortem:
We want to share a postmortem on this incident since it took us an unusually long time to identify its root cause and resolve it, and since it affected an unusually large number of users throughout its course.
Summary:
The cause of this incident was a failure to allocate sufficient resources to, or put sufficient monitoring in place for, an existing background job queue after assigning a new background job to it. To avoid incidents of this type in the future, we have implemented a pre-deploy process for features entailing new background jobs, a type of change we have made less and less frequently over the past several years as our codebase and infrastructure have matured.
Cause of incident:
- Early last week (Mon, Apr 1), we deployed an optimization meant to address Gateway Timeout errors experienced by a small number of customers with massively parallel builds (builds with hundreds of parallel jobs).
- As part of this optimization, we moved a common process, “Job creation,” to a new background job and, in a mindset of "this is an experiment, let's see how it goes," chose a readily available, traffic-free queue (our default queue), released it to production, and watched it for a day and a half with good results. The change resolved the issue we aimed to fix, and all looked good from the standpoint of error tracking and performance.
- Unfortunately, while we considered traffic in our selection of a queue during initial implementation, we did not consider the need to create a permanent, dedicated queue for the new background job (which also represented a new class of background job). Nor, after seeing good performance on Mon-Tue, did we evaluate the need to change any configuration details for our default queue, which turned out to be not only insufficiently resourced, but also insufficiently monitored.
- As a result, when we entered our busiest period later in the week (Wed-Thu), the newly utilized queue backed up. But we didn't know it, because we had no visibility into that queue. And since the nature of the new background job (Job creation) was such that it preceded a full series of subsequent jobs, it began acting as a gateway mechanism, artificially limiting traffic to downstream queues, which were being monitored, and where, of course, everything looked hunky-dory across all of those metrics. (A basic queue-health check, like the one sketched after this list, would have surfaced the backlog.)
- By the time we realized what was going on, we had 35K jobs stuck in the newly utilized queue.
- At that point, the issue was easy to fix: first by scaling up, and then by allocating proper resources to the new queue going forward. But for most of the day we did not understand what was going on, so the backlog kept causing problems for those hours and, as backed-up jobs accrued, affected a growing number of users as time ticked by.
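For context on the visibility gap mentioned above: Sidekiq exposes queue depth and latency through its public API, so even a very simple periodic check would have flagged the backlog in the default queue early on. Below is a minimal sketch of that kind of check, not our actual monitoring code; the thresholds and the alerting hook are placeholder assumptions.

```ruby
# Minimal queue-health check using Sidekiq's public API (sidekiq/api).
# Thresholds and the alerting hook are placeholders, not production values.
require "sidekiq/api"

MAX_SIZE    = 1_000 # jobs waiting in a queue (assumed threshold)
MAX_LATENCY = 300   # seconds the oldest job has been waiting (assumed threshold)

Sidekiq::Queue.all.each do |queue|
  size    = queue.size     # number of enqueued jobs
  latency = queue.latency  # age, in seconds, of the oldest job in the queue

  if size > MAX_SIZE || latency > MAX_LATENCY
    # Replace with a real alerting integration (PagerDuty, Slack, etc.)
    warn "Sidekiq queue #{queue.name} is backing up: size=#{size}, latency=#{latency.round}s"
  end
end
```

The point of the sketch is that a queue with no dedicated monitoring, like our default queue here, is exactly where a backlog can hide.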
Actions taken to avoid future incidents of this type:
Hindsight being 20/20, we clearly could have avoided this incident with a little more process around deploys of certain types of features---in particular, features entailing the creation of new background jobs (something we had not done in any significant way for over a year prior).
As avoidable as the initial misstep here was, its impact was magnified by the way it led us to miss the true underlying issue for most of an 18-hour period, which is just not acceptable in a production environment.
In response to this incident, we have added the following new step to our deployment process:
- Prior to deployment, if changes entail the creation of any new background jobs, or modification of any existing background jobs, we must evaluate the need to update our Sidekiq configuration, including the creation of any new workers or worker groups.
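To illustrate the kind of change this step is meant to prompt, here is a rough sketch; the class and queue names are hypothetical, not our actual code. The point is that a new class of background job should get its own dedicated queue, and that queue then has to be wired into the Sidekiq process configuration and monitoring, not just declared on the worker.

```ruby
# Hypothetical example of routing a new class of background job to its own
# dedicated queue instead of piggybacking on "default".
require "sidekiq"

class JobCreationWorker
  include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

  # Route this class of work to a dedicated queue.
  sidekiq_options queue: "job_creation"

  def perform(build_id)
    # ... create the downstream jobs for the given build ...
  end
end

# Declaring the queue on the worker is not enough on its own: a Sidekiq
# process must also be configured to pull from it (e.g. by adding
# "job_creation" to the queues list in config/sidekiq.yml, or by running a
# dedicated worker group for it), and the queue needs its own monitoring.
```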
We’ve been operating Coveralls.io for over 13 years now, but we are, of course, far from perfect in doing so, and, clearly, we still make mistakes. While mistakes are probably unavoidable, our main goal in addressing them is to try not to make the same mistake twice. This was a new one for us (or at least new in recent years for our current team), and it has caused us to shore up our SOPs around deploys in a way that should reduce this type of incident in the future.