Reason for the incident:
We failed to rotate to the new RDS CA and update the corresponding SSL certificates on all database clients before the previous CA expired. We underestimated the impact of missing that deadline and scheduled the work as low-priority housekeeping. As a result, the ticket was never prioritized and the changes were not made in time to avoid this incident.
Reason for the response time:
While implementing the fix, we had to work through verbose documentation that made it difficult to identify the correct procedure for our setup, especially under pressure. Once we had identified the procedure and deployed a fix, our database clients in production still could not establish connections using the freshly downloaded certificate, even though the same certificate worked in tests from local machines. The connections only succeeded after we manually copied the certificate contents into an existing file that the application already referenced. We still do not know why this was necessary, and the confusion added at least an hour to our response time as we cycled through other applicable certificates and recovered from failed deployments.
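For reference, below is a minimal sketch of how a client can be pointed at a new CA bundle and checked end to end. It assumes a PostgreSQL RDS instance and the psycopg2 driver; the hostname, user, and bundle path are placeholders for illustration, not values from this incident, and other engines or drivers use their own SSL options.

    # Minimal sketch: verify a database connection against a new RDS CA bundle.
    # Assumes PostgreSQL and psycopg2; hostname, user, and bundle path are
    # placeholders, not values from this incident.
    import os
    import psycopg2

    conn = psycopg2.connect(
        host="example-db.xxxxxxxx.us-east-1.rds.amazonaws.com",   # placeholder endpoint
        dbname="postgres",
        user="app_user",                                          # placeholder user
        password=os.environ["DB_PASSWORD"],                       # supplied via environment/secrets
        sslmode="verify-full",                                    # reject untrusted server certificates
        sslrootcert="/etc/ssl/rds/global-bundle.pem",             # path to the downloaded CA bundle
    )
    with conn.cursor() as cur:
        cur.execute("SELECT 1")   # simple round trip to confirm the TLS handshake succeeded
        print(cur.fetchone())
    conn.close()

Running a check like this from the production environment as well as from local machines can surface environment-specific certificate-path issues like the one described above.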
How to avoid the incident in the future:
We will treat all notices from infrastructure providers as requiring review by multiple stakeholders at different levels, and will handle priority infrastructure upgrades through our established procedure: as scheduled events, completed in a timely manner, with review and sign-off.