Postmortem: Intermittent 500 Errors on Coverage Uploads
Summary
Between September 20–24, some customers experienced intermittent 500 Internal Server Error responses during coverage uploads (POST /api/v1/jobs). The issue was initially hard to diagnose because:
- Failures did not surface reliably in our error tracker (BugSnag).
- They appeared to affect only some requests and only some customers.
Impact
- Some coverage uploads failed to process, causing build reporting delays or gaps.
- Frequency was low enough to appear intermittent, which delayed detection and resolution.
Timeline
- Sep 20–23: First customer reports of intermittent 500s. Initial theories involved a regression in a recent release of our coverage-reporter integration (client-side).
- Sep 23–24: Deep log analysis across ELB and application logs revealed errors concentrated on a single web server.
- Sep 24: Confirmed that server alone was responsible for thousands of SSL-related failures (Faraday::SSLError, OpenSSL::SSL::SSLError, Seahorse::Client::NetworkingError). Other servers were clean.
- Sep 24: Mitigation: that server was destroyed. Errors ceased immediately.
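The per-server breakdown that located the bad host can be sketched roughly as follows. This is a minimal illustration, not the exact analysis we ran: the log line layout (a `host=<name>` field), the host names, and the sample lines are all made up for the example. Only the three error class names come from the incident.

```python
import re
from collections import Counter

# Error class names observed in the incident.
SSL_ERROR_NAMES = (
    "Faraday::SSLError",
    "OpenSSL::SSL::SSLError",
    "Seahorse::Client::NetworkingError",
)

# Assumed (hypothetical) log layout: each line carries a "host=<name>" field.
HOST_PATTERN = re.compile(r"\bhost=(\S+)")


def count_ssl_errors_by_host(lines):
    """Count lines mentioning one of the SSL error classes, grouped by host."""
    counts = Counter()
    for line in lines:
        if not any(name in line for name in SSL_ERROR_NAMES):
            continue  # not an SSL-related failure
        match = HOST_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_lines = [
    "Sep 24 10:01:02 host=web-3 Faraday::SSLError (sample line)",
    "Sep 24 10:01:05 host=web-1 Completed 200 OK (sample line)",
    "Sep 24 10:01:09 host=web-3 OpenSSL::SSL::SSLError (sample line)",
    "Sep 24 10:01:11 host=web-2 Completed 200 OK (sample line)",
]

for host, count in count_ssl_errors_by_host(sample_lines).most_common():
    print(f"{host}: {count} SSL-related errors")
```

A heavily skewed count, with one host carrying nearly all errors, is the signal that points at a single misconfigured server rather than a code regression.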
Root Cause
We believe this additional web server was provisioned during autoscaling with a different Ubuntu version than the rest of the fleet. This appears to have left it with a broken or outdated CA certificate store, so outbound SSL connections (GitHub, Travis, etc.) failed intermittently. Those failures then bubbled up as a 500 error on the original request (POST /api/v1/jobs).
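One way to test this hypothesis on a suspect server is to attempt a verified TLS handshake against an upstream the application depends on. The sketch below uses only the Python standard library and the system's default trust store. The function name is hypothetical, and the choice of endpoints to probe is an assumption, not something taken from our tooling.

```python
import socket
import ssl


def tls_handshake_ok(host, port=443, timeout=5.0):
    """Return True if a certificate-verified TLS handshake with host succeeds.

    Uses the system default CA store, so a broken or outdated store on
    this machine should make the handshake fail verification.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except (ssl.SSLError, OSError):
        # Covers certificate verification failures as well as DNS,
        # connection, and timeout errors.
        return False


# Example usage (requires network access; the host is illustrative):
# print(tls_handshake_ok("github.com"))
```

Because the function returns False for any network failure, a False result on one server only becomes meaningful when the same probe succeeds on the rest of the fleet.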
Resolution
- Problematic server removed from service.
- Future mitigation: verify baseline OS/version and CA store when adding new servers, especially via automation.
- Next step: document the correct procedure for disabling a single server in the Cloud66 load balancers instead of destroying it outright, so we can retain the server for forensic investigation.
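The baseline OS check could be automated along these lines. This is a sketch under stated assumptions: it parses the standard `/etc/os-release` key=value format, and the expected values (including the `22.04` version) are placeholders, since the fleet's actual baseline is not recorded here. The function names are hypothetical.

```python
def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def baseline_mismatches(fields, expected):
    """Return {key: (expected, actual)} for every key that differs."""
    return {
        key: (want, fields.get(key))
        for key, want in expected.items()
        if fields.get(key) != want
    }


# Placeholder baseline; substitute the fleet's real values.
EXPECTED_BASELINE = {"ID": "ubuntu", "VERSION_ID": "22.04"}

if __name__ == "__main__":
    with open("/etc/os-release") as handle:
        problems = baseline_mismatches(
            parse_os_release(handle.read()), EXPECTED_BASELINE
        )
    if problems:
        print("Baseline mismatch:", problems)
    else:
        print("OS baseline OK")
```

Running a check like this as a gate before a new server joins the load balancer would catch a mismatched image before it serves customer traffic.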
Lessons Learned
- Errors can hide if they don’t surface in the error tracker. Direct log analysis is essential.
- Even one misconfigured server can cause significant customer impact.
- Consistency in OS/base image and CA certificate store across the fleet is critical.