Coveralls is in Read Only mode while we work on updating the system. Sorry for the inconvenience.

VIGILANCE! Check this page any time you notice a problem with Coveralls

Elevated 504 Timeout Errors

Incident Report for Coveralls

Postmortem

This is a postmortem on this specific issue: Intermittent 500 Errors on Coverage Uploads

Summary
Between September 20–24, some customers experienced intermittent 500 Internal Server Error responses during coverage uploads (POST /api/v1/jobs). The issue was initially hard to diagnose because:

  • Failures did not surface reliably in our error tracker (BugSnag).
  • They appeared to affect only some requests from some customers.

Impact

  • Some coverage uploads failed to process, causing build reporting delays or gaps.
  • Frequency was low enough to appear intermittent, which delayed detection and resolution.

Timeline

  • Sep 20–23: First customer reports of intermittent 500s. Initial theories involved a regression in a recent release of our coverage-reporter integration (client-side).
  • Sep 23–24: Deep log analysis across ELB and application logs revealed errors concentrated on a single web server.
  • Sep 24: Confirmed that this one server was responsible for thousands of SSL-related failures (Faraday::SSLError, OpenSSL::SSL::SSLError, Seahorse::Client::NetworkingError). The other servers were clean.
  • Sep 24: Mitigation: that server was destroyed. Errors ceased immediately.
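
For illustration only, the per-server log analysis described in the timeline can be sketched as below. This is not the tooling that was actually used: the log line format, the `host=` field, and the host names (`web-1`, `web-7`) are assumptions made for the example. Only the exception class names come from the report.

```python
import re
from collections import Counter

# Exception class names taken from the timeline above.
SSL_ERROR_NAMES = (
    "Faraday::SSLError",
    "OpenSSL::SSL::SSLError",
    "Seahorse::Client::NetworkingError",
)

# Assumption: every log line carries a "host=<name>" field.
HOST_FIELD = re.compile(r"\bhost=(\S+)")


def count_ssl_errors_by_host(lines):
    """Return a Counter mapping host name to the number of SSL-related error lines."""
    counts = Counter()
    for line in lines:
        if not any(name in line for name in SSL_ERROR_NAMES):
            continue  # not one of the SSL-related failures
        match = HOST_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines; real ELB and application logs will look different.
sample_log = [
    "2025-09-24T10:00:01Z host=web-7 POST /api/v1/jobs 500 Faraday::SSLError",
    "2025-09-24T10:00:02Z host=web-1 POST /api/v1/jobs 200",
    "2025-09-24T10:00:03Z host=web-7 POST /api/v1/jobs 500 OpenSSL::SSL::SSLError",
    "2025-09-24T10:00:04Z host=web-2 GET /repos 200",
]

if __name__ == "__main__":
    # Hosts with the most SSL-related errors are listed first.
    for host, total in count_ssl_errors_by_host(sample_log).most_common():
        print(host, total)
```

A count that is heavily skewed toward one host, as in the sample, is the kind of signal that points at a single misconfigured server rather than a client-side regression.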

Root Cause
We believe the likely cause is that this additional web server was provisioned during autoscaling with a different Ubuntu version than the rest of the fleet. This appears to have left it with a broken or outdated CA certificate store, causing outbound SSL connections (GitHub, Travis, etc.) to fail intermittently. Each such failure then bubbled up as a 500 error for the original request (POST /api/v1/jobs).

Resolution

  • Problematic server removed from service.
  • Future mitigation: verify baseline OS/version and CA store when adding new servers, especially via automation.
  • Next step: document the correct procedure to disable a single server in Cloud66 load balancers, instead of destroying it outright, so we can retain the server for forensic investigation.
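
As a rough sketch of what a baseline check on a newly added server might look like (an illustration under stated assumptions, not the procedure actually adopted): the function names, the expected version `"22.04"`, and the sample `os-release` text are all hypothetical. The CA check is also only a coarse signal. `SSLContext.get_ca_certs()` does not count certificates that live only in a `capath` directory, and a non-empty store does not prove the store is current, so a real outbound TLS handshake would be a stronger test.

```python
import ssl


def parse_os_release(text):
    """Parse /etc/os-release style KEY=VALUE text into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def baseline_problems(os_release_text, expected_version_id, ca_cert_count):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    version = parse_os_release(os_release_text).get("VERSION_ID")
    if version != expected_version_id:
        problems.append(f"OS version {version!r} != expected {expected_version_id!r}")
    if ca_cert_count == 0:
        problems.append("default CA store loaded no certificates")
    return problems


def local_ca_cert_count():
    """Count CA certificates loaded into Python's default SSL context.

    Caveat: certificates that exist only in a capath directory are not counted.
    """
    context = ssl.create_default_context()
    return len(context.get_ca_certs())


if __name__ == "__main__":
    # On a real host, read the file and compare against the fleet's known version.
    with open("/etc/os-release") as handle:
        report = baseline_problems(handle.read(), "22.04", local_ca_cert_count())
    print(report or "baseline OK")
```

Running a check like this before a new server joins the load balancer would flag an OS mismatch early, instead of after customers see errors.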

Lessons Learned

  • Errors can hide if they don’t surface in the bug tracker. Direct log analysis is essential.
  • Even one misconfigured server can cause significant customer impact.
  • Consistency in OS/base image and CA certificate store is critical.
Posted Sep 25, 2025 - 09:41 PDT

Resolved

500 Internal Server Errors on Uploads

The recent 500 error surfacing during some coverage uploads as:
> ⚠️ Internal server error. Please contact Coveralls team.

has been resolved. A full postmortem will be published here soon. In the meantime, you can find more detail in the main tracking issue:
https://github.com/coverallsapp/coverage-reporter/issues/180

Summary
The root cause was ultimately infrastructure-related, not a regression in recent coverage-reporter releases. The previous workaround of pinning your coverage-reporter version is therefore not required.

We have decided to close this incident, which we intentionally kept open for over a week to track a series of 504 and 5xx issues with overlapping root causes. In hindsight, the broadened scope made updates less clear than we'd hoped. With today’s resolution and the mitigations applied throughout the week, the occurrence of 504 errors during uploads (POSTs) has been significantly reduced. Going forward, any new 504 errors should be considered unexpected, isolated events.

At the same time, we continue work on several instances of intermittent GET-related 504 errors affecting:

- Source File pages
- Repo pages
- Add Repos pages

Progress on those issues will be reported separately here:
https://github.com/lemurheavy/coveralls-public/issues/1757
Posted Sep 24, 2025 - 17:53 PDT

Update

Fix for unrelated 500 errors:

If you receive a `500` error with this error message format:
> ⚠️ Internal server error. Please contact Coveralls team.

Please know it is unrelated to the `504` errors being monitored in this open incident.

Those intermittent `500` errors are caused by a regression in one of the latest coverage-reporter releases: `v0.6.16` or `v0.6.17`.

Workaround:
Pin your coverage-reporter-version to `v0.6.15` in your integration config.

For thorough instructions, see this public issue:
https://github.com/coverallsapp/coverage-reporter/issues/180

We’re investigating the root cause and will post updates once a fix is released.
Posted Sep 23, 2025 - 12:14 PDT

Update

Mitigated – Monitoring

All systems operational.

Recent mitigations, including fleet expansion and autoscaling, have reduced 504 timeout reports significantly. The remaining reports are infrequent and occur mostly during overnight and weekend hours (PDT).

We are continuing to monitor closely and are working on a multi-part solution to eliminate all known causes. Until then, we are keeping this incident open in Monitoring. We will close it once 504 errors have returned to being unexpected, isolated events.
Posted Sep 22, 2025 - 09:43 PDT

Update

Mitigation in place.

All systems operational.

This morning we deployed additional capacity and autoscaling measures to reduce 504 errors on coverage report uploads:

- Doubled our web server fleet (on top of the prior doubling when this issue began).
- Enabled autoscaling at the web layer, allowing the fleet to double again automatically when NGINX response times exceed thresholds.
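
To make the autoscaling rule above concrete, here is a minimal sketch of a "double the fleet when response times exceed a threshold" decision. It illustrates the idea only and is not the actual autoscaling configuration: the use of the 95th percentile, the threshold value, the fleet sizes, and the `max_size` cap are all assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, to avoid floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def desired_fleet_size(current_size, response_times_s, threshold_s, max_size):
    """Double the fleet (up to max_size) when p95 response time exceeds the threshold."""
    if not response_times_s:
        return current_size  # no data, no change
    if percentile(response_times_s, 95) > threshold_s:
        return min(current_size * 2, max_size)
    return current_size


if __name__ == "__main__":
    # Hypothetical sample: half of recent requests took 8 seconds.
    recent = [0.2] * 10 + [8.0] * 10
    print(desired_fleet_size(4, recent, threshold_s=5.0, max_size=16))
```

Using a high percentile rather than the mean keeps a burst of slow upload requests from being averaged away by many fast ones.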

The underlying trigger remains rare surges of upload requests from outlier repositories (750–1250 uploads per build). While we have paused processing for these repos, our HTTP servers must still handle the incoming requests until they stop.

Timezone coverage:
As a small team based in Los Angeles (PDT), our ability to respond in real time is most limited overnight (10 PM–6 AM PDT). Unfortunately, the primary outlier repos are in APAC, making this the window of highest risk. With these changes, we hope to reduce the occurrence of upload 504s during this window.

We will monitor results closely and continue tuning autoscaling thresholds. Please let us know if you continue to see 504 errors on uploads.
Posted Sep 15, 2025 - 11:57 PDT

Update

All systems operational.

Earlier today (6:45–7:45 AM PDT), we received elevated reports of 504 timeout errors. We have not been able to reproduce the issue since, but if you are still experiencing errors, please contact us at support@coveralls.io.

The affected areas may include:

- Coverage Report Uploads (/api/v1/jobs)
- Add Repos Page
- Repo Page
- Source File Page

Fixes for the Add Repos, Repo, and Source File pages are scheduled to be deployed by end of day (PDT).
Posted Sep 12, 2025 - 08:44 PDT

Update

All systems operational.

We have released the first of two parts of a near-term solution into production, which resolves a small subset of 504 errors. We are still working on releasing part two.

Subscribe for updates at this status page, or follow this public tracking issue for updates:
https://github.com/lemurheavy/coveralls-public/issues/1757
Posted Sep 09, 2025 - 08:24 PDT

Update

All systems operational.

Continuing to keep this open until we have released our short-term fix into production.

Subscribe for updates at this status page, or follow this public tracking issue for updates:
https://github.com/lemurheavy/coveralls-public/issues/1757
Posted Sep 08, 2025 - 11:57 PDT

Update

We are still working on a near-term fix. We will post an update on this page, and on the issue below, when it is complete:
https://github.com/lemurheavy/coveralls-public/issues/1757
Posted Sep 05, 2025 - 09:05 PDT

Monitoring

We’re currently seeing elevated reports of 504 Timeout errors affecting some customers on a subset of Coveralls pages, including:

- Source File pages
- Repo pages
- Add Repos pages

All systems and pages are generally operational; a subset of customers are experiencing these errors, sometimes intermittently.

There is a public tracking issue for the Source File timeout errors here:
https://github.com/lemurheavy/coveralls-public/issues/1757

Fix in progress:
We’re implementing a short-term fix over the next 24–48 hours, which should eliminate the timeouts.

A longer-term fix is also planned and will roll out over several weeks. Early phases of that implementation should also reduce the request times that were originally triggering the 504 timeouts.

What you can do:
If you're currently affected, we recommend following updates here, and subscribing to the public issue: https://github.com/lemurheavy/coveralls-public/issues/1757

If your issue pattern differs from the above, or you suspect a different root cause, reach out to support@coveralls.io and we'll investigate.
Posted Sep 03, 2025 - 11:00 PDT
This incident affected: Coveralls.io Web and Coveralls.io API.