Incident Postmortem: Database Partitioning Bottleneck & Job Backlog
Summary
A database partitioning limitation caused a severe backlog of background jobs, degrading build processing times from Monday, February 10, through Friday, February 14, 2025. The backlog stemmed from excessive autovacuum contention on a high-growth table, which cascaded into failures across job processing, database performance, and monitoring visibility.
Root Cause
Several tables in our production database grow at an accelerated rate. While we employ a partitioning strategy to prevent them from becoming unwieldy, our time-based approach failed to transition a critical table before it reached an unmanageable size.
During investigation, we found:
- The table was in a perpetual state of autovacuum due to an excessive number of dead tuples.
- The sheer volume of dead tuples prevented autovacuum from progressing beyond its heap-scanning phase, causing tuple locks (the diagnostic queries sketched after this list show how both symptoms surface in Postgres's statistics views).
- These locks delayed regular transactions, creating a backlog of pending transactions that worsened over time.
- By late Tuesday, February 11, the backlog had reached a breaking point, causing tens—eventually hundreds—of thousands of jobs to accumulate.
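For readers who want to reproduce this kind of diagnosis, the following is a minimal sketch of the checks involved. It relies only on PostgreSQL's standard pg_stat_user_tables and pg_stat_progress_vacuum views, so no table names from our own schema are assumed:

```sql
-- Dead-tuple buildup per table: a table stuck in "perpetual autovacuum"
-- shows an n_dead_tup count that never meaningfully shrinks between runs.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1), 2) AS dead_ratio,
       last_autovacuum,
       autovacuum_count
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;

-- Live autovacuum progress: a worker pinned at phase = 'scanning heap' with
-- heap_blks_scanned far below heap_blks_total matches the behavior we saw.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total
FROM   pg_stat_progress_vacuum AS p
JOIN   pg_class AS c ON c.oid = p.relid;
```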
Impact
Background Job Delays
- A significant job queue buildup occurred between February 10 and February 11.
- Clearing the backlog took an additional two days (February 11–13).
- Failed jobs in long-tail retries prolonged the impact for another 24–36 hours.
Monitoring Gaps & Alert Failures
- Average job duration alerts triggered only after the queue size became a critical issue.
- As server load increased, metric collection itself stalled, suppressing the alerts that could have provided earlier intervention signals.
Database & Infrastructure Overload
- Scaling up resources to clear the backlog introduced additional database contention due to high transaction volumes, exacerbating delays.
- The increased database load led to degraded server performance, disconnecting our orchestration layer and APM monitoring.
- This created a self-reinforcing failure loop that required continuous manual intervention from February 12 to February 13.
Resolution
- We transitioned the affected table, significantly relieving the bottleneck.
- We scaled up resources to process the backlog, though this required careful throttling to avoid further database contention.
- On Thursday, February 13, we placed the site into maintenance mode for 30 minutes to reduce load, though restoring full stability ultimately took nearly two hours.
- To prevent immediate re-saturation, we deferred processing of some older jobs to lower-traffic periods (the sketch after this list illustrates the idea).
- By Thursday evening, build times stabilized as overall traffic declined.
- By Friday morning, February 14, all remaining queued jobs had been processed without further intervention.
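The deferral itself is simple in concept: anything that had been sitting in the queue since before the incident window gets pushed out to an overnight slot so that fresh builds are not starved. The sketch below is illustrative only; background_jobs, state, run_at, and enqueued_at are hypothetical names standing in for our actual job-queue schema.

```sql
-- Illustrative only: reschedule jobs queued for more than 24 hours to a
-- low-traffic overnight slot (~3 AM the next day), so current builds run first.
-- The table and column names here are hypothetical, not our actual schema.
UPDATE background_jobs
SET    run_at = date_trunc('day', now()) + interval '1 day 3 hours'
WHERE  state = 'queued'
  AND  enqueued_at < now() - interval '24 hours';
```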
Next Steps
Finalizing Database Transitions
- To fully resolve performance degradation, we transitioned two additional tables closely related to the affected table.
- This was completed during a maintenance window on Saturday, February 15 (8 PM – 11:59 PM PST).
Long-Term Database Optimizations
- We will perform VACUUM FULL on legacy tables to remove ~36B dead tuples and reclaim the disk space they hold (see the sketch after this list).
- Further maintenance windows will be scheduled during late-night weekend hours.
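Because VACUUM FULL rewrites the table into new files and holds an ACCESS EXCLUSIVE lock for the duration, it can only run inside a maintenance window like the ones above. A minimal sketch, using an illustrative table name rather than our actual schema:

```sql
-- VACUUM FULL rewrites the table, discarding dead tuples and returning the
-- reclaimed space to the operating system. It takes an ACCESS EXCLUSIVE lock,
-- which is why it is confined to a maintenance window.
-- ("legacy_builds" is an illustrative name, not our actual schema.)
VACUUM (FULL, VERBOSE, ANALYZE) legacy_builds;

-- Afterwards, confirm the dead-tuple count for the table has dropped.
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'legacy_builds';
```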
Partitioning Strategy Enhancements
- We are evaluating size-based partitioning, or a refined time-based strategy with shorter intervals, to prevent similar issues; a sketch of the range-partitioning mechanics follows below.
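For context on what the time-based option looks like, the sketch below uses PostgreSQL's standard declarative range partitioning with monthly partitions; the table and column names are hypothetical, and a size-based scheme would differ mainly in when new partitions are cut rather than in this DDL.

```sql
-- Illustrative only: declarative range partitioning on a timestamp column,
-- with monthly partitions so no single child table grows unmanageably large.
-- ("job_events" and "created_at" are hypothetical names, not our actual schema.)
CREATE TABLE job_events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE job_events_2025_02 PARTITION OF job_events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

CREATE TABLE job_events_2025_03 PARTITION OF job_events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');
```

Autovacuum then works partition by partition, so a stall on one month's data no longer ties up the table as a whole.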
Improved Monitoring & Alerting
- We will introduce earlier warning thresholds to detect job queue buildup before it becomes critical.
- We will enhance database contention monitoring to catch autovacuum failures and lock contention earlier (illustrative checks are sketched below).
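To make the intent concrete, here are the kinds of checks such thresholds could sit on. The queue-depth query assumes a hypothetical job table named background_jobs with a state column; the lock-wait query uses only the standard pg_stat_activity view:

```sql
-- Early warning 1: queue depth. Alert well before the count approaches the
-- hundreds of thousands of jobs seen during this incident.
-- ("background_jobs" and "state" are hypothetical names, not our actual schema.)
SELECT count(*) AS queued_jobs
FROM   background_jobs
WHERE  state = 'queued';

-- Early warning 2: sessions stuck waiting on locks, which is how the
-- autovacuum contention first surfaced. Uses only pg_stat_activity.
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       left(query, 80)     AS query_snippet
FROM   pg_stat_activity
WHERE  wait_event_type = 'Lock'
ORDER  BY query_start;
```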
Conclusion
Even after 12+ years in production, incidents like this remind us of the importance of continually evolving our data management and monitoring practices. As Coveralls scales, we are committed to refining our approach to proactively address infrastructure challenges before they affect users.
We sincerely apologize to all users affected by this incident. If you need assistance with historical builds or workflow adjustments, or if you'd like to share feedback, please contact us at support@coveralls.io. Your input will help us shape future improvements.