Partial Outage - Database Connection Overload

Incident Report for Coveralls

Postmortem

Coveralls Incident Postmortem

Date of Incident: April 6–7, 2025
Published: April 11, 2025

Introduction

This postmortem outlines a recent incident that affected performance and stability for some of our customers. In the spirit of transparency and continuous improvement, we’re sharing what happened, what we learned, and how we’re moving forward.

While our team has deep experience managing production infrastructure, we aren’t full-time PostgreSQL specialists. We’re full-stack developers — and this incident pushed us deeper into the internals of our database than we’d ever had to go before. It revealed blind spots, forced urgent decisions, and ultimately gave us a deeper understanding of what it will take to make Coveralls more resilient going forward.

One of the nice things about developing an application for developers is that when we share our experience with technical challenges and lessons learned, we know many of our customers will relate.

Incident Summary

Between Sunday, April 6 (PDT), and early morning Monday, April 7 (PDT), Coveralls experienced performance degradation and elevated error rates affecting coverage reporting and related functionality across the platform. This culminated in a database overload and service outage from approximately 10:45 PM to 1:15 AM PDT.

Impact

  • Duration: Approximately 8 hours of compounding service degradation, culminating in a full outage from ~10:45 PM Sun to 1:15 AM Mon PDT
  • Affected Systems: Coverage uploads, report generation, dashboard access, notifications delivery
  • Customer Impact: Delays in CI workflows, missing or partial coverage reports, stalled PR checks

Root Cause: Systemic Degradation

Roughly six weeks before this incident, we began seeing signs of database degradation — slower queries, rising table bloat, and an alarming acceleration in overall database size. Upon investigation, we realized that routine PostgreSQL maintenance (VACUUM, ANALYZE, autovacuum) — previously effective with our parameter tuning — was no longer working. This marked the beginning of a “runaway cycle,” where dead tuples piled up faster than the system could clear them, causing performance to degrade further, which in turn made optimizations even slower.
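
For anyone looking for the same pattern, a query along these lines (standard PostgreSQL statistics views only, nothing specific to our schema) surfaces tables where dead tuples are piling up faster than autovacuum can clear them:

    -- Tables with the most dead tuples, plus when autovacuum last touched them.
    SELECT relname,
           n_live_tup,
           n_dead_tup,
           round(n_dead_tup * 100.0 / greatest(n_live_tup, 1), 1) AS dead_pct,
           last_autovacuum,
           last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC
    LIMIT 20;

When dead_pct keeps climbing on your largest tables even though last_autovacuum is recent, autovacuum is still running but no longer keeping pace, which is exactly the runaway cycle described above.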

In retrospect, we now understand this shift coincided with our database exceeding what we now recognize to be its practical operational ceiling — a threshold around 50–60% of our 65TB AWS RDS instance size. While well below the official storage limit, this was the point at which PostgreSQL’s internal optimization routines began falling behind our write-heavy workload. Once we crossed it, most of the practices we had relied on for years to manage data growth and performance simply stopped working.
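
For reference, tracking that footprint doesn't require anything exotic; a query along these lines (built-in catalogs only) lists the largest tables with their indexes and TOAST data included:

    -- Ten largest tables by total on-disk size (indexes and TOAST included).
    SELECT n.nspname,
           c.relname,
           pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema')
    ORDER BY pg_total_relation_size(c.oid) DESC
    LIMIT 10;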

In response, we began scheduling emergency optimization efforts across successive weekend maintenance windows. These included offline cleanup routines targeting bloated tables and long-running transactions, along with backups and preparatory work to reorganize our largest table partitions. The goal was to reduce database size while preserving historical data — and to free up space for optimizations that could no longer run reliably at our current scale. Unfortunately, one after another, those operations failed to complete in the allotted time — leaving us with fewer and fewer options as our database continued creeping toward the 65TB physical limit, and with it, the risk of out-of-disk errors and potential service failure.
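
Part of that cleanup work is simply knowing where the long-running transactions are, since an open transaction pins the oldest row version VACUUM is allowed to remove. A minimal check (pg_stat_activity only; the 10-minute threshold is arbitrary) looks like this:

    -- Transactions open longer than 10 minutes. These hold back the xmin
    -- horizon, so VACUUM cannot reclaim dead tuples newer than them.
    SELECT pid,
           usename,
           state,
           now() - xact_start AS xact_age,
           left(query, 80)    AS current_query
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > interval '10 minutes'
    ORDER BY xact_start;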

Root Cause: Immediate Incident Trigger

The direct cause of the outage on April 6–7 was our decision to let three long-running VACUUM FULL operations continue on legacy partitioned tables after the end of our scheduled maintenance window. We had successfully reclaimed ~20TB of disk space earlier that day and expected the remaining operations — anticipated to reclaim an additional ~20TB — to finish within 1–2 hours based on size comparisons and previous completion times.
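
For context, VACUUM FULL rewrites a table into a new file to reclaim dead space, and it holds an ACCESS EXCLUSIVE lock on that table for its entire runtime; nothing else can touch the table until it completes. The operations themselves are nothing more elaborate than this (table names are illustrative, not our actual schema):

    -- Illustrative only: table names are hypothetical.
    -- Each statement rewrites the whole table and holds an ACCESS EXCLUSIVE
    -- lock until it finishes, blocking all other access to that table.
    VACUUM (FULL, VERBOSE) legacy_builds_2018;
    VACUUM (FULL, VERBOSE) legacy_source_files_2018;
    VACUUM (FULL, VERBOSE) legacy_coverage_2018;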

Believing the risk was minimal — given the tables contained historical build data no longer accessed by active jobs or user queries — we reopened the application to production traffic. However, what we failed to anticipate was that:

  • Background reconciliation jobs still touched those locked tables indirectly
  • PostgreSQL system processes attempted metadata access that also triggered waits
  • These queries quickly piled up behind table-level locks, consuming all available connections

This led to total connection saturation, application failures, and a full outage until the vacuum operations were forcibly terminated.

Outage Event: April 6–7 (10:45 PM – 1:15 AM PDT)

To reclaim space and restore performance, we began by decommissioning a set of legacy partitions containing historical build data no longer accessed by active workflows — a necessary step to reclaim ~20TB of space and allow critical optimization routines to proceed. We then launched nine parallel VACUUM FULL operations on bloat-heavy legacy tables. As the maintenance window ended, three of those operations were still running — expected to finish soon.

Rather than risk losing the disk space gains by canceling mid-operation, we opted to let them finish. We added app-level protections to exclude the old data from queries and ran QA tests to ensure performance remained unaffected. Everything appeared stable: queries ran normally, monitoring showed no signs of blockage, and one engineer stayed online for two extra hours to confirm system health before calling it a night.

At 10:45 PM, the site began returning 500 errors. Alerts indicated multiple servers were down, and the database was rejecting new connections with “Out of connections: Connections reserved for superuser.” RDS showed over 1,400 active connections — far beyond any level we’d previously seen in production. Most were tied up in blocked transactions: system-level activity and background reconciliation jobs that became stuck waiting on table locks held by the ongoing VACUUM FULL operations.
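
A lock-wait query of this general shape (pg_blocking_pids is available in PostgreSQL 9.6 and later; this is a sketch, not a transcript of our exact diagnostics) is what makes that kind of pile-up visible:

    -- Sessions currently waiting on another backend, paired with the blocker.
    SELECT waiting.pid              AS blocked_pid,
           left(waiting.query, 60)  AS blocked_query,
           blocker.pid              AS blocking_pid,
           left(blocker.query, 60)  AS blocking_query
    FROM pg_stat_activity AS waiting
    JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid) ON true
    JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid;

In our case, the blocking sessions were the three VACUUM FULL operations, and the blocked list was nearly everything else.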

We:

  • Entered Maintenance Mode
  • Killed all background jobs and blocked system queries
  • Canceled the remaining VACUUM FULL tasks

This cleared the backlog and allowed the system to recover. The site was brought back online at 1:15 AM PDT and resumed normal operation.
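
In PostgreSQL terms, the recovery maps to something like the following (a sketch of the mechanism, not the exact commands we ran):

    -- Cancel any VACUUM FULL still running.
    SELECT pg_cancel_backend(pid)
    FROM pg_stat_activity
    WHERE query ILIKE 'VACUUM%FULL%';

    -- Terminate sessions still stuck waiting on locks so the
    -- connection pool can drain and new connections can be accepted.
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE wait_event_type = 'Lock';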

All remained well through 8:00 AM Monday PDT — the beginning of our regular Monday morning traffic spike. The system stayed stable and error-free under close watch. While a few queries ran slower than usual, autovacuum had resumed, and performance steadily improved over the following days as vacuum thresholds were tuned.
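
"Tuning vacuum thresholds" here mostly means per-table storage parameters. As an illustration only (table name and values are hypothetical), pushing autovacuum to run far earlier on a very large table looks like this:

    -- Hypothetical example. The default scale factor of 0.2 means autovacuum
    -- waits for ~20% of a table to be dead rows; on a huge table that is
    -- hundreds of millions of rows. Lowering it trades rare, giant vacuums
    -- for frequent, small ones.
    ALTER TABLE builds SET (
        autovacuum_vacuum_scale_factor  = 0.01,  -- trigger at ~1% dead rows
        autovacuum_analyze_scale_factor = 0.02,
        autovacuum_vacuum_cost_limit    = 2000   -- let each run do more work
    );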

Lessons Learned

  • Legacy tables still matter: Even unused partitions can be accessed by background queries or internal PostgreSQL processes
  • Manual testing isn’t enough: Latent issues from locked tables can take hours to appear under load
  • VACUUM FULL must never run in production again: It will only ever be used in offline maintenance contexts, and only when it can complete fully before reopening the application to production traffic (We feel PostgreSQL DBAs nodding slowly with eyes closed at this one)
  • We didn’t realize we’d crossed the Rubicon: We now understand that for the way we were managing our schema, the real ceiling came far earlier than expected, at around 50–60% of our RDS instance capacity. That’s where PostgreSQL’s behavior began to change, and where the practices we’d long relied on started to break down.
  • Our tables are too big: Our current manual partitioning scheme, organized by time, was not granular enough to prevent tables from growing to unmanageable sizes. Going forward, we’ll need to rethink partitioning entirely, possibly by table size, data volume, or much finer time intervals (see the sketch after this list).
  • We need to tier old data: Partitioning alone isn’t enough. We need to offload older data to a different database instance or long-term storage tier — one designed for low-frequency access but with preserved integrity.
  • We can't store everything in one place forever: Coveralls has been storing every coverage report for every file in every commit for every repo it tracks for over 13 years, all in a single database instance. That approach has hit its limit. Thirteen years of history now stands as a badge of our endurance, and as a reminder of the architecture we must leave behind.
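
As a concrete illustration of the finer-grained partitioning mentioned above, PostgreSQL's declarative partitioning lets us cap the size of any one partition by narrowing the range it covers. The schema below is hypothetical, not our actual one:

    -- Hypothetical schema: monthly range partitions instead of one giant table.
    CREATE TABLE coverage_reports (
        id         bigint      NOT NULL,
        repo_id    bigint      NOT NULL,
        payload    jsonb,
        created_at timestamptz NOT NULL
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE coverage_reports_2025_04
        PARTITION OF coverage_reports
        FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');

    CREATE TABLE coverage_reports_2025_05
        PARTITION OF coverage_reports
        FOR VALUES FROM ('2025-05-01') TO ('2025-06-01');

Smaller partitions keep routine VACUUM and reindex work short, and detaching or dropping an old partition frees its space immediately, without rewriting any data.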

Remediation & Recovery

  • All stuck optimization jobs were terminated
  • Background workers and queues were paused and restarted safely
  • Emergency disk space was reclaimed and preserved
  • Monitoring was expanded to detect and alert on blocked queries, autovacuum status, and connection saturation

Action Items

Short-Term

  • Establish a hard guardrail against closing a maintenance window with open VACUUM FULL operations
  • Restrict application access to legacy tables not actively used in production workflows
  • Implement dashboards and alerts for autovacuum lag, blocked queries, and idle-in-transaction states
  • Tighten timeout thresholds to prevent query pile-ups during unexpected contention (illustrative settings sketched below)
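
The timeout work amounts to standard PostgreSQL settings; the values and database name below are placeholders, not our production configuration:

    -- Illustrative values only; the database name is hypothetical.
    ALTER DATABASE coveralls_production SET statement_timeout = '30s';
    ALTER DATABASE coveralls_production SET lock_timeout = '5s';
    ALTER DATABASE coveralls_production SET idle_in_transaction_session_timeout = '60s';

Of the three, lock_timeout is the one most relevant to this incident: queries waiting behind the VACUUM FULL locks would have errored out quickly rather than sitting on connections indefinitely.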

Long-Term

  • Complete staged migration to our new schema and infrastructure platform
  • Refactor partitioning to better manage table size and maintain performance
  • Build system-level resilience against lock pile-ups with improved observability, timeout strategy, and alerting
  • Design and implement a tiering strategy for long-term historical coverage data

Conclusion

This incident was a turning point. It taught us that the scale we’re operating at now requires a different mindset — one grounded in constraints we hadn’t fully appreciated before.

When faced with a problem of scale and an urgent need to transition our infrastructure, we thought we had a clear destination in mind and rushed to get there — only to find the tracks couldn’t support the weight we were carrying. Now we understand this is a longer journey than we expected. We may not see the final station yet, but we know how to move forward: one stop at a time — making sure each station is stable, sustainable, and ready to serve our customers well before moving to the next.

Thanks

As mentioned above, no one on staff here is a PostgreSQL expert — and this incident forced us to learn more, and sooner, than we ever expected about running PostgreSQL at scale.

We’d like to express our thanks to the many contributors and educators who share their hard-won PostgreSQL knowledge. Without their guidance, we couldn’t have learned what we now know, or recognized how much more there is to learn.

Posted Apr 11, 2025 - 15:16 PDT

Resolved

This incident has been resolved.
Posted Apr 07, 2025 - 02:14 PDT

Update

We are continuing to monitor for any further issues.
Posted Apr 07, 2025 - 02:01 PDT

Update

We are continuing to monitor for any further issues.
Posted Apr 07, 2025 - 01:52 PDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 07, 2025 - 01:39 PDT

Update

We are continuing to work on a fix for this issue.
Posted Apr 07, 2025 - 00:48 PDT

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 07, 2025 - 00:05 PDT

Update

We are continuing to investigate this issue.
Posted Apr 06, 2025 - 22:37 PDT

Investigating

We are currently investigating this issue.
Posted Apr 06, 2025 - 22:37 PDT
This incident affected: Coveralls.io Web and Coveralls.io API.