Date of Incident: April 6–7, 2025
Published: April 11, 2025
This postmortem outlines a recent incident that affected performance and stability for some of our customers. In the spirit of transparency and continuous improvement, we’re sharing what happened, what we learned, and how we’re moving forward.
While our team has deep experience managing production infrastructure, we aren’t full-time PostgreSQL specialists. We’re full-stack developers — and this incident pushed us deeper into the internals of our database than we’d ever had to go before. It revealed blind spots, forced urgent decisions, and ultimately gave us a deeper understanding of what it will take to make Coveralls more resilient going forward.
One of the nice things about developing an application for developers is that when we share our experience with technical challenges and lessons learned, we know many of our customers will relate.
Between Sunday, April 6 (PDT), and early morning Monday, April 7 (PDT), Coveralls experienced performance degradation and elevated error rates affecting coverage reporting and related functionality across the platform. This culminated in a database overload and service outage from approximately 10:45 PM to 1:15 AM PDT.
Roughly six weeks before this incident, we began seeing signs of database degradation — slower queries, rising table bloat, and an alarming acceleration in overall database size. Upon investigation, we realized that routine PostgreSQL maintenance (VACUUM, ANALYZE, autovacuum) — previously effective with our parameter tuning — was no longer working. This marked the beginning of a “runaway cycle,” where dead tuples piled up faster than the system could clear them, causing performance to degrade further, which in turn made optimizations even slower.
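For readers who want to check their own databases for the same pattern, PostgreSQL's pg_stat_user_tables view shows how many dead tuples each table is carrying and when autovacuum last got to it. This is a generic sketch, not our exact monitoring query:

```sql
-- Tables carrying the most dead tuples, with the last (auto)vacuum timestamps.
-- A generic health check, not our production monitoring query.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 20;
```

When dead_pct keeps climbing on your busiest tables while last_autovacuum stops advancing, you are watching the same runaway cycle described above.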
In retrospect, this shift coincided with our database exceeding what we now recognize as its practical operational ceiling: a threshold around 50–60% of our 65TB AWS RDS instance's allocated storage. While well below the official storage limit, this was the point at which PostgreSQL's internal optimization routines began falling behind our write-heavy workload. Once we crossed it, most of the practices we had relied on for years to manage data growth and performance simply stopped working.
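Tracking growth against a ceiling like that is straightforward in PostgreSQL. Here is a sketch of the kind of size check that would have flagged the trend earlier; the relation filter and row limit are arbitrary choices:

```sql
-- Total database size, then the largest relations contributing to it.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind IN ('r', 'm')  -- ordinary and materialized tables; partition children appear individually
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;
```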
In response, we began scheduling emergency optimization efforts across successive weekend maintenance windows. These included offline cleanup routines targeting bloated tables and long-running transactions, along with backups and preparatory work to reorganize our largest table partitions. The goal was to reduce database size while preserving historical data — and to free up space for optimizations that could no longer run reliably at our current scale. Unfortunately, one after another, those operations failed to complete in the allotted time — leaving us with fewer and fewer options as our database continued creeping toward the 65TB physical limit, and with it, the risk of out-of-disk errors and potential service failure.
The direct cause of the outage on April 6–7 was our decision to let three long-running VACUUM FULL operations continue on legacy partitioned tables after the end of our scheduled maintenance window. We had successfully reclaimed ~20TB of disk space earlier that day and expected the remaining operations — anticipated to reclaim an additional ~20TB — to finish within 1–2 hours based on size comparisons and previous completion times.
Believing the risk was minimal — given the tables contained historical build data no longer accessed by active jobs or user queries — we reopened the application to production traffic. What we failed to anticipate was that VACUUM FULL holds an exclusive lock on each table for the entire operation, and that system-level activity and background reconciliation jobs would still touch those tables, stacking up behind the locks and consuming connections. This led to total connection saturation, application failures, and a full outage until the vacuum operations were forcibly terminated.
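For context on why this failure mode is so abrupt: VACUUM FULL rewrites the table under an ACCESS EXCLUSIVE lock, which conflicts with every other lock mode, so even a plain SELECT against that table queues behind it, and every queued session holds a connection. A query along these lines (using pg_blocking_pids, available since PostgreSQL 9.6) shows the pile-up; it is a general sketch rather than the exact tooling we used that night:

```sql
-- Who is waiting, and which session is blocking them.
-- Run while the system is wedged to find the backends at the root of the queue.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;
```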
To reclaim space and restore performance, we began by decommissioning a set of legacy partitions containing historical build data no longer accessed by active workflows — a necessary step that freed ~20TB of space and allowed critical optimization routines to proceed. We then launched nine parallel VACUUM FULL operations on bloat-heavy legacy tables. As the maintenance window ended, three of those operations were still running — expected to finish soon.
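For readers unfamiliar with decommissioning partitions, the mechanics look roughly like the following. The table names are hypothetical, not our actual schema, and DETACH PARTITION ... CONCURRENTLY requires PostgreSQL 14 or newer:

```sql
-- Detach a legacy partition so queries against the parent no longer see it,
-- then drop it to return its disk space. Table names are hypothetical.
ALTER TABLE builds DETACH PARTITION builds_2019 CONCURRENTLY;
DROP TABLE builds_2019;
```

Unlike VACUUM FULL, dropping a detached partition returns its space to the filesystem almost immediately and holds heavyweight locks only briefly.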
Rather than risk losing the disk space gains by canceling mid-operation, we opted to let them finish. We added app-level protections to exclude the old data from queries and ran QA tests to ensure performance remained unaffected. Everything appeared stable: queries ran normally, monitoring showed no signs of blockage, and one engineer stayed online for two extra hours to confirm system health before calling it a night.
At 10:45 PM, the site began returning 500 errors. Alerts indicated multiple servers were down, and the database was rejecting new connections with “Out of connections: Connections reserved for superuser.” RDS showed over 1,400 active connections — far beyond any level we’d previously seen in production. Most were tied up in blocked transactions: system-level activity and background reconciliation jobs that became stuck waiting on table locks held by the ongoing VACUUM FULL operations.
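Connection saturation like this is visible directly in pg_stat_activity. Here is a rough sketch of the kind of breakdown we now check, rather than a literal transcript of that night's debugging:

```sql
-- How the connection slots are being spent, grouped by state and wait type,
-- for comparison against the server's max_connections setting.
SELECT state,
       wait_event_type,
       count(*) AS connections
FROM pg_stat_activity
GROUP BY state, wait_event_type
ORDER BY connections DESC;

SHOW max_connections;
```

In a situation like the one described above, the dominant bucket is sessions waiting on Lock events, all queued behind the vacuum operations.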
We identified the blocking VACUUM FULL tasks and forcibly terminated them. This cleared the backlog and allowed the system to recover. The site was brought back online at 1:15 AM PDT and resumed normal operation.
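Mechanically, forcibly terminating the vacuum tasks comes down to PostgreSQL's pg_terminate_backend(). The filter below is a simplified illustration; the actual backends were identified and killed interactively:

```sql
-- Terminate any backend still running a VACUUM FULL. Terminating it rolls the
-- rewrite back, so the space that unfinished operation would have reclaimed is lost.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE query ILIKE 'vacuum full%'
  AND pid <> pg_backend_pid();
```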
All remained well through 8:00 AM Monday PDT — the beginning of our regular Monday morning traffic spike. The system stayed stable and error-free under close watch. While a few queries ran slower than usual, autovacuum had resumed, and performance steadily improved over the following days as vacuum thresholds were tuned.
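The threshold tuning mentioned above happens per table in PostgreSQL. Here is a hedged example of the kind of adjustment involved; the table name and the numbers are illustrative, not our production settings:

```sql
-- Trigger autovacuum on this (hypothetical) write-heavy table after roughly
-- 0.5% of its rows are dead, instead of the default 20%.
ALTER TABLE builds SET (
  autovacuum_vacuum_scale_factor = 0.005,
  autovacuum_vacuum_threshold    = 10000
);
```

On very large tables, the default 20% scale factor can mean hundreds of millions of dead rows accumulate before autovacuum even starts.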
Our most important takeaway: VACUUM FULL must never run in production again. It will only ever be used in offline maintenance contexts, and only when it can complete fully before reopening the application to production traffic. (We feel PostgreSQL DBAs nodding slowly with eyes closed at this one.)

Short-Term

- Keep VACUUM FULL operations strictly inside maintenance windows that allow them to finish before traffic resumes
- Add monitoring and alerting for autovacuum lag, blocked queries, and idle-in-transaction states (a starting point is sketched below)

Long-Term

- Continue the staged work of reducing database size and reorganizing our largest table partitions so routine maintenance can keep pace with our write-heavy workload
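As a starting point for the monitoring item above, a query along these lines flags the session states we now watch for. The five-minute cutoff is an arbitrary example, and PostgreSQL's idle_in_transaction_session_timeout setting can serve as a server-side backstop alongside it:

```sql
-- Sessions idle in an open transaction, or blocked on a lock, for more than
-- five minutes. The cutoff is an arbitrary example threshold.
SELECT pid,
       state,
       wait_event_type,
       now() - xact_start AS transaction_age,
       left(query, 80)    AS query
FROM pg_stat_activity
WHERE (state = 'idle in transaction' AND now() - xact_start  > interval '5 minutes')
   OR (wait_event_type = 'Lock'      AND now() - query_start > interval '5 minutes');
```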
This incident was a turning point. It taught us that the scale we’re operating at now requires a different mindset — one grounded in constraints we hadn’t fully appreciated before.
When faced with a problem of scale and an urgent need to transition our infrastructure, we thought we had a clear destination in mind and rushed to get there — only to find the tracks couldn’t support the weight we were carrying. Now we understand this is a longer journey than we expected. We may not see the final station yet, but we know how to move forward: one stop at a time — making sure each station is stable, sustainable, and ready to serve our customers well before moving to the next.
As mentioned above, no one on staff here is a PostgreSQL expert — and this incident forced us to learn more, and sooner, than we ever expected about running PostgreSQL at scale.
We’d like to express our thanks to the many contributors and educators who share their hard-won PostgreSQL knowledge. Without their guidance, we couldn’t have learned what we now know, or recognized how much more there is to learn.
- Chelsea Dole, of Citadel (formerly of Brex)
- Peter Geoghegan, PostgreSQL contributor
- Michael Christofides and Nikolay Samokhvalov, of Postgres.fm