Incident Postmortem: Database Partitioning Bottleneck & Job Backlog
Summary
A database partitioning limitation caused a severe backlog of background jobs, degrading build processing times from Monday, February 10, through Friday, February 14, 2025. The backlog stemmed from excessive autovacuum contention on a high-growth table, which cascaded into failures across job processing, database performance, and monitoring visibility.
Root Cause
Several tables in our production database grow at an accelerated rate. While we employ a partitioning strategy to prevent them from becoming unwieldy, our time-based approach failed to transition a critical table before it reached an unmanageable size.
During investigation, we found:
- The table was in a perpetual state of autovacuum due to an excessive number of dead tuples.
- The sheer volume of dead tuples prevented autovacuum from progressing beyond its heap-scanning phase, causing tuple locks (the diagnostic queries sketched after this list show how both symptoms surface in Postgres's statistics views).
- These locks delayed regular transactions, creating a backlog of pending transactions that worsened over time.
- By late Tuesday, February 11, the backlog had reached a breaking point, causing tens—eventually hundreds—of thousands of jobs to accumulate.
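For readers who want to reproduce this kind of diagnosis, the following is a minimal sketch of the checks involved. It relies only on PostgreSQL's standard pg_stat_user_tables and pg_stat_progress_vacuum views, so no table names from our own schema are assumed:

```sql
-- Dead-tuple buildup per table: a table stuck in "perpetual autovacuum"
-- shows an n_dead_tup count that never meaningfully shrinks between runs.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1), 2) AS dead_ratio,
       last_autovacuum,
       autovacuum_count
FROM   pg_stat_user_tables
ORDER  BY n_dead_tup DESC
LIMIT  10;

-- Live autovacuum progress: a worker pinned at phase = 'scanning heap' with
-- heap_blks_scanned far below heap_blks_total matches the behavior we saw.
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total
FROM   pg_stat_progress_vacuum AS p
JOIN   pg_class AS c ON c.oid = p.relid;
```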
Impact
Background Job Delays
- A significant job queue buildup occurred between February 10 and February 11.
- Clearing the backlog took an additional two days (February 11–13).
- Failed jobs in long-tail retries prolonged the impact for another 24–36 hours.
Monitoring Gaps & Alert Failures
- Average job duration alerts triggered only after the queue size became a critical issue.
- As server load increased, metric collection itself stalled, suppressing the alerts that could have provided earlier intervention signals.
Database & Infrastructure Overload
- Scaling up resources to clear the backlog introduced additional database contention due to high transaction volumes, exacerbating delays.
- The increased database load led to degraded server performance, disconnecting our orchestration layer and APM monitoring.
- This created a self-reinforcing failure loop that required continuous manual intervention from February 12 to February 13.
Resolution
- We transitioned the affected table, significantly relieving the bottleneck.
- We scaled up resources to process the backlog, though this required careful throttling to avoid further database contention.
- On Thursday, February 13, we placed the site into maintenance mode for 30 minutes to reduce load, though restoring full stability ultimately took nearly two hours.
- To prevent immediate re-saturation, we deferred processing of some older jobs to lower-traffic periods (the sketch after this list illustrates the idea).
- By Thursday evening, build times stabilized as overall traffic declined.
- By Friday morning, February 14, all remaining queued jobs had been processed without further intervention.
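The deferral itself is simple in concept: anything that had been sitting in the queue since before the incident window gets pushed out to an overnight slot so that fresh builds are not starved. The sketch below is illustrative only; background_jobs, state, run_at, and enqueued_at are hypothetical names standing in for our actual job-queue schema.

```sql
-- Illustrative only: reschedule jobs queued for more than 24 hours to a
-- low-traffic overnight slot (~3 AM the next day), so current builds run first.
-- The table and column names here are hypothetical, not our actual schema.
UPDATE background_jobs
SET    run_at = date_trunc('day', now()) + interval '1 day 3 hours'
WHERE  state = 'queued'
  AND  enqueued_at < now() - interval '24 hours';
```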
Next Steps
Finalizing Database Transitions
- To fully resolve performance degradation, we transitioned two additional tables closely related to the affected table.
- This was completed during a maintenance window on Saturday, February 15 (8 PM – 11:59 PM PST).
Long-Term Database Optimizations
- We will perform VACUUM FULL on legacy tables to remove ~36B dead tuples and reclaim the disk space they hold (see the sketch after this list).
- Further maintenance windows will be scheduled during late-night weekend hours.
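Because VACUUM FULL rewrites the table into new files and holds an ACCESS EXCLUSIVE lock for the duration, it can only run inside a maintenance window like the ones above. A minimal sketch, using an illustrative table name rather than our actual schema:

```sql
-- VACUUM FULL rewrites the table, discarding dead tuples and returning the
-- reclaimed space to the operating system. It takes an ACCESS EXCLUSIVE lock,
-- which is why it is confined to a maintenance window.
-- ("legacy_builds" is an illustrative name, not our actual schema.)
VACUUM (FULL, VERBOSE, ANALYZE) legacy_builds;

-- Afterwards, confirm the dead-tuple count for the table has dropped.
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'legacy_builds';
```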
Partitioning Strategy Enhancements
- We are evaluating size-based partitioning, or a refined time-based strategy with shorter intervals, to prevent similar issues; a sketch of the range-partitioning mechanics follows below.
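For context on what the time-based option looks like, the sketch below uses PostgreSQL's standard declarative range partitioning with monthly partitions; the table and column names are hypothetical, and a size-based scheme would differ mainly in when new partitions are cut rather than in this DDL.

```sql
-- Illustrative only: declarative range partitioning on a timestamp column,
-- with monthly partitions so no single child table grows unmanageably large.
-- ("job_events" and "created_at" are hypothetical names, not our actual schema.)
CREATE TABLE job_events (
    id         bigserial,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE job_events_2025_02 PARTITION OF job_events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

CREATE TABLE job_events_2025_03 PARTITION OF job_events
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');
```

Autovacuum then works partition by partition, so a stall on one month's data no longer ties up the table as a whole.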
Improved Monitoring & Alerting
- We will introduce earlier warning thresholds to detect job queue buildup before it becomes critical.
- We will enhance database contention monitoring to catch autovacuum failures and lock contention earlier (illustrative checks are sketched below).
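To make the intent concrete, here are the kinds of checks such thresholds could sit on. The queue-depth query assumes a hypothetical job table named background_jobs with a state column; the lock-wait query uses only the standard pg_stat_activity view:

```sql
-- Early warning 1: queue depth. Alert well before the count approaches the
-- hundreds of thousands of jobs seen during this incident.
-- ("background_jobs" and "state" are hypothetical names, not our actual schema.)
SELECT count(*) AS queued_jobs
FROM   background_jobs
WHERE  state = 'queued';

-- Early warning 2: sessions stuck waiting on locks, which is how the
-- autovacuum contention first surfaced. Uses only pg_stat_activity.
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       left(query, 80)     AS query_snippet
FROM   pg_stat_activity
WHERE  wait_event_type = 'Lock'
ORDER  BY query_start;
```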
Conclusion
Even after 12+ years in production, incidents like this remind us of the importance of continually evolving our data management and monitoring practices. As Coveralls scales, we are committed to refining our approach to proactively address infrastructure challenges before they affect users.
We sincerely apologize to all users affected by this incident. If you need assistance with historical builds or workflow adjustments, or if you'd like to share feedback, please contact us at support@coveralls.io. Your input will help us shape future improvements.