The root cause of the backed-up Web and API requests was slow reads against the source_files table in our database (our largest table), which were in turn caused by a long-running database maintenance task.
While that task (“repacking” the source_files table) had been planned for and started over the previous weekend, it unexpectedly ran well into the week. After seeing normal site behavior on Monday and Tuesday, we decided to let the procedure continue because of its importance to overall database performance. We believe that when we hit our weekly usage peak (Wednesday and Thursday), even though the maintenance task was nearly complete, the database was overwhelmed by read requests against the table while the maintenance activity held transaction locks on the relevant rows.
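For readers curious about the mechanics: the sketch below shows one way to surface this kind of lock contention. It assumes a PostgreSQL database (the specifics of our setup aren't covered in this writeup), and the connection string and table filter are placeholders.

```python
# Sketch: list reader queries stuck waiting on locks, and which backend is
# blocking them. Assumes PostgreSQL 9.6+ (for pg_blocking_pids) and psycopg2;
# the DSN and the source_files filter are placeholders.
import psycopg2

BLOCKED_READERS_SQL = """
SELECT blocked.pid,
       blocked.query                    AS blocked_query,
       blocking.pid                     AS blocking_pid,
       blocking.query                   AS blocking_query,
       now() - blocked.query_start      AS waiting_for
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock'
  AND blocked.query ILIKE '%source_files%';
"""

def report_blocked_readers(dsn: str = "dbname=app") -> None:
    """Print each blocked reader, how long it has waited, and who blocks it."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOCKED_READERS_SQL)
        for pid, _blocked_q, blocking_pid, _blocking_q, waiting in cur.fetchall():
            print(f"pid {pid} has waited {waiting} behind pid {blocking_pid}")

if __name__ == "__main__":
    report_blocked_readers()
```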
In addition to the temporary measure of restricting reads from our Web app, we have curtailed all maintenance activity against the table until we can guarantee that tasks will complete within our normal maintenance windows (late evenings and weekends, PDT).
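As an illustration of what “completing within a window” can mean in practice, a maintenance job can be run under a wrapper that enforces a hard deadline. The sketch below is simplified, and the command it runs is hypothetical rather than our actual maintenance job.

```python
# Sketch: run a maintenance command with a hard deadline so it cannot run
# past the end of the window. The command and window length are placeholders.
import subprocess

def run_with_deadline(cmd: list[str], max_seconds: int) -> bool:
    """Return True if the command finished within the window, else stop it."""
    try:
        subprocess.run(cmd, check=True, timeout=max_seconds)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process once the timeout expires.
        return False

if __name__ == "__main__":
    # e.g. a six-hour late-evening window; the script name is hypothetical.
    finished = run_with_deadline(["bash", "maintenance_task.sh"], 6 * 60 * 60)
    print("completed within window" if finished else "stopped at window end")
```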
We’ve also identified a longer-term solution involving a different approach to partitioning tables; that work will take 1-2 weeks and is planned for later this month.
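As a rough sketch of what a partitioned layout can look like (this is not our final design), declarative range partitioning splits a large table so that maintenance and large scans touch one partition at a time. The example below assumes PostgreSQL 11 or later, and every table, column, and range value is illustrative.

```python
# Sketch: create a range-partitioned replacement table. Assumes PostgreSQL 11+
# (partitioned-table primary keys) and psycopg2; all names are illustrative.
import psycopg2

DDL = [
    """
    CREATE TABLE source_files_partitioned (
        id          bigint      NOT NULL,
        created_at  timestamptz NOT NULL,
        path        text        NOT NULL,
        PRIMARY KEY (id, created_at)
    ) PARTITION BY RANGE (created_at);
    """,
    """
    CREATE TABLE source_files_2024_q3 PARTITION OF source_files_partitioned
        FOR VALUES FROM ('2024-07-01') TO ('2024-10-01');
    """,
    """
    CREATE TABLE source_files_2024_q4 PARTITION OF source_files_partitioned
        FOR VALUES FROM ('2024-10-01') TO ('2025-01-01');
    """,
]

def apply_partitioning(dsn: str = "dbname=app") -> None:
    """Create the partitioned parent table and its initial partitions."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for statement in DDL:
            cur.execute(statement)

if __name__ == "__main__":
    apply_partitioning()
```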