We had a couple of servers lose comms last night for unknown reasons. Between 4pm and 6pm, two servers ran at over 90% CPU utilization until they lost comms. This means the jobs on those servers would have failed due to timeouts and been retried. This morning there were about 300 jobs still in the queue, which represents a fraction of a percent of the jobs processed yesterday. We think this translates to less than 1% of repos affected, by which we mean repos with longer-than-average build times. If you think you were affected, let us know at firstname.lastname@example.org and we can verify.
As of 8am this morning, all components are operating normally, so no further effects are expected.
Posted Aug 12, 2022 - 11:08 PDT
This incident affected: Coveralls.io Web and Coveralls.io API.