On Sunday, September 15th at 01:32 UTC, orchestrator and other component jobs began failing in the EU region. Over the following hours, our worker servers were unable to keep up with the workload, and the job backlog grew. We resolved the incident manually, and the platform was fully operational with a clean backlog at 08:26 UTC.
One of the MySQL instances was automatically restarted and patched on September 15th at 01:32 UTC.
This instance provides the lock mechanism for job processing and also stores the queue configuration for the worker servers. The two minutes of database downtime caused the jobs running at that moment to fail. Additionally, the running workers were unable to fetch the queue information; some of them exhausted their restart attempts and stopped. With only half of the processing capacity left, the workload could not be processed.
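The failure mode described above, where workers give up after a fixed number of restart attempts during an outage that lasts longer than their retries, can be sketched as follows. This is a minimal illustration, not our actual worker code; the names `fetch_queue_config` and `start_worker` and the retry parameters are hypothetical:

```python
import time

def fetch_queue_config(attempt, outage_attempts=3):
    """Hypothetical stand-in for the workers' call to the database.

    Fails for the first `outage_attempts` attempts to simulate the
    roughly two-minute instance downtime, then succeeds.
    """
    if attempt < outage_attempts:
        raise ConnectionError("MySQL instance unavailable")
    return {"queues": ["default", "priority"]}

def start_worker(max_restarts=2, base_delay=0.0):
    """Retry fetching the queue configuration up to `max_restarts` times.

    With a low restart cap, a worker gives up during an outage that
    outlasts its retries -- the failure mode seen in this incident.
    """
    for attempt in range(max_restarts + 1):
        try:
            return fetch_queue_config(attempt)
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff

    return None  # all retries exhausted: the worker stops

# A cap of 2 restarts (3 attempts) does not outlast the simulated outage:
assert start_worker(max_restarts=2) is None
# A higher cap lets the worker survive the downtime and resume:
assert start_worker(max_restarts=5) == {"queues": ["default", "priority"]}
```

The point of the sketch is that a hard cap on restart attempts turns a short, recoverable outage into a permanent loss of worker capacity; an unbounded retry loop with backoff would have let the workers reconnect once the instance came back.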
Once we discovered the incident, we replaced all of our worker servers and added extra capacity to clear the backlog faster.
What are we doing about this?
We have implemented notifications about upcoming instance patches and will perform future updates only during scheduled, announced maintenance windows.
We are also working on a completely new job processing and scheduling mechanism that will prevent similar issues from occurring down the road. We sincerely apologize for the inconvenience caused.