2022-04-11 15:04 UTC - We are investigating transient delays in job processing. The issue manifests as a two-hour gap without any activity in job events. It occurs randomly across projects and configurations, with most occurrences around 04:00 UTC. Only jobs running on the new queue are affected. We are investigating the issue; the next update will follow in three hours or when new information is available.
2022-04-11 16:54 UTC - We have increased the minimum number of nodes, which might help prevent the issue from happening again. Meanwhile, we are investigating the root cause of the timeouts. We are also working on decreasing the timeouts from two hours to a much lower value to prevent unnecessary increases in job runtime in case of networking issues. Next update when new information is available.
2022-04-14 12:54 UTC - We have reduced the timeouts from two hours to two minutes. This will prevent a job from getting stuck for such a long time when a connection issue occurs. We are still investigating the root cause of the networking problem. Next update when new information is available.