Job failures in AWS EU stack

2023-02-20 15:20 UTC - Between Feb 19 15:10 UTC and Feb 20 14:00 UTC, a small number of jobs on the connection.eu-central-1.keboola.com stack either ended in a timeout or failed with the error message "Component terminated. Possibly due to out of memory error". The failures were caused by an underlying node failure. We're actively investigating and taking measures to prevent this from happening again.

2023-02-20 15:56 UTC - The incident has been resolved; the last occurrence of the error was on Feb 20 at 14:35 UTC. We are continuing to monitor the situation closely to prevent any recurrence.

Failing jobs on all stacks

2023-02-10 09:20 UTC - We are currently investigating failing jobs on all stacks, an issue that began on 2023-02-09 at 08:48 UTC. The failures manifest as the error message "K8S request has failed: events is forbidden: User "system:serviceaccount:job-queue-jobs:daemon-service-account" cannot list resource "events" in API group "" in the namespace "job-queue-jobs"".

UPDATE 09:41 UTC: We have identified the problem and rolled back to the previous version of our service. All services are now operating normally.

UPDATE 10:35 UTC: After deeper investigation, we found that this problem affected only a small fraction of jobs.
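For context, the error message above indicates a missing Kubernetes RBAC permission: the service account was not allowed to list Event resources in its namespace. The permission it was denied corresponds to an RBAC rule along the lines of the following sketch (the resource, verb, service account, and namespace are taken from the error message; the Role and RoleBinding names are illustrative, and this is not Keboola's actual manifest):

```yaml
# Illustrative RBAC sketch only -- names "event-reader" and
# "event-reader-binding" are hypothetical.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: event-reader           # illustrative name
  namespace: job-queue-jobs
rules:
- apiGroups: [""]              # core API group, as quoted in the error
  resources: ["events"]
  verbs: ["list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: event-reader-binding   # illustrative name
  namespace: job-queue-jobs
subjects:
- kind: ServiceAccount
  name: daemon-service-account
  namespace: job-queue-jobs
roleRef:
  kind: Role
  name: event-reader
  apiGroup: rbac.authorization.k8s.io
```

A rollback of the service (as described in the update above) restores the matching deployment and RBAC configuration, which is consistent with the observed recovery.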

We're sorry for this inconvenience. 

Storage jobs restarts

2023-02-09 10:07 - We are currently investigating storage job restarts that occurred on 2023-02-09 07:35 UTC and 2023-02-07 08:04 UTC. These restarts have caused longer job run times or errors such as "table already exists" during transformation executions. We will provide another update when new information is available.

2023-02-09 10:57 - We have identified the root cause. We will deploy a fix within two hours, which might cause another occurrence of these restarts for some jobs.

2023-02-09 13:53 - We deployed a fix at 13:20 UTC; the deployment itself caused the final occurrences of the restarts. The issue is now resolved, and you should not experience any further job restarts.

Templates & Keboola CLI errors

10:50 UTC Due to recent changes in the Storage API, the Templates API and the Keboola CLI have been returning errors in multiple situations since approximately 9:00 UTC. As a result, you might see unexpected errors when working with the Keboola CLI or when applying templates. We're working on a fix, which is expected to be released today (ETA 15:00 UTC).

13:05 UTC The issue in the Storage API has been fixed. All services are now operating normally. We apologize for any inconvenience this may have caused.

Service disruption in Azure and Snowflake (in Azure regions)

Azure and Snowflake (in Azure regions) are reporting general service disruptions. We are closely monitoring the situation; so far, we have observed only a few symptoms of the issues, and platform operations have not been impacted. Please refer to the status updates of the affected services for more information.

We're sorry for this inconvenience. 

Delayed jobs on Azure North Europe stack

2023-01-23 21:50 UTC We're investigating increased job wait times in the Azure North Europe stack (connection.north-europe.azure.keboola.com). Next update in 15 minutes or when new information is available.

2023-01-23 22:10 UTC The root cause was fixed and all operations are back to normal.

Increased job wait times in AWS US and EU stack

We're investigating increased job wait times in the AWS US stack (connection.keboola.com) and the AWS EU stack (connection.eu-central-1.keboola.com). Next update in 15 minutes or when new information is available.

UPDATE 12:55 UTC: We have identified the problem and rolled back to the previous version of our service.

UPDATE 13:05 UTC: All services are now operating normally.

Increased error rate in AWS US stack

We're investigating an increased error rate in the AWS US stack (connection.keboola.com). Next update in 15 minutes or when new information is available.

UPDATE 04:40 UTC: We have identified and replaced a number of corrupted nodes with healthy ones, and operations are now back to normal. We apologize for the inconvenience caused.

UPDATE 05:40 UTC: This issue appears to be ongoing, and a new symptom has been identified: jobs are taking longer to start than usual, or are getting stuck in a waiting state. The next update will be in 30 minutes.

UPDATE 06:25 UTC: We're still investigating the issue. Next update in 30 minutes.

UPDATE 07:15 UTC: We have found the root cause and we're fixing it. 

UPDATE 08:44 UTC: The root cause was fixed and all operations are back to normal.