Storage job failures in the AWS EU stack

We observed an increased number of failed storage jobs, resulting in the error message "Cannot import data from Storage API" in connection.eu-central-1.keboola.com. The main cause has been identified and resolved, and all systems should now be running smoothly. We will continue to monitor the situation, and the next update will be provided in 30 minutes.

We apologize for any inconvenience caused.

UPDATE 7:20 UTC [resolved] All systems are functioning normally, and the incident has been resolved and closed.

New Outbound IP Addresses for Keboola Connection: Last Call

This is a reminder that the deadline to update your whitelist for the new outbound IP addresses is approaching. It is crucial to act before June 30, 2023, to avoid any disruption to your connectivity.

If you are still seeing the following alert in your projects, then you have not yet migrated to the new IP addresses:

Please note that if you have not manually updated your whitelist by the deadline, Keboola will perform the switch globally. This means that your projects will be automatically switched to the new IP addresses after June 30, 2023.

If you have not yet migrated, please follow the actions required as described in the New Outbound IP Addresses announcement.

Stuck storage jobs in Azure North Europe Stack

Today, 16th of June, since 3:03 UTC, jobs are getting stuck on data import and export. This is due to a Snowflake incident in the Azure West Europe region (https://status.snowflake.com/), where the warehouse of the Azure North Europe stack is located.

We are monitoring the Snowflake incident and will keep you updated here.

UPDATE 6:15 UTC - The Snowflake incident is still ongoing, with the last update at 05:28 UTC: "We've identified an issue with a third-party service provider, and we're coordinating with the provider to develop and implement a fix to restore service. We'll provide another update within 60 minutes." The issue is most likely due to a problem in Azure, which has reported an incident in the West Europe region; see https://azure.status.microsoft/en-us/status.

UPDATE 7:00 UTC - We see progress: storage data import/export jobs are being processed. However, the Snowflake incident is still open, and we continue to monitor it.

UPDATE 8:00 UTC [resolved] - Snowflake has resolved the incident, stating: "We've coordinated with our third-party service provider to implement the fix for this issue, and we've monitored the environment to confirm that service was restored. If you experience additional issues or have questions, please open a support case via Snowflake Community." We no longer see any stuck jobs, so we consider the incident resolved on our side as well.

Degraded AWS US/EU Stack (connection.keboola.com, connection.eu-central-1.keboola.com)

2023-06-13 19:40 UTC The components.keboola.com service is degraded due to an incident in the AWS US-EAST-1 Region (https://health.aws.amazon.com/health/status). We are monitoring the situation.

2023-06-13 20:15 UTC The AWS incident is also affecting our OAuth authorization service in the AWS US stack (connection.keboola.com). All components relying on OAuth authorization could be affected and may fail intermittently.

2023-06-13 20:20 UTC We are investigating slower job processing in the AWS US stack (connection.keboola.com).

2023-06-13 20:40 UTC The incident in AWS US-EAST-1 is causing jobs to get stuck on both the AWS US (connection.keboola.com) and AWS EU (connection.eu-central-1.keboola.com) stacks. This includes component jobs and services that run jobs as part of their workflow, such as workspace creation.

2023-06-13 20:55 UTC AWS is reporting the incident as resolved. All services are running normally, but some jobs may still take longer to process due to the large number of jobs waiting in the queue. We are monitoring the situation.

Stuck jobs and failures in AWS EU stack

2023-06-06 02:08 UTC We experienced an incident on connection.eu-central-1.keboola.com. Some jobs ended in error due to an underlying node failure. We're still investigating the root cause.

Update 2023-06-06 02:38 UTC The incident has been resolved. During the incident, a small number of jobs on the connection.eu-central-1.keboola.com stack either timed out or ended with a "Component terminated. Possibly due to out of memory error" error message.

We are continuing to monitor the situation closely to prevent any recurrence.

Platform Update: Transition to Datadog for Platform Logs Monitoring - Vendors Only

Beginning June 1st, 2023, we are transitioning our platform logs monitoring system from Papertrail to Datadog. This is a platform-level change and does not affect user experience or functionality. Regular users are not affected by this change.

For our 3rd party Keboola component vendors, this change modifies the way you receive application error notifications:

  1. Email Notifications Only: Notifications will now be sent exclusively via email. Webhook support may be considered in the future.

  2. Notification Email Address: Vendors previously notified via Papertrail or generic webhook will now receive notifications to the email address specified in their vendor profile. Vendors who were already receiving notifications via email will continue to do so at the same email address.

  3. New Sender Email Address: All notifications will come from alert@dtdg.eu.

If you are a vendor and have any questions or concerns regarding this change, please contact us at support@keboola.com.

Slowdown of processing of jobs on Azure North Europe stack [resolved]

Since 09:39 UTC we're seeing jobs starting with delays on https://connection.north-europe.azure.keboola.com/. We're investigating the situation. Next update in 30 minutes.

UPDATE 10:30 UTC We have found the root cause: new worker nodes fail to authorize when accessing the container registry. We are working on a fix. Next update in 30 minutes.

UPDATE 10:57 UTC The problem with authorization to the container registry is now solved. All systems are now operating normally.

We apologize for any inconvenience caused.

Slowdown of processing of jobs on Azure North Europe stack

Since 13:40 UTC we're seeing jobs starting with delays on https://connection.north-europe.azure.keboola.com/. We're investigating the situation. Next update in 30 minutes.

14:14 UTC - All systems are now operating normally.

If your project ran out of credits and you have automatic top-up enabled, the top-up would have failed between approximately 13:40 and 14:10 UTC. Restarting the job will now trigger the automatic top-up correctly.

We apologize for any inconvenience caused.

Orchestrations not starting on legacy job queue

2023-04-26 11:00 UTC - We have discovered a problem with orchestrations not starting on the legacy queue. We are currently investigating possible causes.

2023-04-26 11:30 UTC - The problem was caused by a release earlier today; as a result, no orchestrations on the legacy queue ran between 08:10 UTC and the rollback. We have rolled back the release, and orchestrations should be functioning properly again as of 11:30 UTC. We apologize for any inconvenience caused.