Python Workspaces Fail to Run in North Europe stack

Since 2021-03-25 12:30 UTC we have been experiencing failures when starting Python and Python MLflow workspaces in the North Europe stack. We are working on resolving the situation and will keep you posted.

UPDATE 13:30 UTC: We identified the source of the problem as a shortage of hardware resources in the North Europe Azure datacenter. We are working on switching to another type of compute instance that is not affected by this shortage.

UPDATE 13:55 UTC: We have switched the infrastructure to another type of compute instance and the workspaces are running again. We are still monitoring the situation.

Job performance degradation in EU and US stacks

We are investigating slower job performance in the EU AWS and US AWS stacks since Sunday night (2021-03-21). We are seeing the issue only in particular projects and for a limited number of jobs; most jobs are not affected.

We are still investigating the causes. If you are experiencing a substantial slowdown of job processing in your project, please contact us via support in your project.

UPDATE 20:28 UTC: We have restarted all job worker nodes in the US AWS and EU AWS stacks and are monitoring the results. This should have no impact on currently running jobs.

UPDATE March 23, 2021, 10:00 UTC: After the restart, all systems are operating normally and we don't see any unusual delays in running jobs. We will continue to monitor the situation, but unless it escalates we consider this incident resolved.

Authentication errors across multiple Microsoft services

March 15, 2021, 20:16 UTC We are seeing authentication errors in multiple Microsoft and Azure related services. This affects all stacks and all regions. In AWS regions only some component jobs are affected (mainly the PowerBI writer, OneDrive Excel Sheets writer, and OneDrive Excel Sheets extractor); in Azure regions more services may be affected.

You may see application errors or increased job running times.

See https://status.azure.com/en-us/status for more details.

Next update in 60 minutes.

UPDATE March 15, 2021, 21:44 UTC - Engineers at Azure are currently rolling out mitigation worldwide. Full mitigation expected within 60 minutes.

UPDATE March 16, 2021, 0:16 UTC - Engineers at Azure have rolled out a fix to all affected regions for Azure AD. Internal telemetry and customer reports suggest that the error rate for dependent services is rapidly decreasing. Microsoft services are in the process of recovery; some services may take varying amounts of time to recover following the underlying fix. The next update will be provided at 8:00 UTC.

UPDATE March 16, 2021, 7:52 UTC - The issue related to Azure Active Directory has been resolved. Azure is now reporting problems with authentication to the Storage Service, but we don't see any impact on Keboola Connection. The next update will be provided at 12:00 UTC.

UPDATE March 16, 2021, 11:23 UTC - The issues with Azure services have been resolved. This incident is now resolved.

Orchestration Notifications incident

2021-01-28 at 17:03 UTC We're investigating an issue with Orchestration notifications in the EU and US AWS stacks. You may not be receiving notifications for failed or long-running orchestrations.

2021-01-28 at 17:39 UTC Notifications are being sent again.

2021-01-28 at 18:30 UTC The issue is resolved. There were a total of 11 affected orchestrations. We sent tickets to all affected projects (unless the given orchestration was failing regularly).


Introducing Keboola Community!

We are proud to announce that we have just launched Keboola Community!

Keboola Community is the place where you can find all feature announcements, share advice, submit feedback, and contribute to discussions about Keboola. So from now on, you can find all news about Keboola on the community.keboola.com website.

This page (status.keboola.com) will continue to provide updates about the performance status and ongoing incidents of the Keboola Connection platform.

We are looking forward to seeing you on both websites!

Post-mortem: Failing transformation in EU region

This is a post-mortem of the Failing transformation in EU region incident. 

We received a root-cause analysis from Snowflake and learned that the problem was caused by a DNS resolution issue. The nodes could not reach the NTP service, causing some clock skew and failed jobs for several Keboola customers. Snowflake replaced the nodes that were affected and resolved the issue. 

We learned that even the jobs that did not fail might have produced incorrect results due to a few seconds of time skew. Server-time-related functions (current_timestamp(), current_time() and current_date()) returned values that were off by a few seconds from the actual time. The data would only be affected if the query modifying it used the above-mentioned functions.
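
To illustrate the difference (with a hypothetical table and column names, not taken from any actual project), a query that writes the server time into the data would store the skewed values, while a query that modifies data without using these functions would not be affected:

    -- Hypothetical example: the skewed server time is written into the data
    UPDATE orders
    SET processed_at = CURRENT_TIMESTAMP()  -- value could be off by a few seconds
    WHERE status = 'processed';

    -- Not affected: no server-time function is used in the modification
    UPDATE orders
    SET status = 'archived'
    WHERE created_at < '2021-01-01';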

We're working on a report of affected queries internally. You can contact us via standard support channels if you think your queries are using server time in a way in which a few seconds of skew would affect the results. We'll work with you on checking the queries as soon as we have the report ready.

Investigating decreased Snowflake performance

As of January 15th we have been experiencing decreased Snowflake performance at peak times, resulting in slower query execution or even query timeouts.

We are actively working on this case with Snowflake support and will have a further update at 14:00 UTC.

We are sorry for the inconvenience.

UPDATE January 19, 14:00 UTC - Snowflake support is continuing a deep investigation of the problem. The next update should come on January 20 around 8:30 UTC.

UPDATE January 20, 8:30 UTC - We have increased the Snowflake warehouse size to compensate for the longer-running queries. We are still waiting for an update from Snowflake support.
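
For context, resizing a Snowflake warehouse is a single statement; the sketch below shows the kind of change applied (the warehouse name and sizes are illustrative only, not our actual configuration):

    -- Illustrative only: temporarily bump the warehouse one size up
    ALTER WAREHOUSE keboola_wh SET WAREHOUSE_SIZE = 'LARGE';

    -- Later, return it to its original size
    ALTER WAREHOUSE keboola_wh SET WAREHOUSE_SIZE = 'MEDIUM';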

UPDATE January 20, 14:00 UTC - Snowflake is raising the priority of the issue internally; they have acknowledged that something is wrong with the duration of some queries. The next update will come tomorrow (January 21) morning.

UPDATE January 21, 13:00 UTC - Snowflake engineering has made some changes in our account (around 4 AM UTC) which they believe should help get performance back on track. We will continue monitoring the case for 24 hours to see whether the changes really helped.

UPDATE January 22, 9:00 UTC - Both Snowflake engineering and we at Keboola can confirm that average query times have decreased dramatically over the last 24 hours. It appears the problem has been resolved by Snowflake. We have brought our Snowflake warehouse back to its original size, with a boost window between 00:00 and 06:30. We will continue to monitor the case for another 24 hours.