Orchestration Notifications incident

2021-01-28 at 17:03 UTC We're investigating an issue with Orchestration notifications in EU and US AWS stacks. You may not be receiving notifications for failed or long-running orchestrations.

2021-01-28 at 17:39 UTC Notifications are being sent again.

2021-01-28 at 18:30 UTC The issue is resolved. There were a total of 11 affected orchestrations. We sent tickets to all affected projects (unless the given orchestration was failing regularly).


Introducing Keboola Community!

We are proud to announce that we have just launched Keboola Community!

Keboola Community is the place where you can find all feature announcements, share advice, submit feedback, and contribute to discussions about Keboola. So from now on, you can find all news about Keboola on the community.keboola.com website.

This page (status.keboola.com) will still be providing updates about the performance status and ongoing incidents of the Keboola Connection platform.

We are looking forward to seeing you on both websites!

Post-mortem: Failing transformation in EU region

This is a post-mortem of the "Failing transformation in EU region" incident.

We received a root-cause analysis from Snowflake and learned that the problem was caused by a DNS resolution issue. The nodes could not reach the NTP service, causing some clock skew and failed jobs for several Keboola customers. Snowflake replaced the nodes that were affected and resolved the issue. 

We learned that even jobs that did not fail might have produced incorrect results because of a few seconds of clock skew. Server-time-related functions (current_timestamp(), current_time() and current_date()) returned values that were off by a few seconds from the actual time. Data would only be affected if the query modifying it used one of the above-mentioned functions.

We're working on a report of affected queries internally. You can contact us via standard support channels if you think your queries are using server time in a way in which a few seconds of skew would affect the results. We'll work with you on checking the queries as soon as we have the report ready.
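
If you would like to pre-check your own queries in the meantime, the following is a minimal sketch in Python, not the tooling we use for the internal report. It only flags SQL text that mentions one of the three server-time functions listed above. The helper name find_server_time_functions and the sample queries are hypothetical, and the simple text match will also flag occurrences inside comments or string literals, so treat a hit as a prompt for manual review.

```python
import re
from typing import List

# The server-time functions named in the post-mortem above.
SERVER_TIME_FUNCTIONS = ("current_timestamp", "current_time", "current_date")

# Match each function name as a whole word, case-insensitively,
# with or without trailing parentheses.
_PATTERN = re.compile(
    r"\b(" + "|".join(SERVER_TIME_FUNCTIONS) + r")\b",
    re.IGNORECASE,
)


def find_server_time_functions(sql: str) -> List[str]:
    """Return the sorted, lowercase names of server-time functions found in sql.

    Hypothetical helper for illustration only: it does a plain text match,
    so names inside comments or string literals are reported as well.
    """
    return sorted({match.group(1).lower() for match in _PATTERN.finditer(sql)})


if __name__ == "__main__":
    # Made-up example queries, keyed by an illustrative name.
    queries = {
        "load_orders": "INSERT INTO orders_log SELECT id, CURRENT_TIMESTAMP() FROM orders",
        "copy_orders": "SELECT id, amount FROM orders",
    }
    for name, sql in queries.items():
        hits = find_server_time_functions(sql)
        if hits:
            print(f"{name}: uses {', '.join(hits)}")
```

A query that is flagged is not necessarily affected; it matters only if a few seconds of skew in the returned value would change the data the query writes.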

Investigating decreased Snowflake performance

Since January 15th, we have been experiencing decreased Snowflake performance at peak times, resulting in slower query execution or even query timeouts.

We are actively working on this case with Snowflake support and we shall have a further update at 14:00 UTC.

We are sorry for the inconvenience.

UPDATE 19.1. 14:00 UTC - Snowflake support is continuing with a deep investigation of the problem. Next update should be 20.1. around 8:30 UTC.

UPDATE 20.1. 8:30 UTC - We have boosted the Snowflake warehouse size to compensate for longer-running queries. We are still waiting for an update from Snowflake support.

UPDATE 20.1. 14:00 UTC - Snowflake is raising the priority of the issue internally; they have acknowledged that there is something wrong with the duration of some queries. Next update will be tomorrow (21.1.) morning.

UPDATE 21.1. 13:00 UTC - Snowflake engineering has made some changes (around 4AM UTC) in our account, which they believe should help get performance back on track. We will continue monitoring the case for 24 hours, to see if the changes really helped.

UPDATE 22.1. 9:00 UTC - Both Snowflake engineering and we at Keboola can confirm that average query times have decreased dramatically over the last 24 hours. It seems that the problem has been resolved by Snowflake. We have brought our Snowflake warehouse back to its original size, with a boost window between 00:00 and 06:30. We will continue to monitor the case for another 24 hours.

Developer Portal Maintenance Announcement

Maintenance of Keboola Developer Portal (https://components.keboola.com) will take place on Saturday, Jan 16, 2021, at 07:00 UTC and should take less than an hour.

During the maintenance, developers won't be able to access the Developer Portal frontend. Connection and component execution won't be affected.

Update 7:45 UTC: The maintenance finished successfully and Developer Portal is up and fully operational.

[Resolved] Failing transformation in EU region

Since last night (2021-01-11) we have been experiencing failing transformations with the error "Table already exists in workspace". We continue investigating and will keep you posted.

UPDATE 09:30 UTC - We have identified a possible root cause and escalated the problem to Snowflake, which seems to be the source of the errors. The problem is related to clone table input mapping, so a possible workaround is to switch to standard copy table input mapping for the time being.

UPDATE 10:30 UTC - The issue still persists; we are waiting for findings from Snowflake.

UPDATE 11:15 UTC - We're in close communication with Snowflake support and have identified a potential cause of the issue, which has now been shared with their engineering team. Next update in 30 minutes.

UPDATE 11:45 UTC - We are waiting for an update from the Snowflake engineering team. The issue seems to affect only clone mapping queries. We suspect that it might be related to a particular unhealthy server somewhere in the Snowflake infrastructure that the query is assigned to. So while we wait for more information from Snowflake, our suggested workaround is to try running the affected job/query again if possible. Next update in 30 minutes.

UPDATE 12:15 UTC - We are still waiting for an update from the Snowflake engineering team. Next update in 30 minutes.

UPDATE 12:45 UTC - Snowflake informed us that they are still investigating the issue with their engineering team and will keep us informed. Next update in 30 minutes.

UPDATE 13:15 UTC - Snowflake informed us that they turned off the unhealthy server and asked us to test it. Our test transformations work, so the issue should be resolved now. We are waiting for Snowflake to confirm that it is resolved. However, we encourage you to rerun your failed transformations.

UPDATE 14:10 UTC - Snowflake confirmed that the issue should be resolved and is continuing to investigate the root cause. After more extensive testing, we consider this resolved, which means all transformations in EU should succeed without any slowdown or intermittent failures.

UPDATE 2021-01-19 11:18 UTC - We published a post-mortem for this incident.

Orchestration and transformation errors in EU AWS region

Dec 26, 2020 19:32 - We are investigating a higher error rate of orchestrations and transformations in the AWS EU region. Next update in 30 minutes.

Dec 26, 2020 19:40 - We've identified an issue on one of the worker nodes and triggered its replacement. Everything should be back to normal. We'll keep monitoring our platform closely.

Dec 26, 2020 20:08 - The incident is now resolved. We apologize for the inconvenience caused by this incident.