Snowflake Slowdown in the US Region

Friday, 24 April 2020 14:42 UTC: We're seeing higher load and longer execution times for Snowflake queries in the US region. We have added more compute capacity and are investigating the cause. Next update in two hours.

Update 18:16 UTC: We're still seeing degraded Snowflake performance in the US region and we're investigating with Snowflake support. Next update in 2 hours.

Update 20:22 UTC: We are working with Snowflake on reducing the queueing in our warehouse. We had to pause job execution at 20:00 UTC to reduce the influx of queries. Once the queue is worked through, we'll re-enable the jobs.

Update 20:51 UTC: We re-enabled the paused job queue with limited throughput and we're monitoring the Snowflake queue closely. So far we see no queueing. Next update in 2 hours.

Update 22:21 UTC: The job queue is running at full capacity and there are no queries waiting in the Snowflake warehouse. Preliminary analysis suggests that the issue was probably caused by congestion in Snowflake's Cloud Services Layer, but it took the Snowflake team some time to identify and fix the root cause. Some jobs were delayed and some queries timed out, resulting in job failures. Those jobs will need to be restarted. We're sorry for the problems this might have caused.

Snowflake Slowdown in EU

Monday, 20 April 2020 07:39:02 UTC: We're seeing degraded performance of Snowflake in the EU region; we're investigating the cause with Snowflake. Next update in 1 hour.

Update 08:17:25 UTC: We have added more computing power and the average running times are back to normal. We're still seeing occasional isolated queries that take longer. We're still working with Snowflake on identifying and resolving the issue, but Keboola Connection is stable now. Next update in 4 hours.

Update 11:31:30 UTC: We still observe a slight slowdown in some queries, while others run smoothly. Our analytics suggest that job run times are not affected, as we've offset the slowdown with more computing power. Next update in 4 hours.

Update 15:33:10 UTC: No significant changes; the situation is stable but not resolved. Snowflake is working on identifying the source of the performance issues. We're monitoring the situation, and in case of significant slowdowns we'll offset them with more computing power. Next update tomorrow, or earlier if there are any changes.

Update 21 April 2020: The situation is stable; we're working with Snowflake on maintaining stability.

Update 22 April 2020: Snowflake engineers improved the performance of the impacted queries, and together we're working on preventing this in the future. We consider the incident closed. A post-mortem will be published once the root cause is fully understood.

Snowflake Job Delays in the US Region

In the early morning, Snowflake had an incident in their US West region which caused a large backlog of job processing in Keboola's US region. The jobs were all eventually processed, but they may have taken much longer than you would normally experience.

The buildup in our queue began just before 2:00 AM CEST and started to ease after 4:30 AM CEST.

Please refer to the above link for further information, and we will add a link to the RCA when it becomes available.

Transformation failures - Post-Mortem

Summary

Between March 30, 20:58 UTC and March 31, 6:15 UTC, some transformation jobs failed with an internal error. About 2% of all transformation jobs were affected. We sincerely apologize for this incident.

What Happened?

On March 30 at 20:58 UTC, we deployed a new version of the Transformation service which contained updated Snowflake ODBC drivers. The update was enforced by Snowflake as a security patch. Unfortunately, the new version of the driver contained a critical bug that caused it to crash when some queries ran longer than one hour. This led to failed transformation jobs.

What Are We Doing About This?

We now treat all driver updates as major updates. This means they go through more careful deployment and monitoring so that we can detect possible problems faster. In the long term, we're working with Snowflake to update drivers in a more controlled manner.


Incident with Snowflake in the US Region

We are currently investigating an increased error rate from Snowflake in the US region starting at approximately 10:00 PM CEST.

We will update here as soon as we know more.

UPDATE 11:05 PM CEST: We are handling the issue with Snowflake support. So far, all Snowflake operations in the US region seem to be failing. Next update at 11:30 PM or sooner if there is any new information or the situation changes.

UPDATE 11:30 PM CEST: Snowflake rolled back the release they made today and everything has returned to working condition.

UPDATE 12:00 AM CEST: We're very sorry for this inconvenience. The error started at 12:58 PST (19:58 UTC) and lasted until 14:24 PST (21:24 UTC). All new Snowflake connections in the US (including those from your DB clients) were failing during this period.

Unfortunately you will need to restart any failed jobs or orchestrations from this time period.

The EU region was not affected by this issue.

Snowflake Slowdown in EU

A scaling script running at 12:00 AM CEST failed to scale up the Snowflake warehouse in the EU region. All storage and transformation jobs in the EU were affected by this issue and were significantly slower than usual.

To help process the queued load, we scaled up the warehouse at 9:45 AM CEST and will keep it scaled up until all the load is processed.

We're sorry for this inconvenience and we'll be implementing safeguards to prevent this from happening again. 
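
For illustration, the scale-up described above comes down to a single ALTER WAREHOUSE statement. The following is a minimal sketch using the snowflake-connector-python package; the warehouse name, target size, and connection parameters are placeholders, and the post-resize check is only an example of the kind of safeguard we have in mind, not our actual implementation.

import snowflake.connector

TARGET_SIZE = "XLARGE"    # hypothetical target size
WAREHOUSE = "EXAMPLE_WH"  # placeholder warehouse name

# Connection parameters are placeholders; real credentials come from a secrets store.
conn = snowflake.connector.connect(
    account="example_account",
    user="scaling_bot",
    password="***",
)
cur = conn.cursor()
try:
    # Scale the warehouse up; WAIT_FOR_COMPLETION blocks until the resize finishes.
    cur.execute(
        f"ALTER WAREHOUSE {WAREHOUSE} "
        f"SET WAREHOUSE_SIZE = '{TARGET_SIZE}' WAIT_FOR_COMPLETION = TRUE"
    )

    # Safeguard: verify the new size instead of assuming the statement succeeded.
    cur.execute(f"SHOW WAREHOUSES LIKE '{WAREHOUSE}'")
    columns = [col[0].lower() for col in cur.description]
    row = cur.fetchone()
    actual_size = str(row[columns.index("size")])
    if actual_size.replace("-", "").upper() != TARGET_SIZE.replace("-", "").upper():
        raise RuntimeError(f"Warehouse resize did not take effect, size is {actual_size}")
finally:
    cur.close()
    conn.close()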

Degraded Snowflake Performance (EU region) - April 8, 2020

We are investigating decreased performance of Snowflake in the EU region, which has unfortunately recurred after the previous resolution. We are in touch with Snowflake support. Job performance and sandbox loading times may be affected. Next update at 12:30 UTC.

Update 12:30 UTC: We are handling the performance issue with Snowflake support; we've offset the slowdown by scaling up the cluster. We'll have more information in about an hour. We caught the issue early on, so we hope it will have minimal impact on jobs apart from a small slowdown. So far we've seen 3 job failures caused by this across the whole EU region. We'll post another update at 14:30 UTC or sooner if there is any new information or the situation changes.

Update 14:30 UTC: We are still working with Snowflake on resolving the issue. The situation is currently stable and we have not seen any jobs fail since the last update. Our main goal is to mitigate the issue before the midnight job surge. Next update at 18:30 UTC or sooner if there is any new information or the situation changes.

Update 18:15 UTC: We are still working on mitigating the slowdown. We've seen only 3 related job failures since the last update, so we still consider the situation stable. We expect the issue to be resolved within the next hour.

Update 19:30 UTC: We're monitoring the situation; performance has improved and is close to previous values. We should have fresh aggregated monitoring data in approximately 15 minutes, and we expect it to show a complete recovery to standard performance.

Update 20:01 UTC: The issue has been resolved. 

Errors in Generic Extractor Post-Mortem

Summary


On April 4, 2020 at 10:07 UTC, we deployed a version of Generic Extractor which contained a bug.
Some Generic Extractor jobs failed with the following error:

CSV file "XXX" file name is not a valid table identifier, either set output mapping for "XXX" or make sure that the file name is a valid Storage table identifier. 

Generic Extractor was reverted to its previous version at 14:08 UTC. The error affected 10% of all Generic Extractor jobs running during the four-hour period. We sincerely apologize for the trouble this may have caused you.

What Happened?

We changed the output generation rules so that tables are always generated, even if they are empty. Table names are normally generated using the outputBucket setting. However, they can also be derived from undocumented alternative settings via the id or name properties. Unfortunately, the new code did not take the alternative settings into account and failed to generate correct table names.
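
To make the failure mode concrete, here is a hedged Python sketch of the kind of fallback logic that was missed; the key names follow the wording of this post-mortem and the bucket prefix is hypothetical, so this is not the actual Generic Extractor code.

def resolve_output_bucket(config: dict) -> str:
    """Derive the Storage bucket used to build output table names.

    Illustrative sketch only; key names mirror this post-mortem's wording
    and the bucket prefix is hypothetical, not Generic Extractor's source.
    """
    # Documented path: an explicit outputBucket setting.
    if config.get("outputBucket"):
        return config["outputBucket"]

    # Undocumented fallback: derive the bucket from the configuration's
    # id or name property. This is the branch the buggy release skipped,
    # which produced file names that were not valid table identifiers.
    fallback = config.get("id") or config.get("name")
    if fallback:
        return f"in.c-ex-generic-{fallback}"

    raise ValueError("Cannot determine an output bucket for generated tables")


# Documented configuration: resolves via outputBucket.
print(resolve_output_bucket({"outputBucket": "in.c-my-extractor"}))

# Undocumented configuration: resolves via the name property.
print(resolve_output_bucket({"name": "my-extractor"}))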

What Are We Doing About This?

We have extended the tests to cover the undocumented settings, though we recommend you stick with the documented ones.

Errors in Generic Extractor jobs

Today we released a version of Generic Extractor that contained a bug. It caused certain configurations to fail with the following error:

CSV file XXX file name is not a valid table identifier, either set output mapping for XXX or make sure that the file name is a valid Storage table identifier. 

We have reverted the release. We sincerely apologize for the error. We will publish a post-mortem next week.


Orchestrations API increased error rate in EU

We are seeing errors in Orchestrations API responses in the EU region. We are investigating and will share more details here within an hour.

UPDATE Apr 2 11:32 CEST - The errors have stopped occurring. We are monitoring the situation and investigating the root cause.

UPDATE Apr 2 12:05 CEST - We found that the API servers were flooded with unexpected bursts of requests. We have upgraded the infrastructure and will find a way to prevent this situation in the future.