Postmortem: Degraded Snowflake Performance & Failed Jobs

Summary

Over the past few months, we have had a number of problems with the Snowflake backend in both the US and EU regions. These were caused by several loosely connected issues, so we've decided to publish a joint post-mortem.

We first noticed rare query failures (about one failed query in a million) during January. It was unclear what was causing these randomly appearing errors, so we kept investigating. On January 6th, we saw a sharp increase in this error type (hundreds in a million) and began debugging it with Snowflake. From that point on, we saw a steady increase in the errors, interleaved with days when they didn't occur at all. This made debugging the root cause quite challenging on both our side and Snowflake's.

In our attempt to resolve the issue, we tried updating the ODBC drivers to different versions as advised. Unfortunately, the new drivers suffered from regressions (since fixed under the references SNOW-148261 and SNOW-150687), which led to even more errors. On February 27th, Snowflake engineering found that the problem was actually related to the Snowflake Cloud Service Layer and the number of roles we have in our account. In their attempt to resolve the issue, they introduced changes to their service. On March 3rd, this resolved the failing queries but caused slowdowns in the Cloud Service Layer. We suffered a particularly severe slowdown on March 3rd (EU) and then milder slowdowns on April 8th (EU) and April 23rd (EU); all of these had the same root cause. We were hit by further slowdowns on April 24th in the US and on May 7th in the EU. These last two had somewhat different root causes and are described below in more detail.

What Happened?

The obvious questions that everyone (including us) asks are: "Whose fault is this?"; "What's wrong with Keboola Connection?"; "Were we the only customer affected?" It turns out that the unfortunate events were caused by a combination of multiple factors.

At the base of the pyramid is how we use Snowflake. Our usage pattern is in many respects atypical, but it is what we need for the defining features of Keboola Connection such as reliability, repeatability and auditability. The two most important characteristics (for the events described here) are a high number of queries and frequent changes to database roles. Together they produce a very high load on the Snowflake Cloud Service Layer (CSL), which is responsible for processing every query and figuring out its permissions. This unusual load puts strain on unexpected parts of the Snowflake environment, and at some points we're pushing its limits.
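To give a concrete (and heavily simplified) picture of this usage pattern, the sketch below shows the kind of non-warehouse (DDL and DCL) traffic that a single job can generate: creating a dedicated role and schema, granting privileges, and dropping both again when the job finishes. All identifiers and credentials are placeholders, and the code illustrates the pattern only; it is not our actual implementation.

    # Simplified illustration of per-job role and schema churn (placeholder names).
    # None of these statements touch a warehouse; all of them are handled by the
    # Snowflake Cloud Service Layer.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345.eu-central-1",  # placeholder account locator
        user="KEBOOLA_SERVICE",          # placeholder service user
        password="***",
    )
    cur = conn.cursor()

    # Set up an isolated workspace for the job.
    cur.execute("CREATE ROLE IF NOT EXISTS WORKSPACE_123_ROLE")
    cur.execute("CREATE SCHEMA IF NOT EXISTS PROJECT_DB.WORKSPACE_123")
    cur.execute("GRANT USAGE ON DATABASE PROJECT_DB TO ROLE WORKSPACE_123_ROLE")
    cur.execute("GRANT ALL ON SCHEMA PROJECT_DB.WORKSPACE_123 TO ROLE WORKSPACE_123_ROLE")
    cur.execute("GRANT ROLE WORKSPACE_123_ROLE TO USER WORKSPACE_123_USER")

    # ... the job then runs its actual data queries in a warehouse ...

    # Tear the workspace down again, invalidating cached permission metadata.
    cur.execute("DROP SCHEMA IF EXISTS PROJECT_DB.WORKSPACE_123")
    cur.execute("DROP ROLE IF EXISTS WORKSPACE_123_ROLE")

Each of these role changes also invalidates permission metadata that the CSL caches, which is described in more detail below.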

The problem is far more complex, though. The number of queries is one thing, but the load they create on the Snowflake CSL is proportional to the complexity of the permissions they interact with. It is therefore the combination of factors – the number of queries, the roles they use, the types of queries and the state of the environment (load from other users, the fraction of queries going to the warehouse, latency in the CSL, queuing of queries and the overhead associated with queuing) – that creates the load. This is why some projects have been more affected than others. Projects running large numbers of small jobs cause disproportionately higher load (more queries, more permission manipulations, fewer actual computations). They are also affected more, because even small delays are noticeable in short jobs.

This explains why we were seemingly the only customer affected. We were not. When queries were failing, the rate was one in a million at the beginning and one in ten thousand at its worst. This kind of error rate is completely unnoticeable outside a highly automated and audited environment. That is why an end user querying the same Snowflake warehouse from a Looker or Tableau dashboard could see no problem while, at the same time, seeing failed jobs in Keboola Connection. The same applies to the later slowdown incidents. For example, at one point all DCL queries took over 500 milliseconds instead of the usual 100 milliseconds. This is hardly noticeable to most customers, but it has a huge impact on the speed of Keboola Connection jobs, especially the short ones. These are also the reasons why the incidents are not mentioned on Snowflake's status page: while they were not limited to us, the impact on most other customers was not large enough to cross the necessary threshold.
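To make the impact of a few hundred extra milliseconds per query more tangible, here is a back-of-the-envelope calculation. The numbers are illustrative assumptions, not measurements from any specific project:

    # Illustrative only: how a per-query CSL slowdown compounds for a short job.
    DCL_QUERIES_PER_JOB = 60   # assumed number of non-warehouse (metadata) queries in a short job
    NORMAL_LATENCY_S = 0.1     # ~100 ms per DCL query under normal conditions
    DEGRADED_LATENCY_S = 0.5   # ~500 ms per DCL query during the slowdown
    WAREHOUSE_TIME_S = 30      # assumed time spent on the actual data processing

    normal = WAREHOUSE_TIME_S + DCL_QUERIES_PER_JOB * NORMAL_LATENCY_S      # 36 s
    degraded = WAREHOUSE_TIME_S + DCL_QUERIES_PER_JOB * DEGRADED_LATENCY_S  # 60 s
    print(f"{normal:.0f} s -> {degraded:.0f} s ({degraded / normal:.1f}x slower)")

A long-running transformation barely notices the same overhead, which is why short jobs were hit hardest.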

The multiple ODBC driver regressions also affected mainly us, because we upgraded the drivers hastily, as soon as the updates were published, in an attempt to resolve the original issue. While we were not the only customer using them, we ran many millions of queries through them within a few days. Customers not suffering from the CSL problems kept using the older drivers and were not affected by the regressions.

In more technical detail, a number of operations contributed to the incidents. When we perform an operation on Storage, we have to establish a connection to the Snowflake database, and the database needs to evaluate the permissions of the connecting role. This is done in the Snowflake CSL, which takes care of processing queries that do not operate on data (DDL and DCL) and do not use a warehouse.

When the issue first appeared, the CSL was dropping queries whenever it ran out of resources. The underlying cause is that we have a complex permission system which we change often, invalidating a cache on which CSL performance relies. Nearly every connection therefore needs to re-evaluate the permission tree of the connecting role. When Snowflake fixed this so that queries were no longer dropped, another problem emerged: the simple fact that some DCL queries took close to a second instead of milliseconds caused serious slowdowns in job processing. The slowdown of each query was proportional to the size of the permission settings (mainly the number of Storage workspaces) and the amount of traffic in the project. At some point, the slowdown was so severe that queries were waiting seconds just to be accepted by the CSL.
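One simple way to observe this kind of degradation is to time a metadata-only statement that never reaches a warehouse, so the whole round trip is spent in the CSL. The following is a minimal sketch with placeholder connection details and thresholds, not a description of our production monitoring:

    # Minimal CSL latency probe (placeholder credentials and threshold).
    import time
    import snowflake.connector

    def csl_latency_ms(conn) -> float:
        """Time a query that is served entirely by the Cloud Service Layer."""
        cur = conn.cursor()
        start = time.monotonic()
        cur.execute("SHOW ROLES")  # metadata-only, no warehouse involved
        cur.fetchall()
        return (time.monotonic() - start) * 1000

    conn = snowflake.connector.connect(
        account="xy12345.eu-central-1",  # placeholder account locator
        user="MONITORING_USER",          # placeholder credentials
        password="***",
    )

    latency = csl_latency_ms(conn)
    if latency > 500:                    # ~100 ms is typical under normal conditions
        print(f"CSL latency degraded: {latency:.0f} ms")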

The CSL also prepares the queries for each warehouse. Our application is "CSL intensive," which means that we are affected by even small performance degradations of the CSL (even if they are barely noticeable for other Snowflake customers). This is what happened in the last two incidents described below. 

Apart from all this, we were also hit by a number of smaller issues (e.g. login failure) which are completely unrelated – they were just strokes of bad luck.

US Incident on April 24th

On 2020-04-24 at 9:12 UTC, we noticed reduced performance of a Snowflake warehouse in our US region and opened a ticket with Snowflake. At 12:00 UTC, the US warehouse started queuing queries at the usual Friday peak time. What seemed like a normal peak, which usually lasts a couple of hours, turned into an overloaded warehouse where queries were executing slower and slower. Multiple attempts to scale up the warehouse didn't help, so we escalated the ticket with Snowflake. We had to stop executing jobs to pause the load on the warehouse and give it time to recover. Snowflake engineering then boosted the resources in their CSL to prevent the issue from recurring.

Multiple factors contributed to the incident. The performance of the CSL was worse than usual that day; this was noticeable, but not enough to trigger an alarm on the Snowflake side. It was combined with slightly higher load from our side and the fact that the Snowflake CSL cannot be scaled by boosting the warehouse. At some point, the warehouse reached a state where so many queries were queued that the CSL spent more time requeuing them than actually executing them.

EU Incident on May 7th

On 2020-05-07 at 8:30 UTC, we noticed reduced performance of a Snowflake warehouse in our EU region. Since we had already encountered a similar issue in the US region, we immediately took steps to reduce the load and avoid overloading the warehouse in the first place. This led to longer waiting times for jobs, but it allowed us to keep executing jobs throughout the incident. We raised the issue with Snowflake and, once they'd discovered the root cause, they applied a fix that resolved it. The root cause was an uneven distribution of queries in the CSL, which led to an overload and subsequent crash of the underlying machine. With the uneven distribution bug in place, there was not enough computational power in the part of the CSL allocated to us. While the root cause differs from that of the US incident, the symptoms were the same, and so were the reasons why this wasn't a platform-wide incident on Snowflake.

What Are We Doing About This?

First, we're working intensively with Snowflake. During the past few months, both we and Snowflake have learned to measure, detect and ideally avoid this kind of incident. We have both improved our processes for handling CSL issues. While it took more than a month to resolve the first problem, it took us only two hours to resolve the last incident. We both went down the long path of discovering, debugging and untangling a complex issue and we both gained valuable knowledge, albeit at a high price.

We're engaged in discussions with Snowflake engineering to better understand the implications of each other's design decisions. We have learned a lot about what internal limits we are nearing and what we should do to avoid exceeding them, and Snowflake engineering now understands our usage pattern and is taking steps to keep the CSL more stable. In the long term, we're working on adjusting our design and usage patterns to better match how Snowflake is set up, without changing the way Keboola Connection works for you. In the short term, we've updated our maintenance procedures so that we can detect these issues earlier and act more quickly should something similar reoccur. On their side, Snowflake have added additional resources to the CSL and improved monitoring to prevent these issues from occurring again, and in the long term they are aiming to make the Cloud Service Layer more scalable.

We've already taken a number of small steps; specifically:

  • We found a bug in a transformation service that caused some roles to be left over. This is already fixed and the number of unused workspaces is slowly decreasing.

  • We'll proceed to clear the rest of the unused database roles in a one-time cleanup. This, along with the previous step, should improve CSL performance on the most affected projects.

  • We've agreed with Snowflake on changes to ODBC driver management to minimize the impact of any future regressions.

  • We're currently checking whether we can implement changes to our usage pattern, as suggested by Snowflake. 

To be absolutely honest, we can't say that the problem is solved, but we now understand the causes and how to mitigate them. There is still a lot of technical work ahead of us. However, we are confident that, if the incidents recur, we can manage them with less and less impact until they are no longer noticeable to you. We're really sorry that we haven't recently delivered the performance you are used to. We have all hands on deck, though, to prepare and deliver a permanent fix as soon as possible. In the current hard times, patience is scarce, but we hope you will bear with us for a bit longer as we tackle the work needed.


Postmortem: Incident with Snowflake in the US Region

Summary

On April 14 between 19:58 and 21:23 UTC, the US Snowflake backend became unavailable. All jobs working with a Snowflake database failed with an internal error. Logging into workspaces was not possible either.

What Happened?

On April 14, Snowflake deployed a new release with an issue in the authentication process. This made it impossible to create new database sessions for the affected accounts. The release was rolled out gradually, which is why only some accounts were affected. Snowflake then rolled the release back.

What Are We Doing About This?

We are terribly sorry, but we can't really do anything. This is out of our hands.

Detailed explanation from Snowflake

When a user tries to authenticate, the Snowflake cloud service layer creates a session object that lists all the roles for the user. As this amounted to a large number in the Keboola account, it exposed a resource leak in our 4.12 release that resulted in users not being able to log in.

Other customers were not impacted as their role hierarchy did not trigger the same code path.

As an immediate remediation, Snowflake rolled back the affected release and disabled the code path, which was protected by a parameter.

As part of the post-mortem, a test that better captures this role configuration was added to our test suite. Additionally, logging was put in place to make this type of corner case easier to detect and diagnose.


MySQL Extractor errors

Today we released a new version of the MySQL extractor that contained a bug.

It caused errors in the UI:

Decoding JSON response from component failed: Syntax error

It also affected jobs of this extractor. Although the jobs seemed to finish successfully, they didn't process any data.

The flawed version was released at 12:14 and reverted at 15:09 CET.


UPDATE:

Another version, deployed on May 12th at 9:25, introduced another bug, which affected certain queries and resulted in the error message:

DB query failed: Trying to access array offset on value of type null

We reverted this release today, May 13th, at 11:26.

We sincerely apologize for the errors. A postmortem report will follow with further details.

Snowflake Slowdown in the EU Region

7 May 2020 8:30 UTC We're seeing higher load and longer execution times for EU Snowflake queries. We have added more compute capacity and are investigating the causes. Next update in two hours.

7 May 2020 9:50 UTC Performance should be back to normal; we're monitoring the situation.

7 May 2020 11:00 UTC We're again seeing slower execution; we're working with Snowflake on resolving the issue. Next update in two hours.

7 May 2020 12:13 UTC Snowflake engineering identified the cause of the reduced performance, and we're now processing the backlog. There are still some queued orchestrations, but the run times of individual jobs are back to normal. Both we and Snowflake engineering are monitoring the load. Next update in two hours.

The incident is resolved.


Weeks in review -- April 2020

New Changes in the UI

  • Transformation script editing can now be done in fullscreen mode.

Normal mode:

Fullscreen mode:


  • Database writers now have newly improved input mappings


  • The shared bucket detail now shows who shared it (if applicable)


  • And the sandbox modals have been cleaned up:

New Components:

  • Active Campaign: Use this component to gather information on your campaigns from your Active Campaign account.


Updated Components:

  • MySQL extractor now properly handles utf8mb4 emojis 
  • Data Warehouse Manager now allows password reset for schema users

TLS Security Update

As of May 12, 2020, Transport Layer Security (TLS) 1.0 and 1.1 will no longer be supported for securing connections to Keboola Connection endpoints.

The vast majority of HTTPS connections made to KBC endpoints use TLS 1.2 and will not be affected. This includes every currently shipping browser used by KBC users. 
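If you want to double-check which TLS version a client or integration of yours negotiates, a quick check such as the one below will tell you. The hostname is an example; use the endpoint your project connects to:

    # Print the TLS version negotiated with an endpoint (example hostname).
    import socket
    import ssl

    host = "connection.keboola.com"  # replace with your project's endpoint
    context = ssl.create_default_context()

    with socket.create_connection((host, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            print(tls.version())  # e.g. "TLSv1.2" or "TLSv1.3"

Anything reporting "TLSv1" or "TLSv1.1" will stop working after the cut-off date.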

We have separately contacted all affected projects. If you did not hear from us, no action is required.

If you have any questions or concerns related to this announcement, please don’t hesitate to contact us.

Snowflake Slowdown in the US Region

Friday, 24 April 2020 14:42 UTC We're seeing higher load and longer execution times for US Snowflake queries. We have added more compute capacity and are investigating the causes. Next update in two hours.

Update 18:16 UTC: We're still seeing degraded Snowflake performance in the US region and are investigating with Snowflake support. Next update in 2 hours.

Update 20:22 UTC: We are working with Snowflake on reducing the queuing in our warehouse. We had to pause job execution at 20:00 UTC to reduce the influx of queries. When the queue is worked through, we'll re-enable the jobs.

Update 20:51 UTC: We re-enabled the paused job queue with limited throughput and are monitoring the Snowflake queue closely. So far, we see no queuing. Next update in 2 hours.

Update 22:21 UTC: The job queue is running at full capacity and there are no queries waiting in the Snowflake warehouse. Preliminary analysis suggests that the issue was probably caused by congestion in Snowflake's Cloud Service Layer, but it took the Snowflake team some time to find the root cause and fix it. Some jobs were delayed and some queries timed out, resulting in job failures; those jobs will need to be restarted. We're sorry for the problems this might have caused.

Snowflake Slowdown in EU

Monday, 20 April 2020 07:39:02 UTC: We're seeing degraded Snowflake performance in the EU region; we're investigating the cause with Snowflake. Next update in 1 hour.

Update 08:17:25 UTC: We have added more computing power and average running times are back to normal. We're still seeing occasional isolated queries that take longer. We're still working with Snowflake on identifying and resolving the issue, but Keboola Connection is stable now. Next update in 4 hours.

Update 11:31:30 UTC: We still observe a slight slowdown in some queries, while others run smoothly. Our analytics suggest that job run times are not affected, as we've offset the slowdown with more computing power. Next update in 4 hours.

Update 15:33:10 UTC: No significant changes; the situation is stable but not resolved. Snowflake is working on identifying the source of the performance issues. We're monitoring the situation, and in case of significant slowdowns we'll offset them with more computational power. Next update tomorrow, or earlier if there are any changes.

Update 21 April 2020: The situation is stable; we're working with Snowflake on maintaining stability.

Update 22 April 2020: Snowflake engineers improved the performance of the impacted queries, and together we're working on preventing this in the future. We consider the incident closed. A postmortem will be published once the root cause is fully understood.

Snowflake Job Delays in the US Region

In the early morning, Snowflake had an incident in their US West region, which caused a large backlog of job processing in Keboola's US region. All jobs were eventually processed, but they may have taken much longer than you normally experience.

The buildup in our queue began just before 2:00 AM CEST and started to ease after 4:30 AM CEST.

Please refer to the above link for further information, and we will add a link to the RCA when it becomes available.

Transformation failures - Post-Mortem

Summary

Between March 30, 20:58 UTC and March 31, 6:15 UTC, some transformation jobs failed with an internal error. About 2% of all transformation jobs were affected. We sincerely apologize for this incident.

What Happened?

On March 30 at 20:58 UTC, we deployed a new version of the Transformation service which contained updated Snowflake ODBC drivers. The update was required by Snowflake as a security patch. Unfortunately, the new version of the driver contained a critical bug that caused the driver to crash when some queries ran longer than one hour. This led to failed transformation jobs.

What Are We Doing About This?

We now treat all driver updates as major updates. This means they go through more careful deployment and monitoring so that we can detect possible problems faster. In the long term, we're working with Snowflake to update drivers in a more controlled manner.