Postmortem: Degraded Snowflake Performance & Failed Jobs

Summary

Over the past few months, we have had a number of problems with the Snowflake backend in both the US and the EU regions. These were caused by several loosely connected issues, so we've decided to publish a joint post-mortem.

First, during January we saw rare query failures (about one failed query in a million). It was unclear what was causing these seemingly random errors, so we kept investigating. On January 6th, we saw a sharp increase in this error type (hundreds per million queries) and began debugging it with Snowflake. From that point on, the errors increased steadily, interlaid with days when they didn't occur at all, which made debugging the root cause quite challenging on both our side and Snowflake's.

In our attempts to resolve the issue, we tried updating the ODBC drivers to different versions, as advised. Unfortunately, the new drivers suffered from regressions (now fixed under the references SNOW-148261 and SNOW-150687), which led to even more errors. On February 27th, Snowflake engineering found that the problem was actually related to the Snowflake Cloud Service Layer and the number of roles we have in our account. In their attempt to resolve the issue, they introduced changes to their service. On March 3rd, this resolved the failing queries but caused slowdowns in the Cloud Service Layer. We suffered a particularly severe slowdown on March 3rd (EU) and then milder slowdowns on April 8th (EU) and April 23rd (EU), all with the same root cause. We were hit by further slowdowns on April 24th in the US and on May 8th in the EU. These last two had somewhat different root causes and are described below in more detail.

What Happened?

The obvious questions that everyone asks (including us) are: "Whose fault is this?", "What's wrong with Keboola Connection?" and "Were we the only customer affected?" It turns out that these unfortunate events were caused by a combination of factors.

At the base of the pyramid is how we use Snowflake. Our usage pattern is atypical in many respects, but it is what we need to deliver the defining features of Keboola Connection: reliability, repeatability and auditability. The two most important characteristics (for the events described here) are a high number of queries and frequent changes to database roles. Together, these produce a very high load on the Snowflake Cloud Service Layer (CSL), which is responsible for processing every query and figuring out its permissions. This unusual load puts strain on unexpected parts of the Snowflake environment, and at some points we're pushing its limits.
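
To make this concrete, here is a deliberately simplified sketch of the kind of statements a single short job can generate. It is illustrative only (the role and schema names are made up, and this is not our actual provisioning code), but it shows why almost everything except the one data query is DDL or DCL served purely by the CSL:

```python
# Illustrative only: a hypothetical approximation of the per-job role and
# schema churn described above, not Keboola's actual provisioning code.

def workspace_statements(job_id: str) -> list[str]:
    """Return the Snowflake statements a single short job might issue."""
    role = f"WORKSPACE_ROLE_{job_id}"      # hypothetical naming convention
    schema = f"WORKSPACE_{job_id}"
    return [
        # Everything below except the CTAS is DDL/DCL: it never touches the
        # warehouse and is served entirely by the Cloud Service Layer (CSL),
        # which has to re-evaluate permissions for it.
        f"CREATE ROLE {role};",
        f"CREATE SCHEMA {schema};",
        f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role};",
        f"GRANT CREATE TABLE ON SCHEMA {schema} TO ROLE {role};",
        # The only statement that actually uses warehouse compute:
        f"CREATE TABLE {schema}.OUT AS SELECT * FROM SRC.SMALL_TABLE;",
        # Teardown adds yet more DDL/DCL load on the CSL.
        f"DROP SCHEMA {schema};",
        f"DROP ROLE {role};",
    ]

if __name__ == "__main__":
    stmts = workspace_statements("12345")
    csl_only = [s for s in stmts if not s.startswith("CREATE TABLE ")]
    print(f"{len(csl_only)} of {len(stmts)} statements hit only the CSL")
```

Multiply a pattern like this across many concurrent jobs and the CSL, rather than the warehouses, becomes the busiest part of the system for us.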

The problem is far more complex, though. The number of queries is one thing, but the load they create on the Snowflake CSL is also proportional to the complexity of the permissions they interact with. It is therefore a combination of factors – the number of queries, the roles they use, the types of queries and the state of the environment (load from other users, the fraction of queries going to the warehouse, CSL latency, query queuing and the overhead associated with queuing) – that determines the overall load. This is why some projects have been more affected than others. Projects running large numbers of small jobs cause disproportionately higher load (more queries, more permission manipulations, fewer actual computations). They are also more affected because even small delays are noticeable in short jobs.

This explains why we were seemingly the only customer affected. We were not. When the queries were failing, it was one in a million at the beginning and one in ten thousand at its worst. This kind of error rate is completely unnoticeable except in a highly automated and audited environment. That explains why an end user can query the same Snowflake warehouse from a Looker or Tableau dashboard and see no problem, yet at the same time see failed jobs in Keboola Connection. The same applies to the later slowdown incidents. For example, we had a situation where all DCL queries took over 500 milliseconds instead of the usual 100 milliseconds. This is hardly noticeable to most customers, but it has a huge impact on the speed of Keboola Connection jobs, especially the short ones. These are also the reasons why the incidents are not mentioned on Snowflake's status page: while they were not limited to us, the impact on most other customers was not large enough to cross the necessary threshold.
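
A back-of-the-envelope calculation shows the asymmetry. The job shapes and query counts below are hypothetical, chosen only to illustrate the proportions, but they show why a jump from roughly 100 ms to 500 ms per metadata query is invisible to a dashboard user yet multiplies the runtime of a short job:

```python
# A rough sketch of the effect described above. The job shapes and query
# counts are hypothetical; only the proportions matter.

def job_runtime(metadata_queries: int, compute_seconds: float,
                csl_latency_s: float) -> float:
    """Total job time = warehouse compute + per-query CSL overhead."""
    return compute_seconds + metadata_queries * csl_latency_s

for name, meta, compute in [
    ("short job with many small steps", 40, 5.0),
    ("long transformation", 40, 600.0),
    ("interactive dashboard query", 1, 2.0),
]:
    normal = job_runtime(meta, compute, 0.1)    # ~100 ms per DCL/DDL query
    degraded = job_runtime(meta, compute, 0.5)  # ~500 ms during the incidents
    print(f"{name}: {normal:.1f}s -> {degraded:.1f}s "
          f"({degraded / normal:.2f}x)")
```

The short job slows down almost threefold in this toy example, while the long transformation and the dashboard query barely notice the difference.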

The ODBC driver regressions also affected mainly us, because we upgraded hastily as soon as the new versions were published in an attempt to resolve the original issue. While we were not the only customer using them, we ran many millions of queries through them within a few days. Customers not suffering from the CSL problems kept using the older drivers and were not impacted by the regressions.

In more technical detail, a number of operations contributed to the incidents. Whenever we perform an operation on Storage, we have to establish a connection to the Snowflake database, which then needs to evaluate the permissions of the connecting role. This is done in the Snowflake CSL, which also takes care of processing queries that do not operate on data (DDL and DCL) and do not use the warehouse.

When the issue first appeared, the CSL was dropping queries whenever it ran out of resources. The underlying cause is that we have a complex permission system which we change often, thereby invalidating a cache on which CSL performance relies. Nearly every connection therefore needs to re-evaluate the permission tree of the connecting role. When Snowflake fixed this so that queries were no longer dropped, another problem emerged: the simple fact that some DCL queries took close to a second instead of milliseconds caused serious slowdowns in job processing. The slowdown of each query was proportional to the size of the permission settings (mainly the number of Storage workspaces) and the amount of traffic in the project. At some point, the slowdown was so severe that queries were waiting seconds just to be received by the CSL.
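
The following toy model is not Snowflake's actual implementation, just an illustration of the mechanism as we understand it: when role changes keep invalidating the cached permission trees, almost every new connection becomes an expensive cache miss whose cost grows with the number of roles in the account.

```python
# A toy model of the caching behaviour described above -- not Snowflake's
# actual implementation, only an illustration of why frequent role changes
# turn almost every connection into an expensive cache miss.

class PermissionCache:
    def __init__(self, cost_per_role: float):
        self.cache: dict[str, float] = {}   # role -> cost of the cached evaluation
        self.cost_per_role = cost_per_role

    def invalidate(self) -> None:
        """Any grant/revoke in the account drops the cached trees."""
        self.cache.clear()

    def connect(self, role: str, roles_in_account: int) -> float:
        """Return the simulated cost of evaluating permissions on connect."""
        if role in self.cache:
            return 0.001                     # cheap: reuse the cached tree
        cost = roles_in_account * self.cost_per_role
        self.cache[role] = cost              # expensive: full re-evaluation
        return cost

cache = PermissionCache(cost_per_role=0.0005)
total = 0.0
for job in range(100):
    cache.invalidate()                       # each job creates/drops roles
    total += cache.connect(f"JOB_ROLE_{job}", roles_in_account=2000)
print(f"simulated time spent re-evaluating permissions: {total:.1f}s")
```

With stable roles the same loop would be almost free; with constant churn, the cost scales with both the number of jobs and the size of the permission tree, which matches the proportionality described above.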

The CSL also prepares the queries for each warehouse. Our application is "CSL intensive," which means that we are affected by even small performance degradations of the CSL (even if they are barely noticeable for other Snowflake customers). This is what happened in the last two incidents described below. 

Apart from all this, we were also hit by a number of smaller issues (e.g. login failures) which were completely unrelated – they were just strokes of bad luck.

US Incident on April 24th

On 2020-04-24 at 9:12 UTC, we noticed reduced performance of a Snowflake warehouse in our US region and opened a ticket with Snowflake. At 12:00 UTC, the US warehouse started queuing queries at the usual Friday peak time. What looked like a normal peak, which usually lasts a couple of hours, turned into an overloaded warehouse where queries executed slower and slower. Multiple attempts to scale up the warehouse didn't help, so we escalated the ticket with Snowflake. We had to stop executing jobs to take load off the warehouse and give it time to recover. Snowflake engineering then boosted the resources of their CSL to prevent the issue from recurring.

Multiple factors contributed to the incident. The performance of the CSL was worse than usual that day, noticeably so, but not enough to trigger an alarm on the Snowflake side. This was combined with slightly higher load from our side and the fact that the Snowflake CSL cannot be scaled by boosting the warehouse. At some point, so many queries were queued that the CSL spent more time requeuing them than actually executing them.
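
A tiny simulation with made-up numbers illustrates the feedback loop: once the overhead of managing the queue grows with the queue length, the capacity left for real work shrinks, and the backlog doesn't just grow, it accelerates.

```python
# Toy illustration of the requeuing feedback loop described above; all the
# numbers are made up and only the shape of the curve matters.

def simulate(arrival_per_s: float, base_capacity_per_s: float,
             requeue_overhead: float, seconds: int) -> list[int]:
    queue, sizes = 0.0, []
    for _ in range(seconds):
        # Requeuing overhead eats into the capacity available for real work.
        effective = max(base_capacity_per_s - requeue_overhead * queue, 0.0)
        queue = max(queue + arrival_per_s - effective, 0.0)
        sizes.append(round(queue))
    return sizes

# A modest overload (110 arrivals/s vs. 100/s nominal capacity) snowballs:
print(simulate(arrival_per_s=110, base_capacity_per_s=100,
               requeue_overhead=0.2, seconds=10))
```

Stopping job execution, as we did, breaks exactly this loop by letting the queue drain before the overhead consumes all remaining capacity.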

EU Incident on May 8th

On 2020-05-08 at 8:30 UTC, we noticed reduced performance of a Snowflake warehouse in our EU region. Since we had already encountered a similar issue in the US region, we immediately took steps to reduce the load and avoid overloading the warehouse in the first place. This led to longer waiting times for jobs, but it allowed us to keep executing jobs throughout the incident. We raised the issue with Snowflake and, once they'd discovered the root cause, they applied a fix that resolved the issue. The root cause was an uneven distribution of queries in the CSL, which led to an overload and subsequent crash of the underlying machine. Because of this uneven distribution, there was not enough computational power in the part of the CSL allocated to us. While the root cause differs from that of the US incident, the symptoms were the same, and so were the reasons why this wasn't a platform-wide incident for Snowflake.

What Are We Doing About This?

First, we're working intensively with Snowflake. During the past few months, both we and Snowflake have learned to measure, detect and ideally avoid this kind of incident. We have both improved our processes for handling CSL issues. While it took more than a month to resolve the first problem, it took us only two hours to resolve the last incident. We both went down the long path of discovering, debugging and untangling a complex issue and we both gained valuable knowledge, albeit at a high price.

We're engaged in discussions with Snowflake engineering to better understand the implications of each other's design decisions. Snowflake engineering now understands our usage pattern and is taking steps to keep the CSL more stable; we understand what internal limits we're nearing and what to do to avoid exceeding them. In the short term, we've updated our maintenance procedures so that we can detect these issues earlier and act more quickly should something similar reoccur, and Snowflake have added resources to the CSL and improved monitoring to prevent the issues from occurring again. In the long term, we're working on adjusting our design and usage patterns to better match how Snowflake is set up, without modifying the way Keboola Connection works for you, and Snowflake are aiming to make the Cloud Service Layer more scalable.
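
As an example of the kind of early detection we mean, a simple percentile check on metadata-query latency is enough to flag the degraded state well before jobs start failing. This is a minimal sketch under assumed thresholds, not our production tooling:

```python
# Minimal sketch of an early-warning check on CSL (metadata query) latency.
# The threshold and minimum sample size are assumptions for illustration.

from statistics import quantiles

def csl_latency_degraded(latencies_ms: list[float],
                         p95_threshold_ms: float = 300.0) -> bool:
    """Return True if the recent p95 metadata-query latency looks degraded."""
    if len(latencies_ms) < 20:
        return False                      # too few samples to judge
    p95 = quantiles(latencies_ms, n=20)[-1]
    return p95 > p95_threshold_ms

# Usual behaviour (~100 ms) vs. the degraded state seen during the incidents:
print(csl_latency_degraded([100.0 + i for i in range(50)]))   # False
print(csl_latency_degraded([500.0 + i for i in range(50)]))   # True
```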

We've already taken a number of small steps; specifically:

  • We found a bug in a transformation service that caused some roles to be left over. This is already fixed and the number of unused workspaces is slowly decreasing.

  • We'll proceed to clear the rest of the unused database roles in a one-time cleanup (a simplified sketch of such a cleanup follows this list). This, along with the previous step, should improve CSL performance on the most affected projects.

  • We've agreed with Snowflake on changes to ODBC driver management to minimize the impact of any future regressions.

  • We're currently checking whether we can implement changes to our usage pattern, as suggested by Snowflake. 
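
For illustration, the one-time role cleanup mentioned in the list above could look roughly like the sketch below. It uses the snowflake-connector-python package; the WORKSPACE_ prefix, the dry-run default and the placeholder credentials are assumptions, and the real cleanup cross-checks every role against Keboola Connection metadata before dropping anything.

```python
# Hypothetical sketch of a one-time cleanup of leftover workspace roles.
# Not our production script; names and criteria are placeholders.

import snowflake.connector

def drop_unused_workspace_roles(conn, prefix: str = "WORKSPACE_",
                                dry_run: bool = True) -> None:
    cur = conn.cursor()
    cur.execute(f"SHOW ROLES LIKE '{prefix}%'")
    # Locate the "name" column in the SHOW ROLES result set.
    name_idx = [col[0].lower() for col in cur.description].index("name")
    for row in cur.fetchall():
        role = row[name_idx]
        # In the real cleanup, only roles with no matching active workspace
        # in our metadata would be dropped.
        if dry_run:
            print(f"would drop: {role}")
        else:
            cur.execute(f'DROP ROLE IF EXISTS "{role}"')

if __name__ == "__main__":
    conn = snowflake.connector.connect(
        account="example_account",   # placeholder credentials
        user="example_user",
        password="example_password",
    )
    drop_unused_workspace_roles(conn, dry_run=True)
```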

To be absolutely honest, we can't say that the problem is solved, but we now understand the causes and how to mitigate them. There is still a lot of technical work ahead of us. However, we are confident that, if the incidents repeat, we can manage them with less and less impact until they are no longer noticeable to you. We're really sorry that we haven't recently delivered the performance you are used to. We have all hands on deck to prepare and deliver a permanent fix as soon as possible. In these hard times patience is scarce, but we hope you will bear with us a bit longer as we tackle the work ahead.