Postmortem: Incident with Snowflake in the US Region

Summary

On April 14 between 19:58 and 21:23 UTC, the US Snowflake backend became unavailable. All jobs working with a Snowflake database failed with an internal error. Logging into workspaces was not possible either.

What Happened?

On April 14, Snowflake deployed a new release containing a bug in the authentication process. As a result, new database sessions could not be created for the affected accounts. The release was rolled out gradually, which is why only some accounts were affected. Snowflake then rolled the release back.

What Are We Doing About This?

We are terribly sorry, but we can't really do anything. This is out of our hands.

Detailed Explanation from Snowflake

When a user tries to authenticate, the Snowflake cloud service layer creates a session object that lists all the roles for the user. As this amounted to a large number in the Keboola account, it exposed a resource leak in our 4.12 release that resulted in users not being able to log in.

Other customers were not impacted as their role hierarchy did not trigger the same code path.
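
To illustrate why the number of roles in a session depends on the role hierarchy, here is a minimal sketch of expanding a user's directly granted roles into the full set of roles they inherit. It is not Snowflake's implementation; the function name, the `inherits` mapping, and the example role names are all hypothetical.

```python
from collections import deque


def effective_roles(direct_roles, inherits):
    """Return every role reachable from a user's directly granted roles.

    direct_roles: roles granted directly to the user (hypothetical input).
    inherits: maps a role to the roles granted to it (hypothetical structure).
    """
    seen = set(direct_roles)
    queue = deque(direct_roles)
    while queue:
        role = queue.popleft()
        for child in inherits.get(role, ()):
            # Track visited roles so shared or cyclic grants are listed once.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical hierarchy: ADMIN inherits DEV and ANALYST, both inherit READER.
hierarchy = {"ADMIN": ["DEV", "ANALYST"], "DEV": ["READER"], "ANALYST": ["READER"]}
roles = effective_roles(["ADMIN"], hierarchy)
```

In this sketch the size of the resulting set grows with the depth and breadth of the hierarchy, so an account with many roles produces a much larger session role list than a typical one.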

As an immediate remediation, Snowflake rolled back the affected release and disabled the code path, which was protected by a parameter.

As part of the post-mortem, a test was added to our test suite that better captures this role configuration. Additionally, logging was put in place to make this type of corner case easier to detect and diagnose.