Snowflake Job Delays in the US Region

In the early morning, Snowflake had an incident in its US West region, which caused a large backlog of job processing in Keboola's US region. All jobs were eventually processed, but they may have taken much longer than you normally experience.

The buildup in our queue began just before 2:00 AM CEST and started to ease after 4:30 AM CEST.

Please refer to the above link for further information, and we will add a link to the RCA when it becomes available.

Transformation failures - Post-Mortem

Summary

Between March 30, 20:58 UTC and March 31, 6:15 UTC, some transformation jobs failed with an internal error. About 2% of all transformation jobs were affected. We sincerely apologize for this incident.

What Happened?

On March 30 at 20:58 UTC, we deployed a new version of the Transformation service containing updated Snowflake ODBC drivers. The update was required by Snowflake as a security patch. Unfortunately, the new version of the driver contained a critical bug that caused it to crash when queries ran longer than one hour. This led to failed transformation jobs.

What Are We Doing About This?

We now treat all driver updates as major updates. This means they go through more careful deployment and monitoring so that we can detect possible problems faster. In the long term, we're working with Snowflake to update drivers in a more controlled manner.


Incident with Snowflake in the US Region

We are currently investigating an increased error rate from Snowflake in the US region, starting at approximately 10:00 PM CEST.

We will update here as soon as we know more.

UPDATE 11:05 PM CEST: We are handling the issue with Snowflake support. So far, all Snowflake operations in the US region appear to be failing. Next update at 11:30 PM CEST, or sooner if there is any new information or the situation changes.

UPDATE 11:30 PM CEST: Snowflake rolled back the release they made today, and everything has returned to normal.

UPDATE 12:00 AM CEST: We're very sorry for this inconvenience. The error started at 12:58 PDT (19:58 UTC) and lasted until 14:24 PDT (21:24 UTC). All new Snowflake connections in the US (including those from your DB clients) were failing during that period.

Unfortunately, you will need to restart any jobs or orchestrations that failed during this period.

The EU region was not affected by this issue.

Snowflake Slowdown in EU

A scaling script running at 12:00 AM CEST failed to scale up the Snowflake warehouse in the EU region. All storage and transformation jobs in the EU were affected by this issue and were significantly slower than usual.

To help process the queued load, we scaled up the warehouse at 9:45 AM CEST and will keep it scaled up until the backlog is processed.
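
For context, the scale-up itself boils down to a single ALTER WAREHOUSE statement. Below is a minimal sketch of that kind of operation, assuming a Python script using the Snowflake connector; the account, credentials, role, warehouse name, and target size are placeholders, not our actual setup.

    # Illustrative only: scale up a Snowflake warehouse to absorb a queued load.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account",  # hypothetical account identifier
        user="scaling_bot",         # hypothetical service user
        password="***",
        role="SYSADMIN",
    )
    try:
        cur = conn.cursor()
        # Resize the warehouse; queries started after this point run on the larger size.
        cur.execute("ALTER WAREHOUSE example_wh SET WAREHOUSE_SIZE = 'XLARGE'")
        # Verify that the change took effect.
        cur.execute("SHOW WAREHOUSES LIKE 'example_wh'")
        print(cur.fetchall())
    finally:
        conn.close()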

We're sorry for the inconvenience, and we'll be implementing safeguards to prevent this from happening again.

Degraded Snowflake Performance (EU region) - April 8, 2020

We are investigating decreased Snowflake performance in the EU region, which has unfortunately recurred after previous resolutions. We are in touch with Snowflake support. Job performance and sandbox loading times may be affected. Next update at 12:30 PM UTC.

Update 12:30 UTC: We are handling the performance issue with Snowflake support, and we've offset the slowdown by scaling up the cluster. We'll have more information in about an hour. We caught the issue early on, so we hope it will have minimal impact on jobs apart from a small slowdown. So far we've seen 3 job failures because of this across the whole EU region. We'll post another update at 14:30 UTC, or sooner if there is any new information or the situation changes.

Update 14:30 UTC: We are still working with Snowflake on resolving the issue. The situation is currently stable, and we have not seen any jobs fail since the last update. Our main goal is to mitigate the issue before the midnight job surge. Next update at 18:30 UTC, or sooner if there is any new information or the situation changes.

Update 18:15 UTC: We are still working on mitigating the slowdown. We've seen only 3 related job failures since the last update, so we still consider the situation stable. We expect the issue to be resolved within the next hour.

Update 19:30 UTC: We're monitoring the situation; performance has improved and is close to previous values. We should have fresh aggregated monitoring data in approximately 15 minutes, and we expect it to show a complete recovery to standard performance.

Update 20:01 UTC: The issue has been resolved. 

Errors in Generic Extractor Post-Mortem

Summary

On April 4, 2020 at 10:07 UTC, we deployed a version of Generic Extractor which contained a bug. Some Generic Extractor jobs failed with the following error:

CSV file "XXX" file name is not a valid table identifier, either set output mapping for "XXX" or make sure that the file name is a valid Storage table identifier. 

Generic Extractor was reverted to its previous version at 14:08 UTC. The error affected 10% of all Generic Extractor jobs running during the four-hour period. We sincerely apologize for the trouble this may have caused you.

What Happened?

We changed the output generation rules so that tables are always generated, even if empty. Table names are normally generated using the outputBucket setting. However, they can also be generated using undocumented alternative settings via the ID or name properties. Unfortunately, the new code did not take the alternative settings into account and failed to generate correct table names.
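
For illustration, here is a minimal sketch of where the documented setting lives, shown as a Python dictionary; the structure is simplified, and the bucket and endpoint names are made up.

    # Simplified, illustrative Generic Extractor configuration fragment.
    # Table-name generation normally relies on the documented outputBucket
    # setting below; the undocumented id/name alternatives are not shown.
    configuration = {
        "parameters": {
            "config": {
                "outputBucket": "in.c-my-extractor",  # hypothetical destination bucket
                "jobs": [
                    {"endpoint": "users"},  # extracted data is written into the bucket above
                ],
            }
        }
    }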

What Are We Doing About This?

We have extended the tests to cover the undocumented settings, though we recommend you stick with the documented ones.

Errors in Generic Extractor jobs

Today we released a version of Generic Extractor that contained a bug. It caused certain configurations to fail with the following error:

CSV file XXX file name is not a valid table identifier, either set output mapping for XXX or make sure that the file name is a valid Storage table identifier. 

We have reverted the release. We sincerely apologize for the error. We will publish a postmortem next week.


Orchestrations API increased error rate in EU

We are experiencing problems causing errors in Orchestrations API responses in the EU region. We are investigating and will post more details here within an hour.

UPDATE Apr 2 11:32 CEST - The errors have stopped occurring. We are monitoring the situation and investigating the root cause.

UPDATE Apr 2 12:05 CEST - We've found that the API servers were flooded with unexpected bursts of requests. We've upgraded the infrastructure and will look for a way to prevent such a situation next time.

Week in Review - March 31st, 2020

UI Improvements

  • Action buttons are now directly accessible when hovering over list items in transformations and in components that use generic input or output mappings.
  • We added a new modal to improve the orchestration set-up experience. You can now more easily schedule orchestrations on an hourly, daily, or weekly basis. There's still an option to set up a custom schedule.
  • When you want to edit tables or credentials in your database writers, you no longer have to click the “Edit” button; you can edit the values directly and push the “Save” button.
  • We added a new modal for database writers that support provisioned credentials (Redshift, Snowflake). You can now directly create provisioned credentials.

Minor Improvements

  • The Julia transformation and sandbox have been updated to Julia 1.4.


Transformation failures

We're currently experiencing transformation failures and are investigating the problem. Next update in one hour.

UPDATE March 31, 6:17 AM UTC: We've identified the issue and deployed a rollback. Transformations started after 6:11 AM UTC should run without any issues. We're monitoring to ensure transformations are running as expected. Next update in one hour.

UPDATE March 31, 7:30 AM UTC: The rollback is complete, and no other issues have been reported within the last hour. We will investigate the root cause and publish a post-mortem soon.