Transformation failures - Post-Mortem

Summary

Between March 30, 20:58 UTC and March 31, 6:15 UTC, some transformation jobs failed with an internal error. About 2% of all transformation jobs were affected. We sincerely apologize for this incident.

What Happened?

On March 30 at 20:58 UTC, we deployed a new version of the Transformation service which contained updated Snowflake ODBC drivers. Snowflake enforced the update as a security patch. Unfortunately, the new driver version contained a critical bug that caused the driver to crash on some queries running longer than one hour. This led to failed transformation jobs.

What Are We Doing About This?

We now treat all driver updates as major updates. This means they go through more careful deployment and monitoring so that we can detect possible problems faster. In the long term, we're working with Snowflake to update drivers in a more controlled manner.


Incident with Snowflake in the US Region

We are currently investigating an increased error rate from Snowflake in the US region since approximately 10:00 PM CEST.

We will update here as soon as we know more.

UPDATE 11:05 PM CEST: We are handling the issue with Snowflake support. So far, all Snowflake operations in the US region seem to be failing. Next update at 11:30 PM, or sooner if there is any new information or the situation changes.

UPDATE 11:30 PM CEST: Snowflake rolled back the release they made today and everything has returned to working condition.

UPDATE 12:00 PM CEST: We're very sorry for this inconvenience. The error started at 12:58 PDT (19:58 UTC) and lasted until 14:24 PDT (21:24 UTC). All new Snowflake connections in the US (including those from your DB clients) were failing during this period.

Unfortunately, you will need to restart any failed jobs or orchestrations from this time period.

The EU region was not affected by this issue.
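
To find the jobs that need a restart, it can help to filter your job history by the incident window given above (19:58 to 21:24 UTC, US region only). The following is a minimal sketch, not an official tool: the job record fields (`region`, `status`, `ended_at`) and the function name are hypothetical, and the incident date has to be supplied by the caller.

```python
from datetime import datetime, time, timezone

# Incident window from the update above, as UTC wall-clock times.
WINDOW_START = time(19, 58)
WINDOW_END = time(21, 24)


def jobs_to_restart(jobs, incident_date):
    """Return failed US jobs that ended inside the incident window.

    `jobs` is a list of dicts with hypothetical keys "region", "status"
    and "ended_at" (a timezone-aware datetime). `incident_date` is the
    UTC date of the incident as a datetime.date.
    """
    start = datetime.combine(incident_date, WINDOW_START, tzinfo=timezone.utc)
    end = datetime.combine(incident_date, WINDOW_END, tzinfo=timezone.utc)
    return [
        job
        for job in jobs
        if job["region"] == "US"          # the EU region was not affected
        and job["status"] == "error"      # only failed jobs need a restart
        and start <= job["ended_at"] <= end
    ]
```

Comparing timezone-aware datetimes avoids mistakes when your own job timestamps are recorded in a different zone than UTC.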

Errors in Generic Extractor - Post-Mortem

Summary


On April 4, 2020 at 10:07 UTC, we deployed a version of Generic Extractor which contained a bug.
Some Generic Extractor jobs failed with the following error:

CSV file "XXX" file name is not a valid table identifier, either set output mapping for "XXX" or make sure that the file name is a valid Storage table identifier. 

Generic Extractor was reverted to its previous version at 14:08 UTC. The error affected 10% of all Generic Extractor jobs running during the four-hour period. We sincerely apologize for the trouble this may have caused you.

What Happened?

We changed the output generation rules so tables are always generated even if empty. Table names are normally generated using the outputBucket setting. However, it can also be done using undocumented alternative settings via ID or name properties. Unfortunately, the new code did not take the alternative settings into account and failed to generate correct table names.
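
The kind of fallback that was missed can be sketched as follows. This is an illustration only, not Generic Extractor's actual code: the function names, the order in which the alternative properties are tried, the `in.c-` bucket prefix, and the identifier pattern are all assumptions made for the example.

```python
import re

# Assumed shape of a Storage table identifier: stage.bucket.table
TABLE_ID = re.compile(r"^(in|out)\.c-[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$")


def resolve_bucket(config):
    """Pick the output bucket, preferring the documented setting."""
    if config.get("outputBucket"):
        return config["outputBucket"]
    # Alternative settings: fall back to the id, then the name property.
    # The "in.c-" prefix is an assumption made for this sketch.
    for key in ("id", "name"):
        if config.get(key):
            return f"in.c-{config[key]}"
    return None


def table_identifier(config, table_name):
    """Build a full table identifier, or fail with a clear error."""
    bucket = resolve_bucket(config)
    if bucket is None:
        raise ValueError("no outputBucket, id or name set in the configuration")
    table_id = f"{bucket}.{table_name}"
    if not TABLE_ID.match(table_id):
        raise ValueError(f"{table_id!r} is not a valid table identifier")
    return table_id
```

With this shape, a test for each of the three settings (and for a configuration with none of them) exercises every branch that generates a table name, including for empty tables.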

What Are We Doing About This?

We have extended the tests to cover the undocumented settings, though we recommend you stick with the documented ones.

Errors in Generic Extractor jobs

Today we released a version of Generic Extractor that contained a bug. It caused certain configurations to fail with the error:

CSV file XXX file name is not a valid table identifier, either set output mapping for XXX or make sure that the file name is a valid Storage table identifier. 

We have reverted the release. We sincerely apologize for the error. We will publish a post-mortem next week.


Orchestrations API increased error rate in EU

The Orchestrations API is returning errors in the EU region. We are investigating and will post more details here within an hour.

UPDATE Apr 2 11:32 CEST - The errors have stopped. We are monitoring the situation and investigating the root cause.

UPDATE Apr 2 12:05 CEST - We've found that the API servers were flooded with unexpected bursts of requests. We've upgraded the infrastructure and will look for a way to prevent this situation in the future.
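
One common way to absorb request bursts is a token bucket rate limiter. The sketch below shows the general technique only; it is not a description of what was or will be deployed for the Orchestrations API, and the class name and parameters are made up for the example.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the limiter can be tested
        self.tokens = float(capacity)   # start with a full bucket
        self.updated = clock()

    def allow(self):
        """Return True if a request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would typically keep one bucket per client or per API token and answer rejected requests with HTTP 429, so that a burst from one caller does not degrade responses for everyone else.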

Week in Review - March 31st, 2020

UI Improvements

  • Action buttons are now directly accessible when hovering over list items in transformations and components which use generic input or output mappings.
  • We added a new modal to improve the orchestration set-up experience. You can now more easily schedule orchestrations on an hourly, daily or weekly basis. There's still an option to set up a custom schedule.
  • When you want to edit tables or credentials in your database writers, you no longer have to click the “Edit” button. You can edit the values directly and click the “Save” button.
  • We added a new modal for database writers that support provisioned credentials (Redshift, Snowflake). You can now directly create provisioned credentials.

Minor Improvements

  • Julia transformations and sandboxes have been updated to Julia 1.4.


Transformation failures

We’re currently experiencing transformation failures and are investigating the problem. Next update in one hour.

UPDATE March 31, 6:17 AM UTC: We've identified the issue and deployed a rollback. Transformations started after 6:11 AM should run without any issues. We’re monitoring to ensure transformations are running as expected. Next update in one hour.

UPDATE March 31, 7:30 AM UTC: The rollback has finished and no other issues were reported within the last hour. We are going to investigate the root cause and will publish a post-mortem soon.

Transformation errors

Since March 26, 4:00 PM UTC, transformations in the US and EU regions have been failing to start with the error: Storage API bucket 'configuration_id' with configuration not found.

The error was caused by an incorrect configuration.

We're investigating the issue and will update this post with our findings.

We apologize for the inconvenience.

UPDATE March 26, 4:35 PM UTC: The problem has been fixed.

Degraded Snowflake Performance (US region) - March 24, 2020

Since March 24, 8:15 AM UTC, we have been seeing decreased performance of Snowflake in the US region. This may slow down jobs and sandbox loading in the US region. We are investigating the cause. Next update in one hour.

UPDATE Mar 24, 10:10 UTC - Performance should be back to normal. We're closely monitoring the situation.