Keboola Status (Keboola Connection "Data Framework")

Snowflake Slowdown in the EU Region

10 Jul 2020 08:55 UTC We're seeing higher load and longer execution times for Snowflake queries in the EU region. We are investigating the cause. Next update in 60 minutes or as new information becomes available.

10 Jul 2020 09:55 UTC Unfortunately, we have no update at the moment. We have added some processing power to the Snowflake warehouse and are monitoring the situation closely to see if that helps. Job processing should be fine; you may only see slight delays. Next update in 60 minutes or as new information becomes available.

10 Jul 2020 11:10 UTC No update at the moment. Next update in 60 minutes or as new information becomes available.

10 Jul 2020 12:30 UTC No update at the moment. Next update in 90 minutes or as new information becomes available.

10 Jul 2020 14:00 UTC We're in touch with Snowflake support and trying to identify the root cause. Next update in 3 hours or as new information becomes available.

10 Jul 2020 18:00 UTC We're changing certain scaling parameters of the Snowflake warehouse to see if that helps resolve the issue. Next update in 24 hours or as new information becomes available.

11 Jul 2020 18:50 UTC The configuration change helped and we're fully operational. This is the last update of the incident. Thanks for your patience!


Snowflake Slowdown in the EU Region

9 Jul 2020 08:24 UTC We're seeing higher load and longer execution times for Snowflake queries in the EU region. We are investigating the cause. Next update in 60 minutes or as new information becomes available.

UPDATE 9 Jul 2020 8:54 UTC We have added additional power to the warehouse to help process the queued queries. Currently the situation seems normal, but we're monitoring it closely for the next couple of hours. Next update in 90 minutes or as new information becomes available.

UPDATE 9 Jul 2020 9:18 UTC After all the additional workload was processed, we scaled the cluster down, but we're seeing jobs queuing again. We have scaled the cluster up again to help with the load. We're monitoring the situation closely. Next update in 60 minutes or as new information becomes available.

UPDATE 9 Jul 2020 10:15 UTC We're in touch with Snowflake support to resolve this issue. Meanwhile, we have decreased the worker capacity, so Storage jobs may be queued on our end. This should take some load off the Snowflake warehouse to maximize its performance. Next update in 60 minutes or as new information becomes available.

UPDATE 9 Jul 2020 11:45 UTC The Snowflake engineering team is resolving the underlying issue. Thanks to the throttling on our end, there are currently no delays in job processing. Next update in 60 minutes or as new information becomes available.

UPDATE 9 Jul 2020 12:30 UTC Snowflake informed us that the issue was fixed. We're restoring platform parameters to the original values and will continue monitoring the situation. Next update in 60 minutes or as new information becomes available.

UPDATE 9 Jul 2020 13:20 UTC All operations are back to normal and everything is fully working.

We're sorry for the inconvenience and appreciate your patience.

Week in Review - July 7th, 2020

New Components

Updated Components

  • FTP extractor - when an FTP extractor is configured via the UI, enabling the Decompress option now automatically adds the keboola.processor-flatten-folders processor to the configuration (see the sketch below).
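For illustration, here is a minimal sketch of what such a configuration looks like, following Keboola's standard processor format; the exact structure generated by the UI may differ slightly.

# Illustrative sketch only -- the exact configuration generated by the UI may differ.
# The Decompress option adds a post-processing step that flattens the extracted
# folder structure after the FTP extractor has run.
ftp_extractor_config = {
    "parameters": {
        # FTP connection and file settings configured in the UI go here
    },
    "processors": {
        "after": [
            {"definition": {"component": "keboola.processor-flatten-folders"}}
        ]
    },
}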

UI Improvements

From now on, you'll be able to see invited users on the dashboard.

Upcoming changes to file expiration in Storage

We are introducing shorter expiration times for files in Storage.

From July 13, newly created files will have the following expiration settings:

- 15 days for table import files and manual uploads
- 48 hours for table export files
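As an illustration only (expiration is set and enforced server-side by Storage), the two windows above translate to the following:

from datetime import datetime, timedelta, timezone

# Illustration only: the expiration windows above applied to a file created right now.
EXPIRATION = {
    "table import files and manual uploads": timedelta(days=15),
    "table export files": timedelta(hours=48),
}

created = datetime.now(timezone.utc)
for kind, ttl in EXPIRATION.items():
    print(f"{kind}: expires {created + ttl:%Y-%m-%d %H:%M} UTC")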

Expiration of existing files will not be affected by this change.

Erik Žigo

Week in review -- June 29th, 2020

New Features and Updates

Project Description

Project description is no longer in a read-only mode; you can modify it to fit your needs.

Looker Writer Connection Name

Deprecation of Storage API .NET Client

We decided to deprecate the old and no longer maintained .NET version of the Storage API client. As a replacement, we recommend using one of the supported Storage API clients.

Renaming Storage Buckets and Tables

There's a separate post explaining this new feature.

Selecting Bucket in Input Mapping

You can select a whole bucket when adding new tables to the Input Mapping. This was originally enabled only for transformations; now you can use this feature in all remaining components.

Bug Fixes

  • Generic Extractor no longer stops after the 2nd page when downloading data in child jobs (only configurations with the Limit Stop setting were affected)
  • CSV import component supports full load again (due to a bug, all imports were performed incrementally).
  • MySQL writer no longer writes an empty string instead of NULL for columns with DATE and DATETIME data types.

New Components

  • CSOB CEB extractor for downloading bank statements from the CSOB CEB Business Connector service
  • Azure Blob Storage writer for exporting any input CSV files into designated Blob containers
  • Sisense writer for sending tables from Keboola Connection to a Sisense database platform
  • Zendesk writer for creating and updating Zendesk Support properties with the flexibility of defining their own parameters

Vladimír Kriška

OAuth Component Authorization errors

We're currently experiencing OAuth authorization failures with the following error:

Docker encryption error: Contact support@keboola.com and attach this exception id

Only authorization of new configurations is affected; running jobs are not affected. The first occurrence of the error was at 14:02 UTC.

We are performing a rollback. Next update in 60 minutes or as new information becomes available.

UPDATE 14:42 UTC - The rollback was successfully performed in the EU region. The rollback for the US region is in progress. Next update in 60 minutes or as new information becomes available.

UPDATE 15:01 UTC - The rollback has also been performed in the US region. All systems are operational.
"Something went wrong" - UI error in OAuth relying components

June 17, 10:26 UTC - After releasing a new version of the OAuth broker, we identified that the UI of components relying on OAuth authorization is broken, displaying only the message "Something went wrong".

UPDATE 11:01 UTC - We have reverted the release to the previous version, which restored the functionality. We are still investigating this issue. It's possible that the problem also affected jobs of these components.

UPDATE 11:31 UTC - We have confirmed that the jobs of the components were unaffected by this bug.

Failed attempts to run jobs in US region

Some API calls to run jobs are ending with an application error in the US region. We're investigating the causes.

Update 0:40 UTC: The problem is resolved now. One of the API workers froze. Running jobs were not affected. We will post a detailed analysis within a week.


Snowflake query incidents

We are investigating Snowflake query failures. Affected queries end with an error message similar to:

Processing aborted due to error 300005:3495968563; incident 9229003

Only a minority of projects and queries are affected. We noticed the first occurrences on June 8th. We are in touch with Snowflake Support; the issue is related to new Snowflake releases and they are investigating it. Next update in 120 minutes or as new information becomes available.

UPDATE 8:20 UTC - Snowflake engineering is working on this issue.

UPDATE 9:35 UTC - Snowflake engineering has already identified the issue; they are testing the changes and working on rolling them out. Next update in 120 minutes or as new information becomes available.

UPDATE 10:50 UTC - A fixed version will be released by Snowflake within 24 hours. Next update in 6 hours or as new information becomes available.

UPDATE 17:24 UTC - Snowflake engineering is still working on releasing the patch. The estimated release within 24 hours still holds. Next update in 4 hours or as new information becomes available.

UPDATE 22:10 UTC - According to Snowflake engineering, the issue is fixed. We're monitoring the situation.

UPDATE June 11, 05:28 UTC - Previously affected queries were executed successfully. Unfortunately, a few other queries ended with incidents between Jun 11 02:20:59 and 02:25:50 UTC in the EU region, and one query at Jun 11 01:10:00 UTC in the US region. Snowflake engineering is investigating the issue. Next update in 4 hours or as new information becomes available.

UPDATE June 11, 8:39 UTC - Snowflake engineering confirmed that the queries which failed tonight were still running on the affected release while clusters were still migrating to the newer one. At the moment we don't register any failures of queries running on the new release. We're monitoring the situation. Next update in 12 hours or as new information becomes available.

UPDATE June 11, 20:52 UTC - There have been no query failures since the last update. We'll continue to monitor the situation.

UPDATE June 12, 6:30 UTC - There have been no query failures since the last update. The issue is now resolved. We apologize for the inconvenience. If you have any questions or see any related issues, please contact Keboola Support.

Oracle extractor higher error rate

We are investigating a higher error rate of Oracle extractor jobs. Affected jobs end with the error:

DB query failed: Export process failed: Connection error: IO Error: The Network Adapter could not establish the connection Tried 5 times.

Next update in 60 minutes or as new information becomes available.

UPDATE 10:14 UTC: We have preventively rolled back to the previous version of the extractor and are monitoring for further failures. Next update in 60 minutes or as new information becomes available.

UPDATE 10:40 UTC: After the rollback, all previously affected configurations are running OK. We are investigating what caused the regression in the new release and will provide details within the next three days.

UPDATE June 9, 07:14 UTC: The failures were caused by incorrect handling of connection parameters in the component. Only configurations using an SSH tunnel were affected. We are working on better test coverage for these cases to avoid similar issues. We sincerely apologize for the errors.
MySQL extractor errors

We are investigating MySQL extractor errors for some configurations failing with the error message:

The "incrementalFetchingColumn" must be configured, if incremental fetching is enabled.

The issue was probably caused by a new extractor release at Jun 08 05:55 UTC; we are rolling it back.

UPDATE 07:07 UTC: We have rolled back to the previous version. All affected configurations should start working within five minutes.

UPDATE 07:18 UTC: All affected configurations are working again; the last error occurred at Jun 08 07:07:45.
GoodData Writer failures in EU region

There has been a problem with the GoodData API since about 08:00 UTC, causing failures of some model updates and data loads.

We are investigating the problem with GoodData support and will keep you updated.

Update 14:20 UTC - The GoodData Technical Support team is still investigating this issue.

Our GoodData writer component has not changed since January, so we are waiting for a clarification of the root cause from their support team.

Update 15:10 UTC - This issue is related to GoodData's release today. They are preparing a hotfix for it and expect it to be deployed in a few hours.

Update June 5, 08:20 UTC - The problem has been resolved. GoodData deployed a hotfix for their API last night. Since June 4 21:00 UTC we have not seen any new errors from the GoodData Writer.

Erik Žigo

Python/R Sandboxes failures

6:28pm UTC: We are experiencing failures when creating Python/R sandboxes in the EU and US regions and are investigating the problem. We'll keep you updated.

6:38pm UTC: We have identified the root cause as an expired intermediate CA certificate. We are proceeding to replace the expired certificate on the sandbox instances. Next update within an hour.

7:38pm UTC: We have successfully replaced the intermediate CA certificate for the US region, and Python/R sandboxes are now created successfully there. The replacement of the intermediate CA certificate for the EU region is underway. Next update within an hour.

7:43pm UTC: The issue with Python/R sandboxes failing to create is now resolved. We have replaced the expired intermediate CA certificate, and new Python/R sandboxes are created successfully in both the US and EU regions. We are sorry for the inconvenience.


Renaming Storage Buckets and Tables

An option to rename buckets and tables was one of the most requested features on our wishlist. It is very useful when you want to name your bucket by its contents (e.g., "email-orders") rather than "in.c-keboola-ex-gmail-587163382".

From now on, you'll be able to change the names of buckets and tables.

Rename Bucket

To rename a bucket, navigate to the bucket detail page, and click the pen icon next to the name parameter.

Then choose the name of your preference (there are some limitations though).

Rename Table

To rename a table, navigate to the table detail page, and click the pen icon next to the name parameter.

Then choose the name of your preference (the same limitations apply).

Consequent Changes

Although adding the option to rename a bucket or a table may not look like a big deal, we had to make some substantial changes under the hood. Some of the consequences are worth mentioning here:

Hidden "c-" prefix

We no longer show the "c-" prefix in the names of buckets and tables. It is still a part of the bucket and table ID, but the ID is no longer displayed in most cases. If you need to access the ID for some reason, it is still available on the detail page of each bucket and table.

This is an example of how buckets and tables are displayed without the "c-" prefix:

Stage Selector

When searching for a specific bucket or table, just select a stage and the buckets will be filtered by the selected stage.

Vladimír Kriška

Errors in AWS S3 Extractor

Today, from 7:00 UTC until 7:35 UTC, AWS S3 extractor jobs were failing with the error:

Invalid cipher text for key #KEBOOLA_USER_AWS_SECRET_KEY Value is not an encrypted value.

We found the root cause of the issue and immediately fixed it. There was no leaked key or secret, only incorrect naming in the environment setup. We sincerely apologize for the error.


Week in Review - May 22nd, 2020

Transformations

  • Python/R and Julia transformations have their default RAM limit increased to 16 GB. This applies to sandboxes as well.

Snowflake Platform Update

  • The DISTINCT keyword will be disallowed in an ordered window function or a window frame (see the illustration below). The full post can be found here.
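For illustration, this is the kind of query pattern affected; the table and column names below are made up, not taken from the Snowflake announcement.

# Hypothetical example of a query pattern rejected after the change:
# DISTINCT inside a window function that has an ORDER BY.
rejected_query = """
SELECT COUNT(DISTINCT user_id)
       OVER (PARTITION BY region ORDER BY event_date) AS running_users
FROM events;
"""
# Such queries need to be rewritten, e.g. by deduplicating in a subquery before
# applying the ordered window function; the right rewrite depends on the intent.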

New Components

  • OneDrive Excel Sheets extractor - extracts from a OneDrive account or from a SharePoint account.
  • OneDrive Excel Sheets writer - writes to a OneDrive account or to a SharePoint account.
  • Zoom Webinar Registrator - obtains a list of people to be registered for a specific Zoom Webinar and processes the registration.

Updated Components

  • Database writers - many database writers now support configuration rows. The full post can be found here.
  • AWS S3 extractor now uses the ListObjectsV2 API for listing objects in a bucket, which improves performance for versioned buckets (see the sketch after this list).
  • MongoDB extractor - added support for incremental fetching
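For context, here is a minimal boto3 sketch of ListObjectsV2-based listing; this is not the extractor's actual code, and the bucket name and prefix are placeholders.

import boto3

# Minimal sketch of listing objects with ListObjectsV2; per the note above, the
# extractor's switch to this API improves performance on versioned buckets.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="my-example-bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])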

Security Improvements

  • TLS security update: as of May 12, 2020, Transport Layer Security (TLS) 1.0 and 1.1 are no longer supported for securing connections to Keboola Connection endpoints. More information can be found here.

UI Improvements

  • In "generic UI" components, the documentation and configuration parts were split into separate boxes
  • The MongoDB extractor has a new detail layout to match the other database extractors.

Developer Portal UI Improvements

  • You can now preview and validate the form generated from the configuration schema.
Database Writers with Configuration Rows support

We're happy to announce the arrival of Configuration Rows, our new powerful configuration format, to database writers.

From now on, you'll see a migration button in the configuration detail of each database writer (Snowflake, MySQL, SQL Server, Oracle, PostgreSQL, Impala, Hive, and Redshift).

Just click Migrate Configuration and the configuration will be migrated to the new format.

After the migration, you'll see more information about each table. All tables can be easily reordered, so you can move more important tables to the top and they will be uploaded first.

Also, you will be able to see information about each table on a new table detail page, with Last runs and Versions in a sidebar.

Underlying Important Changes

While there were certain limitations in the old configuration format, this is no longer true in the new "rows format".

The following features are worth mentioning:

  • Disabled tables will no longer be exported from Storage (previously, they were exported with limit=1 and not used in the database writer).
  • Each table has its own state with information about the date/time of the last import (previously, an upload of a single table cleared the state for other tables).


Vladimír Kriška

Postmortem: MySQL Extractor errors

Summary

Original post https://status.keboola.com/mysql-extractor-errors

On May 11th, 2020, at 10:12 UTC, we released a new version (5.5.1) of the MySQL extractor which contained a bug.
It caused errors in the UI:

Decoding JSON response from component failed: Syntax error


It also affected jobs of this extractor. Although the jobs seemed to finish successfully, they didn't process any data.
The flawed version was released at 10:12 UTC and reverted at 13:09 UTC.

Unfortunately, another version (5.5.2), deployed on May 12th at 7:25 UTC, contained another bug which affected certain queries, resulting in the error message:
DB query failed: Trying to access array offset on value of type null

We reverted this release on May 13th at 11:26. It affected about 6% of all jobs of this component.
We sincerely apologize for the errors. 

What Happened?

The cause of the first problem was a missing command in the Dockerfile. This was fixed in release 5.5.2.

The second error was introduced with an update of the PHP version. It was fixed in the latest release (5.5.3).

What Are We Doing About This?

We have added tests to this component to cover these cases. 

Postmortem: Degraded Snowflake Performance & Failed Jobs

Summary

In the past months, we have been having a number of problems with the Snowflake backend in both the US and the EU region. These were caused by a number of loosely connected issues. For that reason, we've decided to publish a joint post-mortem. 

First, we've seen some rarely failing queries (that's about one failed query in a million) during January. It was unclear exactly what was causing the randomly appearing errors, so we kept investigating. On January 6th, we saw a high increase of this error type (hundreds in a million). We moved to debug this with Snowflake. From this point, we've seen a steady increase of the errors interlaid with days when it didn't occur at all. This made debugging the root cause quite challenging on both our own and the Snowflake side.

In our attempt to resolve the issue, we tried updating the ODBC drivers to different versions as advised. Unfortunately, the new drivers suffered from regression issues (now fixed under the references SNOW-148261 and SNOW-150687). This led to even more errors. On February 27th, Snowflake engineering found that the problem was actually related to the Snowflake Cloud Service Layer and the number of roles we have in our account. In their attempt to resolve the issue, they introduced changes to their service. On March 3rd, this resolved the issues of failing queries but caused slowdowns in the Cloud Service Layer. We suffered a particularly severe slowdown on March 3rd (EU) and then milder slowdowns on April 8th (EU) and April 23rd (EU). All of these had the same root cause. We were hit by another slowdown on April 24th in the US and on May 7th in the EU. The last two had somewhat different root causes and are described below in more detail.

What Happened?

The obvious questions that everyone asks (including us) are "Whose fault is this?"; "What's wrong with Keboola Connection?"; "Were we the only customer affected?" It turns out that the unfortunate events were caused by a conjunction of multiple causes. 

At the base of the pyramid is how we use Snowflake. Our usage pattern is in many aspects atypical. However, this is what we need for the great features of Keboola Connection like reliability, repeatability and auditability. The two most important characteristics (for the course of past events) are a high number of queries and frequent changes to the database roles. These two characteristics produce very high load on the Snowflake Cloud Service Layer (CSL), which is responsible for processing every query and figuring out its permissions. This unusual load for the Snowflake database puts strain on unexpected parts of the Snowflake environment and at some points we're pushing the limits.

The problem is far more complex, though. The load in terms of number of queries is one thing, but the load it creates on the Snowflake CSL is proportional to the complexity of permissions with which it interacts. It is therefore the combination of factors – the number of queries, the roles they use, the types of query and the state of the environment (load from other users, fraction of queries going to the warehouse, latency on CSL, queuing of queries and overhead associated with queuing) – that creates the mix. This is the reason why some projects have been more affected than others. Projects running large numbers of small jobs cause disproportionally higher load (more queries, more permission manipulations, fewer actual computations). Also, they are more affected because even small delays are noticeable in short jobs.

This explains why we were seemingly the only customer affected. We were not. When the queries were failing, it was one in a million at the beginning, and one in ten thousand at its worst. This kind of error rate is completely unnoticeable except in a highly automated and audited environment. That explains why an end-user can be querying the same Snowflake warehouse from a Looker or Tableau dashboard and see no problem and yet at the same time see failed jobs in Keboola Connection. This also applies to the later slowdown incidents. For example, we had a situation when all DCL queries took over 500 milliseconds instead of the usual 100 milliseconds. This is hardly noticeable by most customers, but it has a huge impact on the speed of Keboola Connection jobs, especially on the short ones. These are also the reasons why the incidents are not mentioned on Snowflake's status page. While they were not limited only to us, the impact on most of the other customers was not large enough to cross the necessary threshold. 
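A back-of-the-envelope illustration of why such a small per-query slowdown matters: the 100 ms and 500 ms latencies are the figures quoted above, while the number of queries per job is a hypothetical value.

# Hypothetical arithmetic: impact of the DCL latency increase on a short, query-heavy job.
queries_per_job = 200        # assumed number of metadata (DDL/DCL) queries in a short job
normal_latency = 0.100       # seconds per query, usual case
degraded_latency = 0.500     # seconds per query, during the incident

extra_seconds = queries_per_job * (degraded_latency - normal_latency)
print(f"Extra time per job: {extra_seconds:.0f} s")  # 80 s added to a job that may normally take only a few minutes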

The multiple regressions with the ODBC drivers also affected mainly us because we upgraded them hastily as soon as the upgrades were published in an attempt to resolve the original issue. While we were not the only customer using them, we ran high millions of queries through them within a few days. Customers not suffering from the CSL problems kept using the older drivers and were not impacted by this regression.

In more technical detail, a number of operations contributed to the incident. When we do an operation on Storage, we have to establish a connection with the Snowflake database. The database needs to evaluate the permissions of the connecting role. This is done in the Snowflake CSL. It takes care of processing queries which are not operating on data (DDL + DCL) and are not using the warehouse. 

When the issue first appeared, the CSL was dropping queries when it ran out of resources. The cause of this is that we have a complex permission system which we change often, thus invalidating a cache on which the CSL performance relies. Nearly every connection therefore needs to re-evaluate the permission tree of the connecting role. When Snowflake fixed this so that the queries were not dropped, another problem emerged. The simple fact that some DCL queries took nearly seconds instead of milliseconds caused serious slowdowns of job processing. The slowdown of each query was proportional to the size of the permission settings (mainly the number of Storage workspaces) and the amount of traffic in the project. At some point, the slowdown was so intense that queries were waiting seconds just to be received by the CSL.

The CSL also prepares the queries for each warehouse. Our application is "CSL intensive," which means that we are affected by even small performance degradations of the CSL (even if they are barely noticeable for other Snowflake customers). This is what happened in the last two incidents described below. 

Apart from all this, we were also hit by a number of smaller issues (e.g. login failure) which are completely unrelated – they were just strokes of bad luck.

US Incident on April 24th

On 2020-04-24 at 9:12 UTC, we noticed reduced performance of a Snowflake warehouse in our US region and opened a ticket with Snowflake. At 12:00 UTC, the US warehouse started queuing queries at the usual Friday peak time. What seemed like a normal peak, which lasts a couple of hours, turned into an overloaded warehouse where queries were executing slower and slower. Multiple attempts to scale up the warehouse didn't help, so we escalated the ticket with Snowflake. We had to stop executing jobs to pause the load on the warehouse and give it time to recover. Snowflake engineering then boosted the resources in their CSL to avoid a repeat of the issue.

Multiple factors contributed to the incident. The performance of the CSL was worse than usual that day, which was noticeable, but it was not enough to trigger an alarm on the Snowflake side. This was combined with slightly higher load from our side and the fact that the Snowflake CSL cannot be scaled by boosting the warehouse. At some point, the warehouse reached the situation where so many queries were queued that the CSL spent more time requeuing the queries than actually executing them.

EU Incident on May 7th

On 2020-05-07 at 8:30 UTC, we noticed reduced performance of a Snowflake warehouse in our EU region. Since we had already encountered a similar issue in the US region, we immediately took steps to reduce the load and avoid overloading the warehouse in the first place. This led to longer waiting times in jobs, but it allowed us to execute jobs during the whole incident. We raised the issue with Snowflake and, once they'd discovered the root cause, they applied a fix that resolved the issue. The root cause was uneven distribution of queries in the CSL, which led to an overload and subsequent crash of the underlying machine. With the uneven distribution bug in place, there was not enough computational power in the CSL part allocated to us. While the root cause is different from that of the US incident, the symptoms were the same and so were the reasons why this wasn't a platform-wide incident on Snowflake.

What Are We Doing About This?

First, we're working intensively with Snowflake. During the past few months, both we and Snowflake have learned to measure, detect and ideally avoid this kind of incident. We have both improved our processes for handling CSL issues. While it took more than a month to resolve the first problem, it took us only two hours to resolve the last incident. We both went down the long path of discovering, debugging and untangling a complex issue and we both gained valuable knowledge, albeit at a high price.

We're engaged in discussions with Snowflake engineering in order to better understand the implications of each other's design decisions. We have learned a lot about what limits we are nearing and what can be done about them. Snowflake engineering understands our usage pattern and is taking steps to keep the CSL more stable. We understand what internal limits we're nearing and what we should do to avoid exceeding them. In the long term, we're working on adjusting our design and usage patterns to better match how Snowflake is set up. We will do it without modifying the way Keboola Connection works for you. In the short term, we've updated our maintenance procedures to be able to detect these issues earlier and then to act more quickly, should something similar reoccur. In the short term, Snowflake have added additional resources to the Snowflake CSL and improved monitoring to prevent these issues from occurring again. In the long term, Snowflake are aiming to make the Cloud Service Layer more scalable. 

We've already taken a number of small steps; specifically:

  • We found a bug in a transformation service that caused some roles to be left over. This is already fixed and the number of unused workspaces is slowly decreasing.

  • We'll proceed to clear the rest of the unused database roles in a one-time cleanup. This, along with the previous step, should improve CSL performance on the most affected projects.

  • We've agreed with Snowflake about changes to make to the ODBC driver management to minimize the impact of any future regressions.

  • We're currently checking whether we can implement changes to our usage pattern, as suggested by Snowflake. 

To be absolutely honest, we can't say that the problem is solved, but we now understand the causes and how to mitigate them. There is still a lot of technical work ahead of us. However, we are confident that, if the incidents repeat, we can manage them with less and less impact until they are not noticeable to you. We're really sorry that we haven't delivered the performance you are used to recently. We have all hands on deck, though, to prepare and deliver a permanent fix as soon as possible. In the current hard times, patience is scarce, but we hope you will be patient with us for a bit longer as we tackle the work needed. 


Postmortem: Incident with Snowflake in the US Region

Summary

On April 14 between 19:58 and 21:23 UTC, the US Snowflake backend became unavailable. All jobs working with a Snowflake database failed with an internal error. Logging into workspaces was not possible either.

What Happened?

On April 14, Snowflake created a new release with an issue in the authentication process. This resulted in the inability to create a new database session for the affected accounts. The release was deployed gradually, which is the reason why only some accounts were affected. The release was rolled back by Snowflake.

What Are We Doing About This?

We are terribly sorry, but we can't really do anything. This is out of our hands.

Detailed explanation from Snowflake

When a user tries to authenticate, the Snowflake cloud service layer creates a session object that lists all the roles for the user. As this amounted to a large number in the Keboola account, it exposed a resource leak in our 4.12 release that resulted in users not being able to log in.

Other customers were not impacted as their role hierarchy did not trigger the same code path.

As an immediate remediation, Snowflake rolled back the affected release and disabled the code path, which was protected by a parameter.

As part of the post-mortem, a test was added to our test suite that better captures this role configuration. Additionally, logging was put in place to make this type of corner case easier to detect and diagnose.


MySQL Extractor errors

Today we released a new version of the MySQL extractor which contained a bug.

It caused errors in the UI:

Decoding JSON response from component failed: Syntax error

It also affected jobs of this extractor. Although the jobs seemed to finish successfully, they didn't process any data.

The flawed version was released at 12:14 and reverted at 15:09 CET.


UPDATE:

Another version, deployed on May 12th at 9:25, introduced another bug, which affected certain queries, resulting in the error message:

DB query failed: Trying to access array offset on value of type null

We reverted this release today, May 13th, at 11:26.

We sincerely apologize for the errors. A postmortem report will follow with further details.

Snowflake Slowdown in the EU Region

7 May 2020 8:30 UTC We're seeing higher load and longer execution times for Snowflake queries in the EU region. We have added more compute capacity and are investigating the cause. Next update in two hours.

7 May 2020 9:50 UTC Performance should be back to normal; we're monitoring the situation.

7 May 2020 11:00 UTC We're again seeing slower execution; we're working with Snowflake on resolving the issue. Next update in two hours.

7 May 2020 12:13 UTC Snowflake engineering identified the cause of the reduced performance, and we're now processing the backlog. There are still some queued orchestrations, but the run times of individual jobs are back to normal. Both we and Snowflake engineering are monitoring the load. Next update in two hours.

The incident is resolved.


Weeks in review -- April 2020

New Changes in the UI

  • Transformation script editing can now be done in fullscreen mode.

Normal mode:

Fullscreen mode


  • Database writers now have newly improved input mappings


  • The shared bucket detail now shows who shared it (if applicable)


  • And the sandbox modals have been cleaned up:

New Components:

  • Active Campaign: Use this component to gather information on your campaigns from your Active Campaign account.


Updated Components:

  • MySQL extractor now properly handles utf8mb4 emojis 
  • Data Warehouse Manager now allows password reset for schema users
TLS Security Update

As of May 12, 2020, Transport Layer Security (TLS) 1.0 and 1.1 will no longer be supported for securing connections to Keboola Connection endpoints.

The vast majority of HTTPS connections made to KBC endpoints use TLS 1.2 and will not be affected. This includes every currently shipping browser used by KBC users. 
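If you want to verify which TLS version your own tooling negotiates, a minimal check looks like this; the hostname below is an example, so use the endpoint of your region.

import socket
import ssl

# Minimal check of the negotiated TLS version against a Keboola Connection endpoint.
hostname = "connection.keboola.com"  # example; use your region's endpoint
context = ssl.create_default_context()

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls:
        print(tls.version())  # e.g. "TLSv1.2" or "TLSv1.3"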

We have separately contacted all affected projects. If you did not hear from us, no action is required.

If you have any questions or concerns related to this announcement, please don’t hesitate to contact us.

Snowflake Slowdown in the US Region

Friday, 24 April 2020 14:42 UTC We're seeing higher load and longer execution times for Snowflake queries in the US region. We have added more compute capacity and are investigating the cause. Next update in two hours.

Update 18:16 UTC: We're still seeing degraded performance of Snowflake in the US region and are investigating with Snowflake support. Next update in 2 hours.

Update 20:22 UTC: We are working with Snowflake on reducing the queueing in our warehouse. We had to pause job execution at 20:00 UTC to reduce the influx of queries. When the queue is worked through, we'll re-enable the jobs.

Update 20:51 UTC: We re-enabled the paused job queue with limited throughput and are monitoring the Snowflake queue closely. So far we see no queueing. Next update in 2 hours.

Update 22:21 UTC: The job queue is running at full capacity and there are no queries waiting in the Snowflake warehouse. Preliminary analysis suggests that the issue was probably caused by congestion in Snowflake's Cloud Service Layer, but it took the Snowflake team some time to find the root cause and fix it. Some jobs were delayed and some queries timed out, resulting in job failures. Those jobs will need to be restarted. We're sorry for the problems this might have caused.

Snowflake Slowdown in EU

Monday, 20 April 2020 07:39:02 UTC: We're seeing degraded performance of Snowflake in the EU region; we're investigating the cause with Snowflake. Next update in 1 hour.

Update 08:17:25 UTC: We have added more computing power, and average running times are back to normal. We're still seeing occasional isolated queries that take longer. We're still working with Snowflake on identifying and resolving the issue, but Keboola Connection is stable now. Next update in 4 hours.

Update 11:31:30 UTC: We still observe a slight slowdown in some queries, while other queries run smoothly. From our analytics, it seems that job run times are not affected, as we've offset the slowdown with more computing power. Next update in 4 hours.

Update 15:33:10 UTC: No significant changes; the situation is stable but not resolved. Snowflake is working on identifying the source of the performance issues. We're monitoring the situation, and in case of significant slowdowns we'll offset them with more computational power. Next update tomorrow or earlier if there are any changes.

Update 21 April 2020: The situation is stable; we're working with Snowflake on maintaining stability.

Update 22 April 2020: Snowflake engineers improved the performance of the impacted queries, and together we're working on preventing this in the future. We consider the incident closed. A postmortem will be published when the root cause is fully understood.

Snowflake Job Delays in the US Region

In the early morning, Snowflake had an incident in their US West region which caused a large backlog of job processing in Keboola's US region. The jobs were all eventually processed, but they may have taken much longer than you normally experience.

The buildup in our queue began just before 2:00 AM CEST and started to ease after 4:30 AM CEST.

Please refer to Snowflake's status page for further information; we will add a link to the RCA when it becomes available.

Transformation failures - Post-Mortem

Summary

Between March 30, 20:58 UTC and March 31, 6:15 UTC, some transformation jobs failed with an internal error. About 2% of all transformation jobs were affected. We sincerely apologize for this incident.

What Happened?

On March 30 at 20:58 UTC, we deployed a new version of the Transformation service which contained updated Snowflake ODBC drivers. The update was enforced by Snowflake as a security patch. Unfortunately, the new version of the driver contained a critical bug that caused it to crash when some queries ran longer than one hour. This led to failed transformation jobs.

What Are We Doing About This?

We now treat all driver updates as major updates. This means they go through more careful deployment and monitoring so that we can detect possible problems faster. In the long term, we're working with Snowflake to update drivers in a more controlled manner.


Incident with Snowflake in the US Region

We are currently investigating an increased error rate from Snowflake in the US region, starting at approximately 10:00 PM CEST.

We will update here as soon as we know more.

UPDATE 11:05 PM CEST: We are handling the issue with Snowflake support. So far, all Snowflake operations in the US region seem to be failing. Next update at 11:30 PM or sooner if there is any new information or the situation changes.

UPDATE 11:30 PM CEST: Snowflake rolled back the release they made today and everything has returned to working condition.

UPDATE 12:00 AM CEST: We're very sorry for this inconvenience. The errors started at 12:58 PDT (19:58 UTC) and lasted until 14:24 PDT (21:24 UTC). All new Snowflake connections in the US (including those from your DB clients) were failing during this period.

Unfortunately, you will need to restart any failed jobs or orchestrations from this time period.

The EU region was not affected by this issue.

Snowflake Slowdown in EU

A scaling script running at 12:00 AM CEST failed to scale up the Snowflake warehouse in the EU region. All Storage and transformation jobs in the EU were affected by this issue and ran significantly slower than usual.
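For illustration only, scaling a warehouse up or down amounts to a single ALTER WAREHOUSE statement issued by such a script; the warehouse name, size, and connection parameters below are hypothetical, not Keboola's actual values.

# Hypothetical illustration of the kind of statement a scaling script issues --
# not Keboola's actual script; warehouse name, size, and credentials are made up.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account", user="scaling_bot", password="...",
)
try:
    conn.cursor().execute("ALTER WAREHOUSE KEBOOLA_EU SET WAREHOUSE_SIZE = 'X-LARGE'")
finally:
    conn.close()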

To help process the queued load, we scaled up the warehouse at 9:45 AM CEST and will keep it scaled up until the whole backlog is processed.

We're sorry for this inconvenience and we'll be implementing safeguards to prevent this from happening again. 

Ondrej Hlavacek