Oracle extractor higher error rate

We are investigating a higher error rate of Oracle extractor jobs. Affected jobs end with the error:

DB query failed: Export process failed: Connection error: IO Error: The Network Adapter could not establish the connection Tried 5 times.

Next update in 60 minutes or as new information becomes available.

UPDATE 10:14 UTC: We have preventively rolled back to the previous version of the extractor and are monitoring for further failures. Next update in 60 minutes or as new information becomes available.

UPDATE 10:40 UTC: After the rollback, all previously affected configurations are running OK. We are investigating what caused the regression in the new release and will provide details within the next three days.

UPDATE June 9, 07:14 UTC: The failures were caused by incorrect handling of connection parameters in the component. Only configurations using an SSH tunnel were affected. We are working on better test coverage for these cases to avoid similar issues. We sincerely apologize for the errors.

MySQL extractor errors

We are investigating MySQL extractor errors; some configurations are failing with the error message:

The "incrementalFetchingColumn" must be configured, if incremental fetching is enabled.

The issue was probably caused by the new extractor release on Jun 08 05:55 UTC; we are rolling it back.

UPDATE 07:07 UTC: We have rolled back to the previous version. All affected configurations should start working within five minutes.

UPDATE 07:18 UTC: All affected configurations are working again; the last error occurred at Jun 08 07:07:45.

Python/R Sandboxes failures

6:28pm UTC: Python/R sandboxes are failing to create in the EU and US regions. We are investigating the problem and will keep you updated.

6:38pm UTC: We have identified the root cause: an expired intermediate CA certificate. We are replacing the expired certificate on the sandbox instances. Next update within an hour.

7:38pm UTC: We have replaced the intermediate CA certificate in the US region, and Python/R sandboxes are now created successfully there. The certificate replacement in the EU region is underway. Next update within an hour.

7:43pm UTC: The issue with Python/R sandboxes failing to create is now resolved. We have replaced the expired intermediate CA certificate, and new Python/R sandboxes are created successfully in both the US and EU regions. We are sorry for the inconvenience.
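For context on the failure mode, here is a minimal, hypothetical Python sketch showing how an expired certificate anywhere in a server's chain surfaces on the client side. The hostname is a placeholder rather than an actual Keboola endpoint, and this illustrates the general mechanism only, not a reproduction of the incident.

```python
# Minimal sketch of how an expired certificate in a server's chain surfaces on
# the client side: TLS verification fails before any application request is made.
# The hostname below is a placeholder, not an actual Keboola endpoint.
import socket
import ssl

HOSTNAME = "sandbox.example.com"  # placeholder

context = ssl.create_default_context()

try:
    with socket.create_connection((HOSTNAME, 443), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=HOSTNAME) as tls:
            print("Handshake OK; server certificate expires:", tls.getpeercert()["notAfter"])
except ssl.SSLCertVerificationError as exc:
    # An expired intermediate typically yields a "certificate has expired" reason here.
    print("Certificate verification failed:", exc.verify_message)
except OSError as exc:
    print("Connection failed:", exc)
```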


Errors in AWS S3 Extractor

Today, from 7:00 UTC until 7:35 UTC, AWS S3 extractor jobs were failing with the error:

Invalid cipher text for key #KEBOOLA_USER_AWS_SECRET_KEY Value is not an encrypted value.

We found the root cause of the issue and fixed it immediately. No key or secret was leaked; the problem was only incorrect naming in the environment setup. We sincerely apologize for the error.


Week in Review - May 22nd, 2020

Transformations

  • Python/R and Julia transformations have had their default RAM limit increased to 16 GB. This applies to sandboxes as well.

Snowflake Platform Update

  • The DISTINCT keyword will be disallowed in an ordered window function or a window frame. The full post can be found here.

New Components

  • OneDrive Excel Sheets extractor - extracts from a OneDrive account or from a SharePoint account.
  • OneDrive Excel Sheets writer - writes to a OneDrive account or to a SharePoint account.
  • Zoom Webinar Registrator - obtains a list of people to be registered for a specific Zoom Webinar and processes the registration.

Updated Components

  • Database writers - many database writers now support config row configurations. The full post can be found here.
  • AWS S3 extractor - now uses the ListObjectsV2 method for listing objects in a bucket. This improves performance for versioned buckets (see the short sketch after this list).
  • MongoDB extractor - added support for incremental fetching.
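As an illustration of the S3 listing change above, here is a minimal boto3 sketch of paging through a bucket with the ListObjectsV2 API. This is not the extractor's code; the bucket name and prefix are placeholders.

```python
# Minimal boto3 sketch of listing objects via the ListObjectsV2 API.
# Not the extractor's actual code; bucket name and prefix are placeholders.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# ListObjectsV2 pages through the current object keys in the bucket
# (the default page size is 1000 keys).
for page in paginator.paginate(Bucket="example-bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```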

Security Improvements

  • TLS security update: as of May 12, 2020, Transport Layer Security (TLS) 1.0 and 1.1 are no longer supported for securing connections to Keboola Connection endpoints. More information can be found here.

UI Improvements

  • In "generic UI" components, the documentation and configuration parts were split into separate boxes
  • MongoDB has a new detail layout to match other database extractors.

Developer Portal UI Improvements

  • You are able to preview and validate the form created by the configuration schema.

Postmortem: MySQL Extractor errors

Summary

Original post: https://status.keboola.com/mysql-extractor-errors

On May 11th, 2020, at 10:12 UTC, we released a new version (5.5.1) of the MySQL extractor in which a bug was present.
It caused the following error in the UI:

Decoding JSON response from component failed: Syntax error


It also affected jobs of this extractor. Although the jobs seemed to finish successfully, they didn't process any data.
The flawed version was released at 10:12 UTC and reverted at 13:09 UTC.

Unfortunately, another version (5.5.2), deployed on May 12th at 7:25 UTC, contained a different bug, which affected certain queries and resulted in the error message:
DB query failed: Trying to access array offset on value of type null

We reverted this release on May 13th at 11:26. It affected about 6% of all jobs of this component.
We sincerely apologize for the errors. 

What Happened?

The cause of the first problem was a missing command in the Dockerfile. This was fixed in release 5.5.2.

The second error was introduced by an update of the PHP version; the error message above is the notice that newer PHP versions emit when a null value is accessed with an array offset. It was fixed in the latest release (5.5.3).

What Are We Doing About This?

We have added tests to this component to cover these cases. 

Postmortem: Degraded Snowflake Performance & Failed Jobs

Summary

In the past few months, we have had a number of problems with the Snowflake backend in both the US and the EU regions. These were caused by a number of loosely connected issues. For that reason, we've decided to publish a joint postmortem.

First, during January we saw some rarely failing queries (about one failed query in a million). It was unclear exactly what was causing the randomly appearing errors, so we kept investigating. On January 6th, we saw a sharp increase in this error type (hundreds per million). We moved to debug this with Snowflake. From that point on, we saw a steady increase in the errors, interspersed with days when they didn't occur at all. This made debugging the root cause quite challenging on both our side and Snowflake's.

In our attempt to resolve the issue, we tried updating the ODBC drivers to different versions, as advised. Unfortunately, the new drivers suffered from regressions (now fixed under the references SNOW-148261 and SNOW-150687). This led to even more errors. On February 27th, Snowflake engineering found that the problem was actually related to the Snowflake Cloud Service Layer and the number of roles we have in our account. In their attempt to resolve the issue, they introduced changes to their service. On March 3rd, this resolved the failing queries but caused slowdowns in the Cloud Service Layer. We suffered a particularly severe slowdown on March 3rd (EU) and then milder slowdowns on April 8th (EU) and April 23rd (EU). All of these had the same root cause. We were then hit by further slowdowns on April 24th in the US and on May 7th in the EU. These last two had somewhat different root causes and are described below in more detail.

What Happened?

The obvious questions that everyone asks (including us) are "Whose fault is this?"; "What's wrong with Keboola Connection?"; "Were we the only customer affected?" It turns out that the unfortunate events were caused by a conjunction of multiple causes. 

At the base of the pyramid is how we use Snowflake. Our usage pattern is in many aspects atypical. However, this is what we need for the great features of Keboola Connection like reliability, repeatability and auditability. The two most important characteristics (for the course of past events) are a high number of queries and frequent changes to the database roles. These two characteristics produce very high load on the Snowflake Cloud Service Layer (CSL), which is responsible for processing every query and figuring out its permissions. This unusual load for the Snowflake database puts strain on unexpected parts of the Snowflake environment and at some points we're pushing the limits.

The problem is far more complex, though. The load in terms of number of queries is one thing, but the load it creates on the Snowflake CSL is proportional to the complexity of permissions with which it interacts. It is therefore the combination of factors – the number of queries, the roles they use, the types of query and the state of the environment (load from other users, fraction of queries going to the warehouse, latency on CSL, queuing of queries and overhead associated with queuing) – that creates the mix. This is the reason why some projects have been more affected than others. Projects running large numbers of small jobs cause disproportionally higher load (more queries, more permission manipulations, fewer actual computations). Also, they are more affected because even small delays are noticeable in short jobs.

This explains why we were seemingly the only customer affected. We were not. When the queries were failing, it was one in a million at the beginning, and one in ten thousand at its worst. This kind of error rate is completely unnoticeable except in a highly automated and audited environment. That explains why an end-user can be querying the same Snowflake warehouse from a Looker or Tableau dashboard and see no problem and yet at the same time see failed jobs in Keboola Connection. This also applies to the later slowdown incidents. For example, we had a situation when all DCL queries took over 500 milliseconds instead of the usual 100 milliseconds. This is hardly noticeable by most customers, but it has a huge impact on the speed of Keboola Connection jobs, especially on the short ones. These are also the reasons why the incidents are not mentioned on Snowflake's status page. While they were not limited only to us, the impact on most of the other customers was not large enough to cross the necessary threshold. 

The multiple regressions with the ODBC drivers also affected mainly us because we upgraded them hastily as soon as the upgrades were published in an attempt to resolve the original issue. While we were not the only customer using them, we ran high millions of queries through them within a few days. Customers not suffering from the CSL problems kept using the older drivers and were not impacted by this regression.

In more technical detail, a number of operations contributed to the incident. When we do an operation on Storage, we have to establish a connection with the Snowflake database. The database needs to evaluate the permissions of the connecting role. This is done in the Snowflake CSL. It takes care of processing queries which are not operating on data (DDL + DCL) and are not using the warehouse. 

When the issue first appeared, the CSL was dropping queries when it ran out of resources. The cause of this is that we have a complex permission system which we change often, thus invalidating a cache on which CSL performance relies. Nearly every connection therefore needs to reevaluate the permission tree of the connecting role. When Snowflake fixed this so that queries were no longer dropped, another problem emerged. The simple fact that some DCL queries took nearly a second instead of milliseconds caused serious slowdowns of job processing. The slowdown of each query was proportional to the size of the permission settings (mainly the number of Storage workspaces) and the amount of traffic in the project. At some point, the slowdown was so intense that queries were waiting seconds to be received by the CSL.

The CSL also prepares the queries for each warehouse. Our application is "CSL intensive," which means that we are affected by even small performance degradations of the CSL (even if they are barely noticeable for other Snowflake customers). This is what happened in the last two incidents described below. 
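To make the load pattern described above more concrete, here is a simplified, hypothetical sketch of what a single short job can generate on the database side. This is not Keboola's actual code; the account, user, warehouse, and object names are placeholders, and the exact statements are illustrative. The point is that almost every statement is DDL/DCL handled entirely by the Cloud Service Layer and changes the role hierarchy (invalidating the permission cache), while only the data query itself touches a warehouse.

```python
# Hypothetical sketch of the per-job pattern described above; NOT Keboola's code.
# Almost every statement is DDL/DCL served purely by the Cloud Service Layer.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",     # placeholder credentials
    user="example_user",
    password="example_password",
    warehouse="EXAMPLE_WH",        # placeholder warehouse
    database="EXAMPLE_DB",         # placeholder database
)  # assumes the connecting role may create roles and schemas
cur = conn.cursor()

job_id = "12345"  # illustrative job identifier

# Per-job workspace and role setup: pure DDL/DCL, no warehouse involved.
cur.execute(f"CREATE ROLE IF NOT EXISTS JOB_{job_id}_ROLE")
cur.execute(f"CREATE SCHEMA IF NOT EXISTS WORKSPACE_{job_id}")
cur.execute(f"GRANT ALL ON SCHEMA WORKSPACE_{job_id} TO ROLE JOB_{job_id}_ROLE")
cur.execute(f"GRANT ROLE JOB_{job_id}_ROLE TO USER example_user")

# Representative data query: the only part that actually runs on a warehouse.
cur.execute(f"CREATE TABLE WORKSPACE_{job_id}.RESULT AS SELECT 1 AS ID")

# Teardown: more DDL/DCL, again changing the role hierarchy.
cur.execute(f"DROP SCHEMA IF EXISTS WORKSPACE_{job_id}")
cur.execute(f"DROP ROLE IF EXISTS JOB_{job_id}_ROLE")

cur.close()
conn.close()
```

Run many such short jobs in parallel and the CSL ends up doing far more permission work than the warehouses do actual computation, which is the pattern described above.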

Apart from all this, we were also hit by a number of smaller issues (e.g. login failure) which are completely unrelated – they were just strokes of bad luck.

US Incident on April 24th

On 2020-04-24 at 9:12 UTC, we noticed reduced performance of a Snowflake warehouse in our US region and opened a ticket with Snowflake. At 12:00 UTC, the US warehouse started queuing queries at the usual Friday peak time. What seemed like a normal peak, which usually lasts a couple of hours, turned into an overloaded warehouse where queries were executing slower and slower. Multiple attempts to scale up the warehouse didn't help, so we escalated the ticket with Snowflake. We had to stop executing jobs to pause the load on the warehouse and give it time to recover. Snowflake engineering then boosted the resources in their CSL to avoid a repeat of the issue.

Multiple factors contributed to the incident. The performance of the CSL was worse than usual that day, which was noticeable, but it was not enough to trigger an alarm on the Snowflake side. This was combined with slightly higher load from our side and the fact that the Snowflake CSL cannot be scaled by boosting the warehouse. At some point, the warehouse reached the situation where so many queries were queued that the CSL spent more time requeuing the queries than actually executing them.

EU Incident on May 7th

On 2020-05-07 at 8:30 UTC, we noticed reduced performance of a Snowflake warehouse in our EU region. Since we had already encountered a similar issue in the US region, we immediately took steps to reduce the load and avoid overloading the warehouse in the first place. This led to longer waiting times in jobs, but it allowed us to execute jobs during the whole incident. We raised the issue with Snowflake and, once they'd discovered the root cause, they applied a fix that resolved the issue. The root cause was uneven distribution of queries in the CSL, which led to an overload and subsequent crash of the underlying machine. With the uneven distribution bug in place, there was not enough computational power in the CSL part allocated to us. While the root cause is different from that of the US incident, the symptoms were the same and so were the reasons why this wasn't a platform-wide incident on Snowflake.

What Are We Doing About This?

First, we're working intensively with Snowflake. During the past few months, both we and Snowflake have learned to measure, detect and ideally avoid this kind of incident. We have both improved our processes for handling CSL issues. While it took more than a month to resolve the first problem, it took us only two hours to resolve the last incident. We both went down the long path of discovering, debugging and untangling a complex issue and we both gained valuable knowledge, albeit at a high price.

We're engaged in discussions with Snowflake engineering in order to better understand the implications of each other's design decisions. We have learned a lot about what limits we are nearing and what can be done about them. Snowflake engineering understands our usage pattern and is taking steps to keep the CSL more stable. We understand what internal limits we're nearing and what we should do to avoid exceeding them. In the long term, we're working on adjusting our design and usage patterns to better match how Snowflake is set up. We will do it without modifying the way Keboola Connection works for you. In the short term, we've updated our maintenance procedures to be able to detect these issues earlier and then to act more quickly, should something similar reoccur. In the short term, Snowflake have added additional resources to the Snowflake CSL and improved monitoring to prevent these issues from occurring again. In the long term, Snowflake are aiming to make the Cloud Service Layer more scalable. 

We've already taken a number of small steps; specifically:

  • We found a bug in a transformation service that caused some roles to be left over. This is already fixed and the number of unused workspaces is slowly decreasing.

  • We'll proceed to clear the rest of the unused database roles in a one-time cleanup. This, along with the previous step, should improve CSL performance on the most affected projects.

  • We've agreed with Snowflake about changes to make to the ODBC driver management to minimize the impact of any future regressions.

  • We're currently checking whether we can implement changes to our usage pattern, as suggested by Snowflake. 

To be absolutely honest, we can't say that the problem is solved, but we now understand the causes and how to mitigate them. There is still a lot of technical work ahead of us. However, we are confident that, if the incidents repeat, we can manage them with less and less impact until they are no longer noticeable to you. We're really sorry that we haven't recently delivered the performance you are used to. We have all hands on deck, though, to prepare and deliver a permanent fix as soon as possible. In the current hard times, patience is scarce, but we hope you will be patient with us for a bit longer as we tackle the work needed.


Postmortem: Incident with Snowflake in the US Region

Summary

On April 14 between 19:58 and 21:23 UTC, the US Snowflake backend became unavailable. All jobs working with a Snowflake database failed with an internal error. Logging into workspaces was not possible either.

What Happened?

On April 14, Snowflake created a new release with an issue in the authentication process. This resulted in the inability to create a new database session for the affected accounts. The release was deployed gradually, which is the reason why only some accounts were affected. The release was rolled back by Snowflake.

What Are We Doing About This?

We are terribly sorry, but we can't really do anything. This is out of our hands.

Detailed explanation from Snowflake

When a user tries to authenticate, the Snowflake cloud service layer creates a session object that lists all the roles for the user. As this amounted to a large number in the Keboola account, it exposed a resource leak in our 4.12 release that resulted in users not being able to log in.

Other customers were not impacted as their role hierarchy did not trigger the same code path.

As an immediate remediation, Snowflake rolled back the affected release and disabled the code path, which was protected by a parameter.

As part of the post-mortem, a test was added to our test suite that better captures this role configuration. Additionally, logging was put in place to make detection of this type of corner case easier to diagnose.


MySQL Extractor errors

Today we released a new version of the MySQL extractor in which a bug was present.

It caused errors in the UI:

Decoding JSON response from component failed: Syntax error

It also affected jobs of this extractor. Although the jobs seemed to finish successfully, they didn't process any data.

The flawed version was released at 12:14 and reverted at 15:09 CET.


UPDATE:

Another version, deployed on May 12th at 9:25, introduced another bug, which affected certain queries and resulted in the error message:

DB query failed: Trying to access array offset on value of type null

We reverted this release today, May 13th, at 11:26.

We sincerely apologize for the errors. A postmortem report will follow with further details.

Snowflake Slowdown in the EU Region

7 May 2020 8:30 UTC We're seeing higher load and longer execution times for EU Snowflake queries. We have added more compute capacity and are investigating the causes. Next update in two hours.

7 May 2020 9:50 UTC Performance should be back to normal; we're monitoring the situation.

7 May 2020 11:00 UTC We're again seeing slower execution; we're working with Snowflake on resolving the issue. Next update in two hours.

7 May 2020 12:13 UTC Snowflake engineering has identified the cause of the reduced performance, and we're now processing the backlog. There are still some queued orchestrations, but the run times of individual jobs are back to normal. Both we and Snowflake engineering are monitoring the load. Next update in two hours.

The incident is resolved.