Failing Orchestrations in the US Region

[2019-04-13 8:20 CET]

We have been registering an increased number of failing orchestrations since 2:27 CET. We are investigating the issue and will keep you updated.


[2019-04-13 8:45 CET]

We tracked down the problem and fixed it. Everything seems to be working now, and we continue to monitor the situation. Please review your orchestrations and restart them if necessary.



Week in review -- April 15, 2019

Updated Components

DB extractors:

  • The source now shows the schema as well as the table name in the tables list. It is also visible at first glance which tables are created from a storage table and which are from SQL.

  • The table schema is also shown in the config row detail sidebar.

  • When adding tables, you can click the schema name and add all tables in the schema at once.


Delayed jobs in EU region

Execution of some table import jobs scheduled after 07:42 UTC was delayed by up to 30 minutes. The delay was caused by a new platform release, which was immediately rolled back. All systems are now operational.

Orchestration Failures in the US Region

Today, on March 15, 2019, from 16:34:15 UTC to 16:35:12 UTC, there were some orchestration failures in the US region due to an internal system upgrade.

There were only around 20 failures, so very few users are affected, but if you had an orchestration running at that time, please check that it completed successfully.

We are working on making sure that this will not happen again during any future upgrades.

Snowflake issues in EU region

We were affected by a brief outage of the Snowflake database in the EU region on Mar 07 between 17:45:00 UTC and 18:25:00 UTC. The problem affected extractors and transformations. Please check your orchestrations and re-run them if necessary. Projects in the US region were unaffected. We apologise for the inconvenience caused.

R/Python sandboxes security update

We need to apply an important OS-level security update to the R/Python sandbox environment. Because of that, existing sandboxes cannot be extended. This means the following:

  • R/Python sandboxes created prior to 2019/02/12 will be terminated no later than 2019/02/17 14:00 UTC, even if you try to extend them.
  • If you wish to keep the contents of a sandbox created prior to 2019/02/12 14:00 UTC, please save them manually and recreate the sandbox.
  • R/Python sandboxes created after 2019/02/12 14:00 UTC are unaffected.
  • SQL sandboxes are unaffected.

Weeks in review -- February 8, 2019

Component Updates

  • Python Transformations - now use the same Python version (3.7.2) as the transformation sandbox.
  • R transformations have a new backend (v 3.5.2), and we added docs on how to opt in to the new version.
  • Storage Writer - now supports the `recreate` mode that will drop and create the target table.
  • Processor Decompress - supports graceful decompression and will skip any file that fails to decompress.
  • MySQL/MSSQL extractors - allow any numeric or datetime type for incremental fetching.
  • PostgreSQL - has automatic incremental fetching. The UI has to be migrated to the new version (using the green button in the config overview).
  • Generic Extractor now supports usage of deeply nested functions.
  • Zendesk Extractor - fixed extraction of custom ticket value fields; existing configurations need to be resaved (switch to template -> scroll to the bottom -> select a template again and save).
  • New component for Mailgun (sending emails).


UI Updates

  • Generic Snowflake sandbox - now uses the CLONE TABLE load type. It is much faster and only loads complete tables (no row sampling).
  • You can choose a backend version of R/Python transformations.
  • Snowflake writer - adding a new table now autoloads column datatypes if present (usual for tables originating from DB extractors).
  • Transformations Output - shows a warning when there are two output mappings with the same destination table within one phase.
  • PostgreSQL Extractor - query editor now supports PostgreSQL-specific syntax.


Storage and Project Management Updates

  • All newly created tables in Storage have a 16MB cell size instead of 1MB.
  • The limit of 110 columns in data preview was removed; contents of wider tables are displayed normally.
  • Organization invitations now work similarly to project invitations: an invited user has to accept the invitation.



KBC is not accessible in all regions

[2019-01-22 1:21 UTC]

Snowflake just announced that disabling the OCSP check circumvents the error. KBC is fully working; you shouldn't have any issues for now!


[2019-01-22 00:51 UTC]

We have removed all OCSP validation, and the KBC platform is working OK in both regions (US/EU) for now.

At this time, we have no further updates from the Snowflake support team. You shouldn't have any issues with Keboola Connection. In case of any hiccups, please open a ticket directly from your KBC project. Once we have the RCA report from Snowflake, this post will be updated.

We're very sorry for this inconvenience and thank you so much for your patience with us and Snowflake engineers.


[2019-01-22 00:34 UTC]

SQL Sandboxes are fully working. Please note that all existing credentials were discarded; use the new combination of username and password.


[2019-01-22 00:16 UTC]

Just a few components are still having an issue. 

To make up for this outage, we're going to add additional resources and run your jobs in Keboola Connection on Warp Drive for the next few hours.


[2019-01-22 00:08 UTC]

Almost everything is working now. The last remaining issues are in Transformation Sandboxes.


[2019-01-22 00:02 UTC]

We're very close to a fully working platform. Bear with us!


[2019-01-21 23:50 UTC]

Component jobs are still serving errors from the Snowflake DWH. We're disabling OCSP checks in other places in our infrastructure.

 

[2019-01-21 23:37 UTC]

Snowflake just confirmed an SSL validation issue in their ODBC driver (https://community.snowflake.com/s/group/0F90Z000000U8d9/alerts-awsus-west).


[2019-01-21 23:35 UTC]

US region is working.


[2019-01-21 23:34 UTC]

EU region is working.


[2019-01-21 23:32 UTC]

We're building an app version with temporarily modified OCSP checks.


[2019-01-21 23:17 UTC]

This issue seems to be connected with OCSP cert validation on the Snowflake side. We're still working on it.


[2019-01-21 22:50 UTC]

Starting 2019-01-21 22:33 UTC, all customers are seeing error messages throughout their account. We’re aware of the issue and are working on it urgently.

We’re really sorry to be holding you up today! Please know our engineering and operations teams are working hard to get everything up and running and we will update you right here in 30 minutes with the latest information.

December Failed Jobs Postmortem

In December 2018 we had two incidents (2018-12-14 and 2018-12-19) which resulted in a number of failed jobs. The first one caused 0.8% of jobs to fail (in a 24h window) and the second one caused 1.2% of jobs to fail (in a 24h window). 

Both incidents were caused by unavailability of the Docker container registry (Amazon ECR). In the first incident, we were receiving quota-exceeded errors, and we initially thought that these were related to higher infrastructure load. A thorough investigation showed that we were nowhere near the limits, and we have now received confirmation from Amazon that this was an error on their side. The second incident was caused by complete unavailability of the ECR for approximately 30 minutes.

Technical background:

The Docker container registry is used to store the executable code for each component running in Keboola Connection. It is accessed on every job run to make sure that a job is run with the most recent version of the component code. During 2017 we moved most of our components to Amazon ECR, which proved to be very reliable. The outage mentioned above is the first one since 2016, when we began using it.

Most of Keboola's infrastructure is duplicated, with automatic fail-safe mechanisms in place. That means that minor outages in the underlying services are not noticeable to end users. Duplicating the Docker container registry, however, is not an easy task because Docker is not really ready for that yet. So this remains a single point of failure.

Measures already taken and yet to be taken:

  • We immediately implemented a retry mechanism in our code, which will handle short outages; the retry mechanism will also be further improved.
  • We had already started (prior to the incident) reworking the component code validation tooling so that the number of queries to the ECR is reduced by several orders of magnitude. This will help reduce the impact, should a similar incident happen again.
  • We'll use a dedicated ECR for each Keboola Connection region which will reduce the affected scope for any similar incident in the future. 
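
For readers curious what a retry mechanism for short registry outages can look like, here is a minimal sketch in Python. It is an illustration only, not the code running in Keboola Connection: the names `RegistryUnavailable`, `retry` and `flaky_pull` are hypothetical, and the exponential backoff strategy is an assumption.

```python
import time


class RegistryUnavailable(Exception):
    """Hypothetical error standing in for a transient registry failure."""


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on RegistryUnavailable, wait and try again.

    The wait doubles after each failed attempt (exponential backoff).
    The last failure is re-raised so the caller can fail the job.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except RegistryUnavailable:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated registry call (hypothetical) that fails twice, then succeeds.
calls = {"n": 0}


def flaky_pull():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RegistryUnavailable("registry not reachable")
    return "image pulled"


# A no-op sleep keeps the demonstration instant.
print(retry(flaky_pull, sleep=lambda seconds: None))  # prints "image pulled"
```

A retry like this only covers outages shorter than the total wait time, so a longer outage such as the 30-minute one described above would still surface as failed jobs.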


Job errors

Between 2019-01-15 15:58 and 2019-01-16 08:25 UTC we had a bug in our platform which caused some jobs to fail with the user error "Some columns are missing in the csv file". The bug affected jobs where data was imported to Storage with a non-default delimiter (the default is a comma). It is also possible that in some cases an extra column was created in the table. The column contains no data. This column needs to be deleted manually; otherwise, any subsequent jobs will fail.

We sincerely apologize for the trouble this may have caused you. Don't hesitate to contact our support for help.
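
If you want to check an exported table for such an empty column, a small script along these lines may help. This is a rough sketch that assumes you have the table contents as CSV text; the function name `empty_columns` and the sample data are made up for illustration, and it is not part of any Keboola API.

```python
import csv
import io


def empty_columns(csv_text, delimiter):
    """Return the names of columns in which every data row is empty."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    header = next(reader)
    has_data = [False] * len(header)
    for row in reader:
        # Ignore any cells beyond the header width.
        for i, value in enumerate(row[:len(header)]):
            if value.strip():
                has_data[i] = True
    return [name for name, used in zip(header, has_data) if not used]


# Made-up sample: the last column holds no data in any row.
sample = "id;name;extra\n1;Alice;\n2;Bob;\n"
print(empty_columns(sample, ";"))  # prints ['extra']
```

Note that a table with no rows at all reports every column as empty, so review the result before deleting anything.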