December Failed Jobs Postmortem

In December 2018 we had two incidents (2018-12-14 and 2018-12-19) which resulted in a number of failed jobs. The first one caused 0.8% of jobs to fail (in a 24h window) and the second one caused 1.2% of jobs to fail (in a 24h window). 

Both incidents were caused by unavailability of the Docker container registry (Amazon ECR). In the first incident we were receiving exceeded-quota errors, which we initially attributed to higher infrastructure load. A thorough investigation showed that we were nowhere near the limits, and we have now received confirmation from Amazon that this was an error on their side. The second incident was caused by complete unavailability of ECR for approximately 30 minutes.

Technical background:

The Docker container registry is used to store the executable code for each component running in Keboola Connection. It is accessed on every job run to make sure that a job is run with the most recent version of the component code. During 2017 we moved most of our components to the Amazon ECR which proved to be very reliable. The outage mentioned above is the first one since 2016 when we began using it. 

Most of Keboola's infrastructure is duplicated with automatic fail-safe mechanisms in place. That means that minor outages in the underlying services are not noticeable to end users. Duplicating the Docker container registry, however, is not an easy task because Docker is not really ready for that yet, so the registry remains a single point of failure.

Measures already taken and yet to be taken:

  • We immediately implemented a retry mechanism in our code which will handle short outages. The retry mechanism will also be further improved.
  • We had already started (prior to the incident) reworking the component code validation tooling so that the number of queries to the ECR is reduced by several orders of magnitude. This will help reduce the impact, should a similar incident happen again.
  • We'll use a dedicated ECR for each Keboola Connection region which will reduce the affected scope for any similar incident in the future. 
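As an illustration of the first measure, a retry mechanism for short registry outages can be sketched as follows. This is a minimal sketch and not our production code: the `RegistryUnavailable` exception, the `with_retries` helper and all default values are hypothetical names and numbers chosen for the example.

```python
import random
import time


class RegistryUnavailable(Exception):
    """Hypothetical error raised when the container registry cannot be reached."""


def with_retries(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Call operation() and retry on RegistryUnavailable with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RegistryUnavailable:
            if attempt == max_attempts:
                raise  # give up: the outage outlasted the retry budget
            # Delay doubles with every attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter spreads out retries coming from many jobs at once.
            sleep(delay * random.uniform(0.5, 1.0))
```

With these example defaults the helper waits at most 15 seconds in total (1 + 2 + 4 + 8), so it would cover brief errors but not an outage of about 30 minutes like the second incident.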


Job errors

Between 2019-01-15 15:58 and 2019-01-16 08:25 UTC we had a bug in our platform which caused some jobs to fail with the user error "Some columns are missing in the csv file". The bug affected jobs where data was imported to Storage with a non-default delimiter (the default is a comma). It is also possible that in some cases an extra column was created in the table. The column contains no data. This column needs to be deleted manually, otherwise any subsequent jobs will fail.
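To check whether a table picked up such an extra column, one option is to export the table and look for columns that are empty in every row. The following is a minimal sketch under that assumption; `find_empty_columns` is a hypothetical helper, not part of any Keboola tool, and it expects a CSV export with a header row.

```python
import csv
import io


def find_empty_columns(csv_file, delimiter=","):
    """Return header names of columns whose value is empty in every data row."""
    reader = csv.reader(csv_file, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # completely empty file, nothing to report
    empty = [True] * len(header)
    for row in reader:
        for i, value in enumerate(row[: len(header)]):
            if value.strip():
                empty[i] = False
    return [name for name, is_empty in zip(header, empty) if is_empty]


# Example with an in-memory export; "extra" has no data in any row.
sample = io.StringIO("id,name,extra\n1,Alice,\n2,Bob,\n")
candidates = find_empty_columns(sample)
```

Note that a table with no data rows reports every column as empty, so the result is only a list of candidates to review before deleting anything.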

We sincerely apologize for any trouble this may have caused you. Don't hesitate to contact our support for help.


Snowflake issues

Some Snowflake transformations fail with the following error: odbc_exec(): SQL error: [unixODBC]SQL execution internal error: Processing aborted due to error 300010:1087106694; incident 5370475., SQL state XX000 in SQLExecDirect. There seems to be an issue in the Snowflake warehouse. We will keep you informed.

Update 19:30 UTC: The problem should be fixed in the US region. (The last occurrence in our infrastructure was at 12:08 UTC.) It seems the issue was specific to some CREATE TABLE AS SELECT and INSERT statements that use window functions.

New Year's Deprecations

It is always a good idea to start the New Year with something new. We decided to do it differently and start this New Year with deprecations. Cleaning up deprecated or obsolete parts of our system also has a place in our TODO lists.

So we'd like to announce the deprecation of a few components in the US region, MySQL transformations, and Storage bucket and table attributes.

Components

Google Drive Writer (wr-google-drive)

  • We announced the new version of Google Drive Writer on June 28th, 2017.
  • This is the last call. The old component will be shut down at the end of this month, on January 31st, 2019.

Lucky Guess (rt-lucky-guess)

  • Enhanced Analysis for Redshift backend is no longer available.
  • This component will be shut down on January 31st, 2019.

SalesForce Extractor (ex-salesforce)

  • This component is already deprecated.
  • This component will be shut down on January 31st, 2019.

YouTube Extractor (ex-youtube)

  • This component is already deprecated and no jobs have been run in recent months.
  • This component will be shut down immediately.

Zendesk Extractor (ex-zendesk)

  • A new version of the Zendesk extractor was announced on July 27, 2016.
  • As with the YouTube Extractor, no jobs have been run in recent months and we will shut it down immediately.

Restbox (restbox)

  • This component was deprecated on July 16, 2018 with multiple replacements.
  • Same here: no jobs have been run in recent months and it will be shut down immediately.

MySQL Transformations

MySQL Transformations were deprecated on November 13, 2017. As we promised, MySQL Transformations were supported throughout 2018 and will finally be shut down on January 31st, 2019.

Deprecated Bucket and Table attributes

We are deprecating Bucket and Table attributes. These attributes were used as configuration storage for legacy components. If you need to store additional information with buckets or tables, please use the Metadata API.


Security enhancements to GoodData projects access

Direct single sign-on access to GoodData projects from Keboola Connection has been improved and its security enhanced. As a result, access has been disabled for all existing users and you have to enable it again in the Writer's configuration.

Also, you are no longer able to switch between projects once you are in GoodData. You have to return to Connection and access the other project from there. Although this may be a little less convenient, it brings improved security and we believe it is a better solution for you.

Facebook Ads Extractor Failed Jobs

Between 1:00 and 8:20 CET on January 4th, 2019, some Facebook Ads extractor jobs failed after yesterday's update of the extractor. We rolled back to the previous working version and continue investigating the issue. Please review the latest jobs of your Facebook Ads extractor and restart them if needed.

We are sorry for this inconvenience.

Update 12:50 CET

The issue has been resolved. The problem was in configurations that used their own authorization token, i.e., were not authorized under the Keboola Facebook App.

Weeks in review -- January 2, 2019

New components and component updates

  • A new FTP extractor is now in beta. It is packed with the same features as the AWS S3 extractor and it is extensible with processors. We aim to gradually replace the current FTP extractor, which is developed by a third party.
  • A new fit-into-storage processor allows importing non-CSV files (TXT, JSON) into Storage by wrapping them into CSV tables.
  • The AWS S3 extractor fixes a bug with file matching. When the Subfolders option was off and the Wildcard option was on, the extractor would erroneously download files contained in subfolders exactly matching the key.
  • MongoDB extractor export mode is now set to "raw" by default on new configurations.

Enhancements

Input and output mapping for components is now parallelized. Configurations with a large number of tables in input and output mapping (except transformations) should now run considerably faster. This affects both extractors and writers. The most recently updated extractors (e.g. AWS S3, FTP, HTTP, MySQL, MSSQL, Storage) also load tables into Storage while other tables are still being extracted. Together, these changes reduce run time by up to 40% for some configurations.
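The general idea of parallelizing a mapping can be illustrated with a thread pool. This is only a sketch of the technique, not the platform's implementation: `map_tables_parallel`, `fake_export` and the worker count are hypothetical, and the loader stands in for whatever actually transfers one table.

```python
from concurrent.futures import ThreadPoolExecutor


def map_tables_parallel(table_names, load_one, max_workers=4):
    """Run load_one(name) for every table concurrently.

    Results are returned in the same order as table_names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one, table_names))


def fake_export(table_name):
    """Hypothetical stand-in for exporting a single table to a local file."""
    return f"{table_name}.csv"


files = map_tables_parallel(["in.c-main.orders", "in.c-main.customers"], fake_export)
```

Because each table transfer mostly waits on the network, threads are usually sufficient here; the gain grows with the number of tables in the mapping.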

Input mapping load type is now shown in input mapping details.

The Run Orchestration dialog now has the option to select/deselect all tasks.

Developers

The option to have gzipped CSV files in component output mapping (/data/out/tables/) was removed. Only plain CSV files are accepted now. To the best of our knowledge, this was never used.
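If a component did write gzipped files, one way to adapt is to decompress them before the component finishes. A minimal sketch, assuming the gzipped files use a `.csv.gz` suffix; the `gunzip_output_tables` helper is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_output_tables(tables_dir):
    """Replace every *.csv.gz file in tables_dir with a plain *.csv file."""
    converted = []
    for gz_path in sorted(Path(tables_dir).glob("*.csv.gz")):
        csv_path = gz_path.with_suffix("")  # strips only the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(csv_path, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream, so large tables fit in memory
        gz_path.unlink()  # remove the gzipped original
        converted.append(csv_path.name)
    return converted
```

A component would call `gunzip_output_tables("/data/out/tables/")` as its last step. Any manifest files that refer to the old file names would need to be renamed as well, which this sketch does not handle.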


New version of AdWords Extractor

We have just released a new version of AdWords Extractor. It works with AdWords API v201809 (see the Release notes).

The previous version of the extractor is deprecated and you can use our migration tool which will migrate your AWQL queries. As usual, you have to reauthorize the extractor and give it access to your AdWords data again. The previous version uses AdWords API v201802 which will be switched off on 30 January 2019.

Failing Jobs

Since 2018-12-19, 19:13:00 UTC we have been experiencing a higher rate of application errors in all regions due to an outage in AWS ECR.

We're investigating the issue and will update this post once we have more information.

We apologize for the inconvenience.

Update 19:30 UTC

This outage also affects the Developer Portal.

Update 20:05 UTC

As of 19:51 UTC the issue has been resolved by Amazon.

Failed Jobs

Between 23:16 and 23:22 UTC on December 14th, 2018, some jobs failed with an application error due to spikes in infrastructure load.

We are investigating the root cause and taking measures so that this does not repeat. 

We deeply apologize for the inconvenience caused.

Update [Dec 15, 2018, 08:19 CET]: This still happens occasionally.

Update [Dec 15, 2018, 10:57 CET]: We're still working on mitigating the issue. Compared to the historical hourly average, the current job error rate is 6% higher.

Update [Dec 15, 2018, 11:47 CET]: We're going to deploy a patch to production. It will take about 2 hours. Expected resolution: 14:00 CET

Update [Dec 15, 2018, 12:45 CET]: Patch has been deployed.