Failing Jobs

Since 2018-12-19, 19:13:00 UTC, we've been experiencing a higher rate of application errors in all regions due to an outage in AWS ECR.

We're investigating the issue and will update this post once we know more.

We apologize for the inconvenience.

Update 19:30 UTC

This outage also affects the Developer Portal.

Update 20:05 UTC

As of 19:51 UTC, the issue has been resolved by Amazon.

Failed Jobs

Between 23:16 and 23:22 UTC on December 14th, 2018, some jobs failed with an application error due to spikes in infrastructure load.

We are investigating the root cause and taking measures so that this does not repeat. 

We deeply apologize for the inconvenience caused.

Update [Dec 15, 2018, 08:19 CET]: This still happens occasionally.

Update [Dec 15, 2018, 10:57 CET]: We're still working on mitigating the issue. Compared to the historical hourly average, the current job error rate is 6% higher.

Update [Dec 15, 2018, 11:47 CET]: We're going to deploy a patch to production. It will take about 2 hours. Expected resolution: 14:00 CET.

Update [Dec 15, 2018, 12:45 CET]: The patch has been deployed.

Outage Post-Mortem: December 13th, 2018

There was an outage of the Keboola Connection platform in the US region from Dec 13, 2018, 23:19 CET to Dec 14, 2018, 02:28 CET. It was caused by an update of the Elasticsearch Service.

It was a self-service update of the managed Amazon Elasticsearch Service, and we decided to do it because updates were tested and all other clusters we're using were updated successfully - without any issues.

Updates like this usually go smoothly, but not this time. Unfortunately, the cluster froze and refused to accept any requests.

After detecting this, we decided to start a new service and restore from backup.

All services were fully restored to their normal state by Dec 14, 2018, at 02:28 CET.

We have already taken action to prevent this kind of failure from happening again, and we'll be testing all future updates on testing clusters (using snapshots from production clusters).

Any orchestrations in the US region that were scheduled to start during the outage were not started. Therefore you will need to run them manually or wait until their next scheduled run.

We want to sincerely apologize for the inconvenience caused by this outage.

We are experiencing momentary technical difficulties in US

Since around 23:30 CET on Dec 13, 2018, we have observed technical issues related to our job servers. We are continuing to investigate and will update this post as soon as we know more.

  • Update: [Dec 13, 2018 23:55 CET] - We're hot on the trail and expect to have the issue resolved shortly.
  • Update: [Dec 14, 2018 02:43 CET] - Service is now fully restored in the US region.

We're very sorry for any inconvenience.

To make recovery faster, we have boosted Snowflake performance for the next 24 hours.

You'll hear from us with a post-mortem soon.

Weeks in review -- December 10, 2018

New Features

Partial Label

We show a PARTIAL label for jobs that ran only part of a configuration rather than the whole of it. Typically these are jobs where only one transformation is run from a bucket, or where one file is exported using the AWS S3 extractor, HTTP extractor, etc.

Transformation description, Last Runs, and Updates

  • There's a new option to save a transformation with a description.

  • The Last Runs and Updates sections have also been added to the Transformation Detail page and to all components that support "single runs" (e.g. when the HTTP extractor or AWS S3 extractor is extracting one file).

Input Mapping data types

In transformations with the Snowflake backend, data types should be populated automatically for tables created by components that set data types: MySQL extractor, Oracle extractor, MSSQL Server extractor, PostgreSQL extractor, DB2 extractor, and Snowflake extractor.

Input Mapping load type

For the Snowflake backend, you can set the Clone Table load type, which leads to an ultra-fast load to a workspace. Most of your tables should be loaded in under 10 seconds.

You can find more about this feature in our documentation.


Enhancements

  • You can see the component type in Orchestration Tasks, so having both a MySQL writer and a MySQL extractor in the tasks is no longer confusing
  • API Token can be Refreshed, Sent or Deleted directly from the token detail page

  • We improved how data is loaded to Storage from our components: compression is now used by default. Loads to Storage are about 40% faster for sliced files, which also affects Credit Consumption.
  • MSSQL Server Extractor has a new WITH(NOLOCK) option and supports incremental fetching on smalldatetime columns
  • The Orchestration Notifications page has been improved, and notifications are easier to set up

New components and component updates

New Orchestration Detail page

The orchestration section is one of the last parts of Keboola Connection which does not yet have a unified user interface. We believe that users should feel comfortable using the different parts of our UI, so unifying the interface elements is very important to us.

We have launched the new Orchestration Detail page.

What has changed?

  • There is no longer a sidebar on the left side with a list of orchestrations
  • Tasks, Schedule and Notifications now have their own place in the detail page - they're no longer combined in that table
  • We added a new sidebar on the right side, which should feel familiar because it is the same as in other components
  • Orchestration Action buttons (Run, Enable/Disable and Delete) have been moved to the right sidebar
  • Information about the creation date and updates now also has its own place in the right sidebar

This is the beginning of the Orchestration Interface tuning and there are more things to come.

---

Just for completeness, this is the previous version:

28 November, 2018 -- Storage Job Failures in EU

During deployment of Keboola Connection in the EU region at around 13:30 CET today there was a timing issue that resulted in some failed storage jobs.

The problem occurred because of the asynchronous timing between the rollover of a worker and the API server. A small number of jobs that were created with the previous version then failed when executed against the new API. These jobs failed with one of the following error messages: "Invalid source type" or "Workspace not found".

There was no corruption of data by these failures as no jobs that were in progress were affected.

 

28 November, 2018 -- Stalled Jobs in the EU region

Around 5:00am CEST, one of the job worker instances stopped processing assigned jobs. This could have led to jobs being stuck in the processing state for a long time without any activity.

At 9:30am CEST the worker instance was terminated and all unfinished jobs started processing on other instances.

We're sorry for this inconvenience.

Week in review -- November 27, 2018

New Features

You can now provide your project with a markdown description on your project's Overview page

Updated Components

SQL Server Extractor now supports incremental fetching on DateTime and Identity columns
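
For readers unfamiliar with the term, incremental fetching generally means remembering the highest value of an ordered column from the previous run and extracting only the rows above it. The following is a minimal sketch of that idea, not the extractor's actual implementation; the `orders` table, the `fetch_increment` helper, and the use of in-memory SQLite in place of SQL Server are all assumptions made for illustration.

```python
import sqlite3


def fetch_increment(conn, last_seen):
    """Return rows with id greater than last_seen, plus the new high-water mark.

    Hypothetical helper: the table and column names are illustrative only.
    """
    rows = conn.execute(
        "SELECT id, name FROM orders WHERE id > ? ORDER BY id", (last_seen,)
    ).fetchall()
    # Keep the previous mark when no new rows arrived.
    new_last_seen = rows[-1][0] if rows else last_seen
    return rows, new_last_seen


# In-memory SQLite stands in for the source database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO orders (name) VALUES (?)", [("a",), ("b",)])

# First run: nothing has been fetched yet, so start from 0.
first_batch, state = fetch_increment(conn, 0)

# A new row arrives between runs; the second run picks up only that row.
conn.execute("INSERT INTO orders (name) VALUES (?)", ("c",))
second_batch, state = fetch_increment(conn, state)
```

The same pattern applies to a DateTime column: the stored state is then the latest timestamp seen rather than the largest identity value.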

Unexpected Events

Many of you have noticed that we added validation of SQL queries to the Transformation Detail page last week.

Due to some availability problems with this feature, we decided to revert this change.

For now you can still use the "standard" validation button on the right.

We love this feature too and the plan is to bring it back after an additional round of improvements.