Job errors in the EU region [post-mortem]

Summary

On Sunday, September 15th at 01:32 UTC, orchestrator and other component jobs started failing in the EU region. In the following hours, our worker servers weren't able to handle the workload, and the job backlog started to increase. We manually resolved the incident, and the platform was in full operation with a clean backlog at 08:26 UTC.

What happened?

One of the MySQL instances was automatically restarted and patched on September 15th at 01:32 UTC.
The instance is required for the lock mechanism for job processing, and it also stores information about queues for the worker servers. The 2-minute downtime of the database instance caused the jobs running at that moment to fail. Additionally, the running workers weren't able to fetch the information about the queues, and some of them gave up restarting and stopped. With only half of the processing capacity left, the workload could not be processed.
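The failure mode described above can be illustrated with a minimal sketch. It is not the platform's actual worker code: `fetch_queues`, `QueueStoreUnavailable`, the queue names, and the retry limits are all hypothetical and chosen only for illustration. It shows how a worker with a small, fixed retry budget gives up during a 2-minute outage, while a worker that keeps backing off survives it.

```python
import itertools


class QueueStoreUnavailable(Exception):
    """Hypothetical error raised while the database instance is down."""


def make_fetch_queues(outage_seconds, clock):
    """Build a hypothetical fetch function that fails until the outage ends."""
    def fetch_queues():
        if clock["now"] < outage_seconds:
            raise QueueStoreUnavailable("queue store is restarting")
        return ["orchestrator", "components"]
    return fetch_queues


def fetch_with_retries(fetch, clock, max_attempts, base_delay=1.0, max_delay=30.0):
    """Retry fetch with capped exponential backoff.

    max_attempts=None means retry indefinitely. Time is simulated by
    advancing clock["now"] instead of sleeping, so the sketch runs instantly.
    """
    attempts = itertools.count(1) if max_attempts is None else range(1, max_attempts + 1)
    for attempt in attempts:
        try:
            return fetch()
        except QueueStoreUnavailable:
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            clock["now"] += delay  # stand-in for time.sleep(delay)
    return None  # retry budget exhausted: the worker gives up


def simulate(max_attempts, outage_seconds=120):
    """Run one simulated worker through an outage of the given length."""
    clock = {"now": 0.0}
    fetch = make_fetch_queues(outage_seconds, clock)
    return fetch_with_retries(fetch, clock, max_attempts)
```

With these assumed delays, a worker limited to five attempts stops after about 31 simulated seconds, well before the 2-minute outage ends, whereas an unbounded worker reconnects once the instance is back.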

Once we discovered the incident, we replaced all our worker servers and added more capacity to clean up the backlog faster.

What are we doing about this?

We have implemented notifications about upcoming instance patches and are going to perform updates during scheduled and announced maintenance windows.

We are also working on a completely new job processing and scheduling mechanism that will prevent similar issues from occurring down the road. We sincerely apologize for the inconvenience caused.


Week in Review -- September 16, 2019

New Features, Improvements and Minor Fixes


New Components

Job errors in EU region

We are investigating job failures in the EU region that started at 01:32 UTC.

We will provide an update when we have more information.

UPDATE 06:06 UTC - We have identified the issue and fixed the cause. Backlog is processing now.

UPDATE 07:54 UTC - There is still a backlog of orchestration jobs. We have increased the processing capacity. The backlog should be cleared within half an hour.

UPDATE 08:26 UTC - The backlog was cleared. All services are running.

We apologize for the inconvenience. We'll share more details in a post-mortem.

Week in Review -- September 9, 2019

New Features, Improvements and Minor Fixes

  • When checking events in the job detail, the Load More button tries to load up to 1,000 events at once.
  • You can search in component configurations when adding a new task to an orchestration phase.

  • Refreshing your token no longer breaks access to your project.
  • The Remove Empty Files and Folders processor has a new option for removing files that contain only whitespace characters: `remove_files_with_whitespace`
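To make the effect of `remove_files_with_whitespace` concrete, here is a minimal sketch of the behavior the option name suggests. It is not the processor's implementation: the function names are hypothetical, and treating "whitespace" as ASCII whitespace is an assumption. Removal of empty folders is left out.

```python
from pathlib import Path


def should_remove(path: Path, remove_files_with_whitespace: bool = False) -> bool:
    """Decide whether a file counts as empty (hypothetical helper).

    Zero-byte files always count as empty. With the option enabled, files
    containing nothing but whitespace count too.
    """
    data = path.read_bytes()
    if not data:
        return True
    # bytes.strip() with no arguments removes ASCII whitespace only (assumption).
    return remove_files_with_whitespace and not data.strip()


def remove_empty_files(folder: Path, remove_files_with_whitespace: bool = False) -> list:
    """Delete matching files under folder and return their names, sorted by path."""
    removed = []
    # sorted() materializes the listing before any file is deleted.
    for path in sorted(folder.rglob("*")):
        if path.is_file() and should_remove(path, remove_files_with_whitespace):
            path.unlink()
            removed.append(path.name)
    return removed
```

In this sketch the option defaults to off, so a file holding only a newline or a few spaces is kept unless the option is explicitly enabled.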


New Components

Job failures and timeouts due to AWS outage in US

We are experiencing some job failures and timeouts in the US region due to an outage in one availability zone of the Amazon Elastic Compute Cloud service. We are monitoring the situation and will keep you posted.

UPDATE 15:48 CEST: The issue apparently affects other AWS services as well, because login to the Developer Portal (which uses AWS Lambda and Cognito) times out intermittently.

UPDATE 15:54 CEST: AWS confirms that some EC2 instances are impaired and some EBS volumes are experiencing degraded performance within a single Availability Zone in the US-EAST-1 Region. Some EC2 APIs are also experiencing increased error rates and latencies. They are working to resolve the issue.

UPDATE 16:37 CEST: Work on resolving the issue is still in progress.

UPDATE 17:06 CEST: The impaired instances and EC2 APIs are being recovered. AWS support continues to work towards recovery for all affected EC2 instances.

UPDATE 18:04 CEST: Recovery is in progress for instance impairments and degraded EBS volume performance. On our side, it looks like the problems largely disappeared about an hour ago, and the platform is back to normal.

Broken Login [Post-Mortem]

Summary

On 2019-08-15 at 12:23 UTC, we deployed a broken version of Keboola Connection. It prevented some users from accessing their projects. The problem was fixed at 13:01 UTC when we rolled back to a previous version. We sincerely apologize for interrupting your work and wasting your time.

What Happened?

There was an error in a permission check, and only users with the permission to create a project were allowed to enter a project. Such a scenario is not covered in the functional tests, and the situation was overlooked during peer review of the change. As soon as we identified the problem, we immediately deployed a previous version of Keboola Connection. That itself took about 15 minutes.
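As an illustration of the kind of scenario that was missing from the functional tests, here is a minimal sketch. The actual code and the exact shape of the bug are not described in this post, so `User`, both check functions, and the scenario matrix are hypothetical reconstructions. The idea is to test every combination of project membership and the "create project" permission instead of a single happy path.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    """Hypothetical user model for this sketch."""
    name: str
    can_create_projects: bool = False
    member_of: set = field(default_factory=set)


def can_enter_project_broken(user: User, project_id: str) -> bool:
    # Hypothetical reconstruction of the bug: membership is checked, but the
    # unrelated "create project" permission is required as well.
    return project_id in user.member_of and user.can_create_projects


def can_enter_project(user: User, project_id: str) -> bool:
    # Intended rule in this sketch: membership alone grants access.
    return project_id in user.member_of


def check_access_matrix(check) -> list:
    """Return the names of the scenarios where check gives the wrong answer."""
    scenarios = [
        (User("member", False, {"p1"}), "p1", True),
        (User("creator-member", True, {"p1"}), "p1", True),
        (User("creator-outsider", True, set()), "p1", False),
        (User("outsider", False, set()), "p1", False),
    ]
    return [user.name for user, project_id, expected in scenarios
            if check(user, project_id) != expected]
```

A test suite that only logs in as a user who can also create projects would pass against both functions. The plain "member" row is the one that separates them.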

What Are We Doing About It?

We're extending the software tests to include more scenarios.

We're also updating monitoring alarms to make sure that we know about a problem before you tell us through our support channel.


Week in Review -- August 16, 2019

New Features

  • Tables in shared buckets and table aliases now contain metadata from the source tables. In practice, this means that when you create an input mapping for transformations using tables in shared buckets (or aliases), you see the source table data types.

  • In the API response, the source table metadata are contained in the `sourceTable` node – so both the table and alias metadata are available.
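A minimal sketch of reading both sets of metadata from such a response follows. Only the `sourceTable` node name comes from the text above. The `metadata` field name, the table IDs, and the key/value entries in the example are assumptions made for illustration and may not match the real response.

```python
def collect_metadata(table: dict) -> dict:
    """Return the table's own metadata and its source table's metadata.

    Assumes (hypothetically) that each node carries a "metadata" list.
    Tables that are not aliases or shared have no "sourceTable" node,
    so the "source" list is empty for them.
    """
    source = table.get("sourceTable") or {}
    return {
        "table": table.get("metadata", []),
        "source": source.get("metadata", []),
    }


# Hypothetical response fragment for an alias table.
example = {
    "id": "in.c-shared.orders",
    "metadata": [{"key": "example.description", "value": "Alias of orders"}],
    "sourceTable": {
        "id": "in.c-main.orders",
        "metadata": [{"key": "example.owner", "value": "sales"}],
    },
}

result = collect_metadata(example)
```

Keeping the two lists separate, instead of merging them, preserves the information about which entries belong to the alias and which were inherited from the source table.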

New Components

A number of components by Revolt BI:

Updated Components

GoodData Writer

  • Supports reading the Logical Data Model (LDM) from a project.

Minor Improvements & Fixes

  • The Google Drive verification issue in the EU region has been resolved.

  • In the input mapping of Snowflake transformations, the TIMESTAMP data type now defaults to TIMESTAMP_NTZ.

  • A terminated job is now colored the same way as the terminated label.