Snowflake Outage (US Region)

We encountered an increased number of Snowflake connection errors between 04:11 UTC and 04:21 UTC. This may have caused storage and component jobs to fail.

Furthermore, Snowflake announced possible SQL query failures between 02:24 UTC and 03:30 UTC.

We're sorry for this inconvenience. All systems are operational now.

Investigating incident (US region)

We're currently investigating an incident in the US region that may have caused component job and orchestration failures today, Tuesday, Sep 11, between 02:30 UTC and 06:30 UTC.

Update 08:30 UTC

The DB storing job locks (which prevent a job from running multiple times on multiple workers) was restarted at 02:33 UTC. All connections were terminated, and all component jobs and transformations running at that time were disconnected. This could lead to either or both of the following situations:

  • Failure at any time later during the job execution
  • A new parallel job execution

Any jobs started after the DB restart were not affected by this issue.
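To illustrate why a restart can cause both symptoms, here is a minimal sketch of connection-scoped locking. It is not the actual implementation: `LockStore`, the worker names, and the job ID are hypothetical, and the lock database is replaced by an in-memory dictionary. The assumption is that a lock lives only as long as the connection that took it.

```python
class LockStore:
    """In-memory stand-in for a database holding connection-scoped job locks."""

    def __init__(self):
        self._locks = {}  # job_id -> connection_id currently holding the lock

    def acquire(self, job_id, connection_id):
        """Return True if connection_id holds the lock for job_id afterwards."""
        holder = self._locks.setdefault(job_id, connection_id)
        return holder == connection_id

    def holds(self, job_id, connection_id):
        """Return True if connection_id still holds the lock for job_id."""
        return self._locks.get(job_id) == connection_id

    def restart(self):
        """Simulate a restart: all connections drop, so all locks are lost."""
        self._locks.clear()


store = LockStore()

# Worker A picks up the job; worker B is refused while A holds the lock.
assert store.acquire("job-42", "worker-a")
assert not store.acquire("job-42", "worker-b")

store.restart()

# After the restart, A no longer holds its lock (so it may fail later),
# and B can acquire it, which starts a second, parallel execution.
a_still_holds = store.holds("job-42", "worker-a")
b_acquired = store.acquire("job-42", "worker-b")
```

In this sketch, a job started after the restart takes a fresh lock and behaves normally, which matches the observation that later jobs were not affected.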

We apologize for this inconvenience. We're planning an infrastructure change to prevent such a large impact in similar situations.

If you have any further questions, please use the support button in your project.

Restbox Component Deprecation Announcement

The time has come and the Restbox component as we know it is now being deprecated. But do not worry, its functionality will not disappear.

The Restbox component was once a great tool to collect data from various sources that were not complex enough to deserve a dedicated extractor (or writer) and shared similar processing aspects, such as decompression, audit tools, and CSV formatting. This has now been broken up into multiple separate components and processors to achieve or exceed the required functionality.

Today we have the following components to replace the Restbox component:

Detailed documentation for most of them can be found at help.keboola.com.

The Restbox component is now available in the US region only, and, as of this announcement, no new configurations can be created. The component will be supported until November 1, 2018, when it will be shut down for good. Please migrate your configurations to one of the above-mentioned components.

If you have any trouble migrating your configurations, please do not hesitate to contact us using the support button in your project.

Week in Review -- June 19, 2018

Updated Components

Minor Improvements

  • When setting up MFA in Keboola Connection, the code can now be displayed in plain text where a QR code is not available.
  • We have changed the Docker storage driver for Docker Runner job workers. We hope this will stabilize Docker response times and minimize startup and shutdown overhead. The improvement should be most noticeable when running short jobs.
  • Editing components in Keboola Developer Portal got a facelift.

Docker Jobs Application Errors

Unfortunately, tonight there were a few more unexpected application errors and delayed or longer-running jobs between 1:25am and 4:55am CEST (4:25pm–7:55pm PT) in the US region.

We experimented with different storage drives (swapping from SSD to throughput-optimized HDD), which led to initial issues with building Custom Science apps. Attempts to provision further resources led to too many jobs running at once (you could see "SQLSTATE[HY000] [1040] Too many connections" in the failed app events), and removing some of the additional resources could have caused some other application errors too.

Currently we're running SSD drives again with enough resources to process all workloads. Please restart your failed jobs.

We hope we'll be able to stabilize this whole unfortunate situation as soon as possible, and we're very sorry for the inconvenience.

Docker Jobs Application Errors

Unfortunately, we were unable to find a fix for yesterday's failures, so on Thursday, June 7th, between 3:49am CEST and 7:38am CEST (1:49am–5:38am UTC, 6:49pm–10:38pm PT) there was an increased application error rate on our Docker host instances in the US region.

The servers are now stabilized and it is safe to restart the failed jobs.

We're looking into this issue. We have started additional instances to help with the load and we'll be looking into the HW architecture of the instances to help us figure out what causes the issue. Meanwhile we'll try to implement a retry on such failed jobs.
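A retry of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions, not the planned implementation: `ApplicationError`, `run_with_retry`, and `flaky_job` are hypothetical names, and the exponential backoff is one possible policy among several.

```python
import time


class ApplicationError(Exception):
    """Hypothetical marker for a transient, infrastructure-level failure."""


def run_with_retry(job, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run job(); on ApplicationError, retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except ApplicationError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))


# Usage: a job that fails twice before succeeding on the third attempt.
attempts = []


def flaky_job():
    attempts.append(1)
    if len(attempts) < 3:
        raise ApplicationError("docker host unavailable")
    return "success"


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_retry(flaky_job, sleep=delays.append)
```

Automatic retries are only safe for jobs that can be run again without side effects, so a real implementation would likely need to restrict them to failures that occur before the job has written any data.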

We're sorry for this inconvenience.

Docker Jobs Application Errors

On June 6th between 2:15am CEST and 2:35am CEST (5:15pm PT and 5:35pm PT, 12:15am UTC and 12:35am UTC) there was an increased rate of application errors on one of our Docker host instances in the US region. The instance is now fully operational and the jobs are safe to restart.

Furthermore, one of our EU region Docker host instances went down at 6:56am CEST and caused a few unexpected application errors. A new instance is now in place, and we recommend restarting any failed jobs.

We're sorry for this inconvenience. We're working on preventing these errors in the future.


Unexpected Job Failures

On April 28, between 2:30 and 3:15 UTC, there was a high rate of application errors on one of our instances processing component jobs.

The instance was under heavy load, and we're investigating the root cause. The instance is now back to normal, and it is safe to restart the jobs.

We're sorry for any inconvenience.