Job delays on the connection.eu-central-1.keboola.com stack

2024-10-19 10:26 UTC We have noticed a slowdown in the processing of jobs on the https://connection.eu-central-1.keboola.com stack; jobs should not end with an error.

Update 2024-10-19 10:57 UTC The problem has been identified and resolved; the platform should be stable again.

We apologize for the inconvenience.

Scheduled Partial Maintenance for Keboola AWS, Azure, and GCP Stacks – October and November 2024

We would like to inform you about the planned maintenance of Keboola stacks hosted on AWS, Azure, and GCP.

This maintenance is necessary to keep our services running smoothly and securely. Please note the following schedules and the stacks affected.

Maintenance of all Azure Stacks – October 26, 2024

During database upgrades there will be a short service disruption on all Azure stacks, including all single-tenant stacks and Azure North Europe multi-tenant stack (connection.north-europe.azure.keboola.com). This will take place on Saturday, October 26, 2024 between 11:30 and 12:30 UTC.

Effects of the Maintenance

During the above period, services will be scaled down and the processing of jobs may be delayed. For a very brief period (at around 12:00 UTC) the service will be unavailable for up to 10 minutes and APIs may respond with a 500 error code. After that, all services will scale up and start processing all jobs. No running jobs, data apps, or workspaces will be affected. Delayed scheduled flows and queued jobs will resume after the maintenance is completed.
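If your own automation calls Keboola APIs during this window, a simple retry with backoff should ride out the brief 500 responses. The sketch below is only an illustration under assumptions: the endpoint URL, job ID, and token header are placeholders rather than a statement of the exact API contract.

```python
import time
import requests

# Placeholders (assumptions): substitute the actual URL and token your automation uses.
URL = "https://connection.north-europe.azure.keboola.com/v2/storage/jobs/12345"
HEADERS = {"X-StorageApi-Token": "YOUR_TOKEN"}

def get_with_retry(url, headers, attempts=6, base_delay=30):
    """Retry on 5xx responses, which may appear briefly during the maintenance window."""
    resp = None
    for attempt in range(attempts):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code < 500:
            return resp  # success, or a client error that should not be retried
        # Server error: wait and try again; the disruption lasts only a few minutes.
        time.sleep(base_delay * (attempt + 1))
    resp.raise_for_status()

response = get_with_retry(URL, HEADERS)
print(response.status_code)
```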

Detailed Schedule

  • 11:30–12:00 UTC: processing of new jobs stops.

  • 12:00–12:15 UTC: service disruption.

  • 12:15 UTC: processing of jobs starts.

GCP EU and US Stack Maintenance – November 2, 2024

During database upgrades there will be a short service disruption on both GCP multi-tenant stacks. Here is the schedule.

Effect of the Maintenance

During the above periods, services will be scaled down and the processing of jobs may be delayed. For very brief periods (at around 8:00 UTC and 8:30 UTC, respectively) the service will be unavailable for up to five minutes and APIs may respond with a 500 error code. After that, all services will scale up and start processing all jobs. No running jobs, data apps, or workspaces will be affected. Delayed scheduled flows and queued jobs will resume after the maintenance is completed.

Detailed Schedule for eu-west3

  • 7:30–8:00 UTC: processing of new jobs stops.

  • 8:00–8:10 UTC: service disruption.

  • 8:15 UTC: processing of jobs starts.

Detailed Schedule for us-east4

  • 8:00–8:30 UTC: processing of new jobs stops.

  • 8:30–8:40 UTC: service disruption.

  • 8:45 UTC: processing of jobs starts.

AWS EU and US Stack Maintenance – November 16, 2024

During database upgrades there will be a limited service disruption on our AWS multi-tenant stacks. Here is the schedule.

Effect of the Maintenance

The maintenance is expected to last no longer than 15 minutes, during which jobs may be delayed. While you will be able to log into the Keboola platform, starting new jobs will not be possible during the maintenance. Jobs already running will not be canceled—only delayed. Running data apps or workspaces will not be affected. Scheduled jobs will automatically start after the maintenance is completed.

Orchestrator stuck on application error during job creation on north-europe.azure.keboola.com

2024-10-14 16:30 UTC We are observing a small number of instances where errors occur during job creation, and you may encounter the error message: “Decryption failed: Deciphering failed.” As a result, orchestrations may become stuck in the terminate state. If you experience this issue, please contact our support team.

We are actively investigating the situation and will provide an update later this evening.

2024-10-14 21:40 UTC We have successfully identified the affected orchestrations and deployed a fix that automatically terminates them. We now consider this incident resolved. We sincerely apologize once again for the inconvenience caused.

Errors in the AWS EU Stack

We are experiencing problems on our AWS EU stack (https://connection.eu-central-1.keboola.com/). We are deeply sorry for the inconvenience this may cause. In the user interface, you may see error alerts or slower job processing. Next update in 30 minutes.

Sep 26 08:34 UTC: We identified and fixed an overload on one of our Kubernetes nodes. All systems are now running normally. We’ve implemented measures to prevent recurrence.

Thank you for your patience.

FTP and S3 extractors potential primary key modifications

2024-09-23 13:48 UTC - We are currently investigating potential modifications to the primary key for the FTP and S3 extractors that occurred around September 13th, 2024. The issue has already been reverted, and we are conducting an analysis. We will provide more information as soon as we have further details.

UPDATE 2024-09-23 16:36 UTC - Our analysis confirms that no projects on single-tenant stacks were affected by the issue. We are continuing with the analysis of multi-tenant stack projects and will provide more information as soon as we have further details.

UPDATE 2024-09-25 9:30 UTC - Our analysis has been completed, and we now have a list of affected configurations. The issue with potential primary key modifications may have impacted not only FTP and S3 extractors but also other components using the Processor Create Manifest. We would like to highlight that not all configurations with this processor were affected.

In cases where a configuration experienced a primary key modification, the key was automatically restored after the next run. However, a small number of configurations did not revert to their original primary key due to duplicate records in the table.

These cases are limited, and the clients affected by this issue will be contacted individually by our support team today with further steps and recommendations.
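If you would like to verify on your side whether a table holds duplicate records on its intended primary key columns (the condition that prevented the automatic restore described above), a minimal local check on an exported CSV could look like the sketch below; the file name and key columns are placeholder assumptions.

```python
import csv
from collections import Counter

# Placeholders (assumptions): an exported copy of the table and the intended primary key columns.
FILE = "table_export.csv"
KEY_COLUMNS = ["id"]

with open(FILE, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    # Count how many rows share each combination of key column values.
    keys = Counter(tuple(row[c] for c in KEY_COLUMNS) for row in reader)

duplicates = {k: n for k, n in keys.items() if n > 1}
if duplicates:
    print(f"{len(duplicates)} key value(s) occur more than once; deduplicate before restoring the primary key.")
else:
    print("No duplicate key values found.")
```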

If you have any questions or concerns, please reach out to our support team.

Degraded Performance on GCP EU Stack

Sep 20 15:15 UTC: We are experiencing degraded performance on our GCP EU stack (https://connection.europe-west3.gcp.keboola.com/). We are deeply sorry for the inconvenience this may cause and appreciate your patience as we work through it. We will provide further updates as soon as we have more information.

Sep 22 18:53 UTC update: We have gained additional understanding of the performance degradation. The root cause appears to be an intermittent slowdown of query execution on Snowflake. While the execution of a single query is not delayed significantly, the delays accumulate into noticeable slowdowns of minutes on transformations consisting of multiple queries, and even more for entire flows. We are in touch with Snowflake support to uncover all the technical details.

The symptoms of the performance degradation include longer job run times, especially for Snowflake transformations; Data Source and Data Destination jobs are also affected because they load and unload data from a Snowflake database. The occurrence is random and somewhat time-dependent, so not all flows are affected in the same manner. The performance degradation does not cause any errors.

Sep 24 07:10 UTC update: We have now confirmed the root cause to be in slower execution of Snowflake queries at certain times. We have implemented a temporary resource increase to improve the situation. This means that you should see improved job run times. We're still working with Snowflake support on the solution.

Oct 1 7:54 UTC update: We are still working on the resolution together with Snowflake support. The temporary resource increase is still in place. This means that the situation is contained and overall stack performance should be acceptable, though not perfect.

At this moment we don't have a solution ready, nor an ETA for one. Thank you for your understanding and patience. We will provide further updates as soon as we have more information.

Issues across all Azure stacks

We're investigating issues across all Azure stacks. Next update in 15 minutes or when new information is available.

UPDATE 14:35 CEST: The issue seems to be resolved; we are still evaluating the impact. Next update in 30 minutes or when new information is available.

UPDATE 14:55 CEST: The outage was caused by routine maintenance of databases in Azure. All running jobs affected during this outage should restart automatically. Some of the affected Storage jobs may be executed twice, and in very rare cases, when using incremental load without a primary key, this could lead to data duplication.
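If you suspect that a re-executed Storage job duplicated rows in a table loaded incrementally without a primary key, one way to inspect and clean an exported copy locally is sketched below, assuming full-row duplicates are the symptom; the file names are placeholders.

```python
import csv

IN_FILE = "table_export.csv"   # placeholder: exported copy of the affected table
OUT_FILE = "table_dedup.csv"   # placeholder: deduplicated output

with open(IN_FILE, newline="", encoding="utf-8") as src:
    reader = csv.reader(src)
    header = next(reader)
    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row)  # compare whole rows, since there is no primary key
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)

with open(OUT_FILE, "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(header)
    writer.writerows(unique_rows)

print(f"Kept {len(unique_rows)} unique rows from the original file.")
```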

We're sorry for this inconvenience; we will be taking measures to decrease the impact of future maintenance events.

For any additional questions, please contact support.