Orchestrator stuck on application error during job creation on north-europe.azure.keboola.com

2024-10-14 16:30 UTC We are observing a small number of cases where job creation fails with the error message “Decryption failed: Deciphering failed.” As a result, orchestrations may become stuck in the terminating state. If you experience this issue, please contact our support team.

We are actively investigating the situation and will provide an update later this evening.

2024-10-14 21:40 UTC We have identified the affected orchestrations and deployed a fix that automatically terminates them. We now consider this incident resolved. We sincerely apologize for the inconvenience caused.

Errors in the AWS EU Stack

We are experiencing problems on our AWS EU stack (https://connection.eu-central-1.keboola.com/). We are deeply sorry for the inconvenience this may cause. In the user interface, you may encounter error alerts or slow job processing. Next update in 30 minutes.

Sep 26 08:34 UTC: We identified and fixed an overload on one of our Kubernetes nodes. All systems are now running normally. We’ve implemented measures to prevent recurrence.

Thank you for your patience.

Potential primary key modifications in FTP and S3 extractors

2024-09-23 13:48 UTC - We are currently investigating potential modifications to the primary key for the FTP and S3 extractors that occurred around September 13th, 2024. The issue has already been reverted, and we are conducting an analysis. We will provide more information as soon as we have further details.

UPDATE 2024-09-23 16:36 UTC - Our analysis confirms that no projects on single-tenant stacks were affected by the issue. We are continuing with the analysis of multi-tenant stack projects and will provide more information as soon as we have further details.

UPDATE 2024-09-25 09:30 UTC - Our analysis has been completed, and we now have a list of affected configurations. The issue with potential primary key modifications may have impacted not only FTP and S3 extractors but also other components using the Processor Create Manifest. We would like to highlight that not all configurations with this processor were affected.

In cases where a configuration experienced a primary key modification, the key was automatically restored after the next run. However, a small number of configurations did not revert to their original primary key due to duplicate records in the table.

These cases are limited, and the clients affected by this issue will be contacted individually by our support team today with further steps and recommendations.

If you have any questions or concerns, please reach out to our support team.

Degraded Performance on GCP EU Stack

Sep 20 15:15 UTC: We are experiencing degraded performance on our GCP EU stack (https://connection.europe-west3.gcp.keboola.com/). We are deeply sorry for the inconvenience this may cause and appreciate your patience as we work through it. We will provide further updates as soon as we have more information.

Sep 22 18:53 UTC update: We have gained additional understanding of the performance degradation. The root cause appears to be an intermittent slowdown of query execution on Snowflake. While a single query is not delayed significantly, the delays accumulate into noticeable slowdowns of minutes on transformations consisting of multiple queries, and even more for entire flows. We're in touch with Snowflake support to uncover all the technical details.

The symptoms of the performance degradation include longer job run times, especially on Snowflake transformations; Data Source jobs and Data Destination jobs are also affected because they load/unload data from a Snowflake database. The degradation occurs randomly and is somewhat time-dependent, so not all flows are affected in the same manner. It does not cause any errors.

Sep 24 07:10 UTC update: We have now confirmed the root cause to be slower execution of Snowflake queries at certain times. We have implemented a temporary resource increase to improve the situation, so you should see improved job run times. We're still working with Snowflake support on a solution.

Oct 1 7:54 UTC update: We are still working on the resolution together with Snowflake support. The temporary resource increase is still in place. This means that the situation is contained and overall stack performance should be acceptable, but not perfect.

At this moment we don't have a solution ready, nor an ETA for one. Thank you for your understanding and patience. We will provide further updates as soon as we have more information.

Issues across all Azure stacks

We're investigating issues across all Azure stacks. Next update in 15 minutes or when new information is available.

UPDATE 14:35 CEST: The issue seems to be resolved; we're still evaluating the impact. Next update in 30 minutes or when new information is available.

UPDATE 14:55 CEST: The outage was caused by routine maintenance of databases in Azure. All running jobs affected during this outage should restart automatically. Some of the affected Storage jobs may be executed twice, and in very rare cases, when using incremental load without a primary key, this could lead to data duplication.

We're sorry for this inconvenience, and we will be taking measures to decrease the impact of future maintenance events.

For any additional questions, please contact support.

New and improved Jobs API endpoint

We've introduced a new and improved Jobs API endpoint, /search/jobs, which replaces the current /jobs endpoint. 

The new endpoint should be a near drop-in replacement, supporting the same parameters and returning the same results, but with improved performance, especially for larger projects and complex filters.

The old endpoint is now deprecated and will be removed at some point in the future (to be announced separately).

A major difference is that the new endpoint returns results from a secondary database, which is synchronized with a slight delay. 

This means a new job may not be shown right away after it's created, and the job detail may contain slightly outdated data if it was updated recently. The delay should be a couple of seconds at most during normal operation.
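If your automation creates a job and then immediately looks it up via /search/jobs, it may need to tolerate this delay. Below is a minimal sketch of a retry loop; the host name, token header, and the id filter parameter are assumptions for illustration only, so please check the Job Queue API documentation for the authoritative details:

```python
import time
import requests

# Hypothetical values -- substitute your stack's Job Queue API host and token.
BASE_URL = "https://queue.keboola.com"
HEADERS = {"X-StorageApi-Token": "YOUR_TOKEN"}

def wait_until_searchable(job_id, timeout=10.0, interval=1.0):
    """Poll /search/jobs until a freshly created job appears.

    The secondary database syncs within a couple of seconds under
    normal operation, so a short timeout should suffice.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{BASE_URL}/search/jobs",
            headers=HEADERS,
            params={"id": job_id},  # assumed filter parameter
            timeout=30,
        )
        resp.raise_for_status()
        if resp.json():  # non-empty result means the job is now visible
            return True
        time.sleep(interval)
    return False
```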

Another difference is that the new endpoint returns a maximum of 500 items per page.
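For callers that previously fetched large result sets in a single request, this means paging through results. Here is a minimal sketch, assuming the endpoint supports the same limit/offset style of paging as the old one (the parameter names are an assumption; consult the documentation for the exact ones):

```python
import requests

# Hypothetical values -- substitute your stack's Job Queue API host and token.
BASE_URL = "https://queue.keboola.com"
HEADERS = {"X-StorageApi-Token": "YOUR_TOKEN"}
PAGE_SIZE = 500  # the new endpoint caps each page at 500 items

def search_all_jobs(filters=None):
    """Iterate over all jobs matching `filters`, one page at a time."""
    offset = 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/search/jobs",
            headers=HEADERS,
            params={**(filters or {}), "limit": PAGE_SIZE, "offset": offset},
            timeout=30,
        )
        resp.raise_for_status()
        jobs = resp.json()
        yield from jobs
        if len(jobs) < PAGE_SIZE:  # a short page means we've reached the end
            break
        offset += PAGE_SIZE
```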

See the Job Queue API documentation for more details.