Queue Job Slowdown on AWS EU Stack

We noticed a slowdown in our AWS EU stack’s job queue between 12:00 and 14:00 UTC due to a temporary service performance issue. We sincerely apologize for any inconvenience this may have caused.

Our team has resolved the problem and we are taking steps to prevent future occurrences. Thank you for your understanding and patience.

Error when installing Python packages in transformations

7:20 UTC: We are investigating an error when installing Python packages in transformations.
You may see errors such as: Job "XXXXXXX" ended with a user error "Failed to install package: io".
Next update in 15 min.

UPDATE 7:38 UTC: We have rolled back to the previous version and all operations are back to normal. We're sorry for the inconvenience.

List of tables in buckets may not work correctly

Today at 15:30 UTC we noticed a problem with listing tables in Storage. Tables fail to display, but only for new users in a project. There can be related issues, such as not being able to load data into a workspace; these also apply only to new users. We are seeing this problem across all stacks and regions.

We are working on a fix, next update in 30 min.

UPDATE 16:10 UTC: We have identified the root cause and are working on a fix. Next update in 2 hours.

UPDATE 19:00 UTC: The incident is now resolved, and tables are displayed correctly in Storage for all users.

We apologize for the inconvenience.

Limited service disruption for AWS EU

A limited service disruption on the AWS EU stack will start at 15:00 UTC today, as announced earlier. Storage jobs, Queue v1, and Orchestration (in projects with Queue v1) processing will stop, and new jobs will be delayed until the upgrade is completed. All running jobs will be cancelled but will resume after the upgrade.

All APIs and other unaffected services, such as Workspaces and Queue v2 jobs, will remain operational, though their operations may be delayed by the paused Storage jobs. We will provide an update when the service disruption starts and ends.

We apologize for any inconvenience caused and thank you for your understanding.

Update 15:00 UTC: The limited service disruption has begun.

Update 15:35 UTC: The service disruption has been resolved and the stack is now fully operational. 

Thank you for your patience.

Investigating higher latency across all stacks

As of 29 November 13:45 UTC, we are investigating higher latency for some requests across all stacks.
  • It might lead to errors in the UI
  • Job processing is not affected
We are rolling back to the previous version. Next update in 30 min.

UPDATE 2023-11-29 13:20 UTC - All operations are back to normal and the service is fully operational.

Extractor Microsoft SQL Server internal error across all stacks

The latest version (8.2.0) of the Microsoft SQL Server Extractor terminates with an internal error. This version was deployed yesterday, and we are currently performing a rollback. The next update will be available in 15 minutes.

[Resolved] 07:56 UTC: We have rolled back to version 8.1.1, and extractions are now functioning without any issues. We apologize for any inconvenience caused.

Telemetry: Missing credits for writer jobs in projects recently migrated to the new queue [resolved]

We have discovered that some writer jobs in the projects that were migrated to the new job queue (Queue V2) after the beginning of May are missing information about the data transferred. That information is used to calculate the number of credits consumed by those jobs.

We will deploy a fix tomorrow (Aug 10) that will add the missing credits to the affected jobs. For affected projects that regularly use writers, the result may be a higher recorded credit consumption.

The issue is related solely to the telemetry and does not affect Keboola Connection in any way. Moreover, it affects the telemetry only for projects that were recently migrated to Queue V2.

UPDATE 2023-08-10 11:04 UTC: The fix was deployed and the affected writer jobs show consumed credits again.

Detailed description of the issue

When a project is migrated to Queue V2, any jobs created in the past several months are also migrated, so that the user can keep track of what is going on in their Keboola project UI. Jobs in both Queue V1 (the old queue) and Queue V2 contain information about the data transferred by these jobs as different metrics. However, this information is not passed from an original job to the corresponding migrated one during the migration process.

Generally, Queue V1 jobs take precedence over Queue V2 jobs: since they hold the original data, they are used in the telemetry calculations rather than the migrated jobs.

In May, to speed up the telemetry calculations, the input mapping of Queue V1 jobs in a transformation was switched so that only data updated in the last 30 days was incrementally loaded for further processing.

As noted above, when a project was migrated to Queue V2, migrated copies of its jobs were also created. During telemetry processing, migrated jobs from the past several months were loaded, but only the recently updated Queue V1 jobs (from the last 30 days) were loaded alongside them. The older Queue V1 jobs therefore could not take precedence over their migrated Queue V2 copies, and the migrated jobs were incorrectly used for the telemetry output. Because those migrated jobs lack the data-transfer information, no credits were recorded for them.

As a fix, the transformation will now always load the entire history of Queue V1 jobs, preventing migrated jobs from being incorrectly used in the telemetry calculations.
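
To make the failure mode concrete, here is a minimal Python sketch of the precedence logic described above. It is illustrative only: the function and field names (pick_jobs_for_telemetry, original_id, bytes_transferred) are hypothetical and do not reflect Keboola's actual telemetry schema.

    from datetime import datetime, timedelta

    def pick_jobs_for_telemetry(v1_jobs, migrated_v2_jobs):
        """Prefer an original Queue V1 job over its migrated Queue V2 copy,
        because only the original carries the data-transfer metrics."""
        v1_ids = {job["id"] for job in v1_jobs}
        return v1_jobs + [job for job in migrated_v2_jobs
                          if job["original_id"] not in v1_ids]

    now = datetime(2023, 8, 9)
    all_v1_jobs = [
        {"id": 1, "updated": now - timedelta(days=90), "bytes_transferred": 5_000_000},
        {"id": 2, "updated": now - timedelta(days=10), "bytes_transferred": 1_000_000},
    ]
    migrated_v2_jobs = [
        {"original_id": 1, "bytes_transferred": None},  # metrics not copied on migration
        {"original_id": 2, "bytes_transferred": None},
    ]

    # The regression: with only the last 30 days of Queue V1 jobs loaded, the
    # older job 1 is represented by its migrated copy, which carries no
    # transfer metrics, so it yields zero credits.
    recent_v1 = [j for j in all_v1_jobs if j["updated"] >= now - timedelta(days=30)]
    assert any(j["bytes_transferred"] is None
               for j in pick_jobs_for_telemetry(recent_v1, migrated_v2_jobs))

    # The fix: loading the entire Queue V1 history restores precedence, so
    # every job keeps its original metrics.
    assert all(j["bytes_transferred"] is not None
               for j in pick_jobs_for_telemetry(all_v1_jobs, migrated_v2_jobs))

The precedence rule itself is sound; it only breaks when the Queue V1 side is windowed while the migrated side is not, which is exactly what the fix reverts.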

Storage job failures in the AWS EU stack

We are observing an increased number of failed Storage jobs on connection.eu-central-1.keboola.com, resulting in the error message "Cannot import data from Storage API". The main cause has been identified and resolved, and all systems should now be running smoothly. We will continue to monitor the situation, and the next update will be provided in 30 minutes.

We apologize for any inconvenience caused.

UPDATE 7:20 UTC [resolved]: All systems are functioning normally, and the incident has been resolved and closed.