Job processing slowing down in the AWS EU Central stack

2023-09-13 12:55 UTC - Job processing in the AWS EU stack has slowed down. We're investigating the issue.

2023-09-13 13:43 UTC - We see minor improvements. We're still investigating the issue and will post the next update when new information is available.

2023-09-13 14:31 UTC - The backlog of jobs has been cleared, and job processing should be functioning properly once more. We had to temporarily disable the "Child Jobs" checkbox on the Jobs page to address the issue. We are continuing to monitor the situation.

Job processing slowing down in the AWS EU Central stack

13:10 UTC Job processing in the AWS EU stack has slowed down. We're investigating the issue.

13:40 UTC The issue has been resolved and jobs are now being processed normally. We're still investigating the root cause.

14:30 UTC We're investigating the issue again, since jobs in the AWS EU stack have slowed down once more.

15:00 UTC We have found the root cause of the incident and are preparing a fix to mitigate the issue. Next update in 1 hour.

16:00 UTC We are still working on mitigating the issue. Next update in 1 hour.

16:55 UTC The issue has been fixed and all systems are now stable. Several projects still have problems listing jobs, and we are working to fix this. Next update when new information is available.

Failing sync actions in all stacks

11:40 UTC - We are investigating synchronous actions (such as checking database credentials) that have been failing in all our stacks since 11:00 UTC.

11:50 UTC - We deployed the previous version of the affected service, and all systems are now operational. We apologize for any inconvenience caused.

Failing jobs in AWS US

23:40 UTC: We are experiencing job failures in the AWS US stack when importing data to or from storage. They are caused by an incident in Snowflake, see https://status.snowflake.com/incidents/6d594mbq4v93

00:25 UTC: We are no longer experiencing the errors, although Snowflake hasn't closed the incident yet. See their last status update: "We've identified the source of the issue, and we're developing and implementing a fix to restore service."

(Resolved) 00:40 UTC: After further monitoring, we don't see any errors when importing data to or from storage, hence we consider the platform operational.

Microsoft SQL Server Extractor internal error across all stacks

The latest version (8.2.0) of Microsoft SQL Server Extractor terminates with an internal error. This version was deployed yesterday, and we are currently performing a rollback. Next update will be available in 15 minutes.

[Resolved] UTC 07:56: We have rolled back to version 8.1.1, and the extractions are now functioning without any issues. We apologize for any inconvenience caused.

Jobs delayed on US stack

UTC 14:05: We're investigating delayed job starts in the AWS US stack (https://connection.keboola.com/). Jobs are "stuck" in the "created" state.

UTC 14:33: The incident is now resolved and jobs are starting normally.


Storage jobs waiting in AWS US

UTC 22:30: We are investigating an excessive number of storage jobs waiting to be executed. Next update in 30 minutes.

UTC 23:00: The excess of waiting storage jobs seems to be limited to one particular project and is not affecting the whole platform. We are continuing the investigation nonetheless. Next update in 30 minutes.

[Resolved] UTC 23:40: We mitigated the job creation in the affected project, double-checked the consequences, and concluded that the platform is operational.

Buffer API outage in us-east region

UTC 12:30 We're investigating issues with the https://buffer.keboola.com/v1 endpoint in the us-east-1 region.

UTC 13:03 The internal database got overloaded. We're working on scaling it up and processing the backlog. We expect the endpoint to be restored within an hour.

UTC 14:23 The restore is unfortunately taking longer than expected. We're still working on it.

UTC 14:50 The restore is taking longer because of insufficient compute capacity for particular instance types in AWS. We're still working on it.

UTC 15:35 The endpoint is partially operational, but not fully replicated. We're still working on it.

UTC 15:56 The endpoint is operational and should be stable now.

Telemetry: Missing credits for writer jobs in projects recently migrated to the new queue [resolved]

We have discovered that some writer jobs in the projects that were migrated to the new job queue (Queue V2) after the beginning of May are missing information about the data transferred. That information is used to calculate the number of credits consumed by those jobs.

We will deploy a fix tomorrow (10th Aug), which will add the missing credits to the affected jobs. Affected projects that regularly use writers may therefore show a higher recorded consumption of credits.

The issue is related solely to the telemetry and does not affect Keboola Connection in any way. Moreover, it affects the telemetry only for projects that were recently migrated to Queue V2.

UPDATE 2023-08-10 11:04 UTC: The fix was deployed and the affected writer jobs show consumed credits again.

Detailed description of the issue

When a project is migrated to Queue V2, any jobs created in the past several months are also migrated, so that the user can keep track of what is going on in their Keboola project UI. Jobs in both Queue V1 (the old queue) and Queue V2 contain information about the data transferred by these jobs as different metrics. However, this information is not passed from an original job to the corresponding migrated one during the migration process.

Generally, Queue V1 jobs take precedence over Queue V2 jobs. To prevent any issues, the original Queue V1 jobs are used in the telemetry calculations rather than their migrated counterparts, because only the originals hold the original data.

In May, to speed up the telemetry calculations, the input mapping of Queue V1 jobs in a transformation was switched so that only data updated in the last 30 days was incrementally loaded for further processing.

As noted above, when a project was migrated to Queue V2, migrated jobs were also created. So, when processing jobs, all migrated jobs from the past several months were processed, but only recently updated Queue V1 jobs (from the last 30 days) were processed alongside them. Thus, the older Queue V1 jobs could not take precedence over the newer migrated Queue V2 jobs, so the latter were incorrectly used for the telemetry output. As a result, the information about transferred data was missing for those jobs, and no credits were recorded for them.

As the fix, the transformation now always loads the entire history of Queue V1 jobs to prevent migrated jobs from incorrectly being used in telemetry calculations.
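
The precedence rule and the effect of the 30-day window can be illustrated with a small Python sketch. This is not the actual telemetry transformation, which is not shown here. The names Job, select_for_telemetry, job_id, and bytes_transferred are hypothetical, and the sketch assumes that an original Queue V1 job and its migrated copy share an identifier.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Job:
    job_id: str                        # hypothetical ID shared by a V1 job and its migrated copy
    queue: str                         # "v1" or "v2-migrated"
    updated: datetime
    bytes_transferred: Optional[int]   # None models the metric lost during migration


def select_for_telemetry(
    v1_jobs: List[Job],
    migrated_jobs: List[Job],
    now: datetime,
    v1_window_days: Optional[int] = None,
) -> Dict[str, Job]:
    """Pick one record per job, preferring the Queue V1 original when it is loaded."""
    if v1_window_days is not None:
        # Incremental load: only V1 jobs updated inside the window are available.
        cutoff = now - timedelta(days=v1_window_days)
        v1_jobs = [j for j in v1_jobs if j.updated >= cutoff]
    chosen = {j.job_id: j for j in migrated_jobs}
    chosen.update({j.job_id: j for j in v1_jobs})  # V1 takes precedence
    return chosen


now = datetime(2023, 8, 9)
v1 = [Job("job-1", "v1", datetime(2023, 5, 20), 1000)]
migrated = [Job("job-1", "v2-migrated", datetime(2023, 8, 1), None)]

# Behaviour with the 30-day window: the old V1 job is not loaded, so the migrated copy wins.
windowed = select_for_telemetry(v1, migrated, now, v1_window_days=30)

# Behaviour after the fix: the full V1 history is loaded, so the original wins.
full_history = select_for_telemetry(v1, migrated, now)
```

In this sketch, windowed["job-1"] is the migrated copy with no transferred-data metric, while full_history["job-1"] is the original Queue V1 job with its metric intact.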