Telemetry: Missing credits for writer jobs in projects recently migrated to the new queue [resolved]

We have discovered that some writer jobs in the projects that were migrated to the new job queue (Queue V2) after the beginning of May are missing information about the data transferred. That information is used to calculate the number of credits consumed by those jobs.

We will deploy a fix tomorrow (10th Aug), which will add missing credits to the jobs affected. For affected projects regularly using writers, the result may be that they have a higher recorded consumption of credits.

The issue is related solely to the telemetry and does not affect Keboola Connection in any way. Moreover, it affects the telemetry only for projects that were recently migrated to Queue V2.

UPDATE 2023-08-10 11:04 UTC: The fix was deployed and the affected writer jobs show consumed credits again.

Detailed description of the issue

When a project is migrated to Queue V2, any jobs created in the past several months are also migrated, so that the user can keep track of what is going on in their Keboola project UI. Jobs in both Queue V1 (the old queue) and Queue V2 contain information about the data transferred by these jobs as different metrics. However, this information is not passed from an original job to the corresponding migrated one during the migration process.

Generally, Queue V1 jobs take precedence over Queue V2 jobs. To prevent any issues, they are used in the telemetry calculations, rather than the migrated jobs, as they have the original data.

In May, to speed up the telemetry calculations, the input mapping of Queue V1 jobs in a transformation was switched so that only data updated in the last 30 days was incrementally loaded for further processing.

As noted above, when a project was migrated to Queue V2, migrated jobs were also created. So, during processing, many migrated jobs from the past several months were processed, but only recently updated Queue V1 jobs (from the last 30 days) were processed alongside them. Thus, the older Queue V1 jobs could not take precedence over the newer migrated Queue V2 jobs, so the latter were incorrectly used for the telemetry output. As a result, the information about transferred data was missing, and no credits were recorded for those jobs.

For the bug fix, a transformation will now always load the entire history of Queue V1 jobs to prevent migrated jobs from incorrectly being used in telemetry calculations.
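
The interaction between the precedence rule and the 30-day incremental window can be illustrated with a minimal Python sketch. It is not the actual telemetry transformation: the record fields (job_id, updated, bytes_transferred), the function names and the sample dates are all hypothetical and chosen only to reproduce the behaviour described above.

```python
from datetime import datetime, timedelta

# Hypothetical job records; the field names are illustrative and do not
# reflect the real telemetry schema.

def load_v1_jobs(v1_jobs, now, window_days=None):
    """Return Queue V1 jobs, optionally limited to those updated recently."""
    if window_days is None:
        return list(v1_jobs)  # full history (behaviour after the fix)
    cutoff = now - timedelta(days=window_days)
    return [job for job in v1_jobs if job["updated"] >= cutoff]

def pick_jobs_for_telemetry(v1_jobs, migrated_v2_jobs):
    """Queue V1 jobs take precedence over their migrated Queue V2 copies."""
    chosen = {job["job_id"]: job for job in migrated_v2_jobs}
    chosen.update({job["job_id"]: job for job in v1_jobs})
    return chosen

now = datetime(2023, 8, 9)
v1_jobs = [
    {"job_id": 1, "updated": datetime(2023, 4, 1), "bytes_transferred": 5000},
    {"job_id": 2, "updated": datetime(2023, 8, 1), "bytes_transferred": 7000},
]
# Migrated copies exist for the whole history but lack the transfer metric.
migrated_v2_jobs = [
    {"job_id": 1, "updated": datetime(2023, 8, 5), "bytes_transferred": None},
    {"job_id": 2, "updated": datetime(2023, 8, 5), "bytes_transferred": None},
]

# With the 30-day window, the old V1 job 1 is never loaded, so its migrated
# copy (without transfer data) ends up in the telemetry output.
buggy = pick_jobs_for_telemetry(load_v1_jobs(v1_jobs, now, 30), migrated_v2_jobs)

# With the full V1 history, every migrated copy is overridden by the original.
fixed = pick_jobs_for_telemetry(load_v1_jobs(v1_jobs, now), migrated_v2_jobs)

print(buggy[1]["bytes_transferred"])  # None
print(fixed[1]["bytes_transferred"])  # 5000
```

In this sketch, only the job older than the window loses its transfer data, which matches the observation that the problem affected jobs in recently migrated projects rather than all jobs.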

Jobs not starting in EU region

2023-07-25 15:30 UTC - We are investigating jobs failing to start in EU region. Next update in 30 minutes.

2023-07-25 16:00 UTC - We are still investigating the issue. Next update in 30 minutes.

2023-07-25 16:30 UTC - We are still investigating the issue. Next update in 30 minutes.

2023-07-25 17:00 UTC (Resolved) - We found that running jobs in a disabled project were preventing other jobs from starting, and we took immediate action to resolve the problem. Jobs are now starting and the platform is operational. We will investigate further to find the root cause.

New workspaces disappearing from the list of workspaces

UTC 10:30 We have confirmed that new and old sandboxes are now being correctly displayed. There is a chance that workspaces created between approximately July 13 10:40 UTC and July 14 10:00 UTC may still be invisible. If you are missing a workspace in your workspace list, please contact us through support, and we'll fix these cases individually. We sincerely apologize for the trouble.

UTC 8:25 We're working on a fix and expect it to be ready in approximately 2 hours. Next update in 2 hours.

UTC 7:20 We have identified the cause and are working on a fix. As a workaround, you can create a workspace in a Development branch, where it should display correctly. We have confirmed that this is only an issue with listing the workspaces, so no data is lost. Next update in 1 hour.

UTC 6:40 We're investigating reports of users not being able to see new or recently created workspaces in the list of workspaces. Preliminary results show that this is only an issue with the listing; the workspaces do actually exist. Next update in 30 minutes.


Increased error rate in projects using Queue V1 & Workspaces

UTC 10:30 We're again seeing an increased number of errors, this time reported as "Cannot import data from Storage API: Request body not valid". The first occurrence of this error was at 9:40 UTC. We're investigating the details. Next update in 20 minutes.

UTC 10:40 We have identified the approximate cause. Only jobs in projects not using Queue V2 are affected, while Workspaces in all projects could have been affected. We're working on a fix. Next update in 20 minutes.

UTC 10:58 The fix was deployed and the issue is now resolved. We apologize again for the inconvenience.

Increased error rate in all stacks

UTC 9:40: We're seeing reports of an increased number of application errors in all stacks. It seems that table exports are mostly affected.

We're investigating the issue. Next update in 15 minutes.

UTC 9:55: The issue was caused by a temporary internal inconsistency during the deployment of one of our services. Approximately 30 jobs failed across all stacks. The issue is now resolved. We apologize for the inconvenience.


Slow jobs start in AWS EU stack

July 10th 12:20 UTC: We are investigating slow job starts in the AWS EU stack, which we have been experiencing since midnight CET on July 5th.

July 10th 13:30 UTC: We have implemented certain measures that we believe could mitigate the issue; however, we have not yet identified the root cause. We will continue to closely monitor the situation and conduct further investigation. The next update will be provided tomorrow (July 11th) or as soon as new information becomes available.

July 11th 11:33 UTC: We are still experiencing intermittent slow job starts during peak times, and our investigation is ongoing. The next update will be provided as soon as new information becomes available.

July 13th 10:46 UTC: At 09:45 UTC, we deployed multiple optimizations to address and reduce job start delays. We will continue to closely monitor the situation, and we will provide the next update as soon as new information becomes available.

July 14th 06:34 UTC: Significant improvements have been achieved since the previous deployment, restoring performance to pre-July 5th levels. We continue to monitor the situation closely to maintain stability. Thank you for your patience and support.

July 17th 06:44 UTC: Performance is back to pre-July 5th levels and the issue is now resolved. We apologize for any inconvenience caused.

Telemetry - Data issue in kbc_usage_metrics_values table

After the last telemetry update, incremental processing of the kbc_usage_metrics_values table might have caused higher credit usage to be shown for some projects and usage breakdowns.

The data have been processed in full to ensure wrong records are fixed or removed.
The Telemetry Data Extractor has been switched to force a full load until Monday, 17th July, so that the data is also fixed in projects using this component with incremental load.

Project UI not loading

2023-07-04 11:15 UTC We are investigating problems with UI loading on all stacks.

2023-07-04 11:35 UTC Project UI is now working. The root cause was a bug in the UI deployment.

We apologize for any inconvenience caused.

Jobs list failures

2023-06-30 12:20 UTC We are investigating problems with listing jobs on all stacks. The error manifests as the message: Invalid configuration for path "job.branchType": BranchType must be one of dev, default.

Next update in 30 min

2023-06-30 12:45 UTC [resolved] We have re-deployed the last functional version and the problem is now solved.

We apologize for any inconvenience caused.

Storage job failures in the AWS EU stack

We are observing an increased number of failing storage jobs with the error message "Cannot import data from Storage API" in connection.eu-central-1.keboola.com. The main cause has been identified and resolved, and all systems should now be running smoothly. We will continue to monitor the situation, and the next update will be provided in 30 minutes.

We apologize for any inconvenience caused.

UPDATE 7:20 UTC [resolved] All systems are functioning normally, and the incident has been resolved and closed.