Failing jobs in AWS US

23:40 UTC: We are experiencing job failures in the AWS US stack when importing data to or from storage. The failures are caused by an incident in Snowflake; see https://status.snowflake.com/incidents/6d594mbq4v93

00:25 UTC: We are no longer experiencing the errors, although Snowflake hasn't closed the incident yet. Its latest status update reads: "We've identified the source of the issue, and we're developing and implementing a fix to restore service."

(Resolved) 00:40 UTC: After further monitoring, we no longer see any errors when importing data to or from storage, so we consider the platform operational.

Jobs delayed on US stack

UTC 14:05: We're investigating delayed job starts in the AWS US stack https://connection.keboola.com/. Jobs are stuck in the "created" state.

UTC 14:33: The incident is now resolved; jobs are starting normally.


Storage jobs waiting in AWS US

UTC 22:30: We are investigating an unusually large number of storage jobs waiting to be executed. Next update in 30 minutes.

UTC 23:00: The backlog of waiting storage jobs appears to be limited to one particular project and does not affect the whole platform. We are continuing the investigation. Next update in 30 minutes.

[Resolved] UTC 23:40: We mitigated the job creation in the affected project, double-checked the consequences, and concluded that the platform is operational.

Buffer API outage in us-east region

UTC 12:30 We're investigating issues with the https://buffer.keboola.com/v1 endpoint in the us-east-1 region.

UTC 13:03 The internal database became overloaded. We're working on scaling it up and processing the backlog. We expect the endpoint to be restored within an hour.

UTC 14:23 The restore is unfortunately taking longer than expected. We're still working on it.

UTC 14:50 The restore is taking longer because of insufficient compute capacity for particular instance types in AWS. We're still working on it.

UTC 15:35 The endpoint is partially operational, but not fully replicated. We're still working on it.

UTC 15:56 The endpoint is operational and should be stable now.

Jobs not starting in EU region

2023-07-25 15:30 UTC - We are investigating jobs failing to start in the EU region. Next update in 30 minutes.

2023-07-25 16:00 UTC - We are still investigating the issue. Next update in 30 minutes.

2023-07-25 16:30 UTC - We are still investigating the issue. Next update in 30 minutes.

2023-07-25 17:00 UTC (Resolved) - We found that jobs running in a disabled project were preventing other jobs from starting, and we took immediate action to resolve the problem. Jobs are now starting and the platform is operational. We will investigate further to find the root cause.

New workspaces disappearing from the list of workspaces

UTC 10:30 We have confirmed that new and old sandboxes are now being correctly displayed. There is a chance that workspaces created between approximately July 13 10:40 UTC and July 14 10:00 UTC may still be invisible. If you are missing a workspace in your workspace list, please contact us through support, and we'll fix these cases individually. We sincerely apologize for the trouble.

UTC 8:25 We're working on a fix and expect it to be ready in approximately 2 hours. Next update in 2 hours.

UTC 7:20 We have identified the cause and are working on a fix. As a workaround, you can create a workspace in a Development branch, where it should display correctly. We have confirmed that this is only an issue with listing the workspaces, so no data is lost. Next update in 1 hour.

UTC 6:40 We're investigating reports of users not being able to see new or recently created workspaces in the list of workspaces. Preliminary results show that this is only an issue with the listing; the workspaces do actually exist. Next update in 30 minutes.


Increased error rate in projects using Queue V1 & Workspaces

UTC 10:30 We're again seeing an increased number of errors; this time they are reported as "Cannot import data from Storage API: Request body not valid". The first occurrence of this error was at 9:40 UTC. We're investigating the details. Next update in 20 minutes.

UTC 10:40 We have identified the approximate cause. Only jobs in projects not using Queue V2 are affected, and Workspaces in all projects could have been affected. We're working on a fix. Next update in 20 minutes.

UTC 10:58 The fix has been deployed and the issue is now resolved. We apologize again for the inconvenience.

Increased error rate in all stacks

UTC 9:40: We're seeing reports of an increased number of application errors in all stacks. It seems that table exports are mostly affected.

We're investigating the issue. Next update in 15 minutes.

UTC 9:55: The issue was caused by a temporary internal inconsistency during the deployment of one of our services. Approximately 30 jobs failed across all stacks. The issue is now resolved. We apologize for the inconvenience.


Slow job starts in AWS EU stack

July 10th, 12:20 UTC: We are investigating slow job starts in the AWS EU stack, which we have been experiencing since midnight CET on July 5th.

July 10th 13:30 UTC: We have implemented certain measures that we believe could mitigate the issue; however, we have not yet identified the root cause. We will continue to closely monitor the situation and conduct further investigation. The next update will be provided tomorrow (July 11th) or as soon as new information becomes available.

July 11th 11:33 UTC: We are still experiencing intermittent slow job starts during peak times, and our investigation is ongoing. The next update will be provided as soon as new information becomes available.

July 13th 10:46 UTC: At 09:45 UTC, we deployed multiple optimizations to address and reduce job start delays. We will continue to closely monitor the situation, and we will provide the next update as soon as new information becomes available.

July 14th 06:34 UTC: Significant improvements have been achieved since the previous deployment, restoring performance to pre-July 5th levels. We continue to monitor the situation closely to maintain stability. Thank you for your patience and support.

July 17th 06:44 UTC: Performance is back to pre-July 5th levels, and the issue is now resolved. We apologize for any inconvenience caused.