Failing Facebook/Instagram components on Azure North Europe stack

2023-03-14 14:15 UTC - Since Sunday, we have been experiencing component failures when communicating with the Facebook Graph API on our Azure North Europe stack.

The following components are failing with an application error:


We believe that the issue is related to the following reported bug in the Facebook API

We are monitoring the situation.

2023-03-20 08:00 UTC - We have not received any update from Meta. The last error occurred more than 72 hours ago and the situation looks stable, so we are considering the issue resolved. We will continue to monitor the situation.

Jobs outage on the us-east-1 stack

2023-03-07 14:37 CET - We are investigating a problem with jobs on the us-east-1 stack.

2023-03-07 14:42 CET - We have identified a problem with one of our internal databases, which contains metadata about jobs. As a result, no jobs have been able to run since 14:30 CET, and the rest of the platform may be behaving abnormally.

2023-03-07 15:05 CET - The affected database has been fixed, and operations should have been running normally since 15:00 CET.

We apologize for any inconvenience caused and thank you for your understanding.

Limited service disruption for AWS US and EU stacks on March 21st and 22nd

Due to necessary database upgrades to our AWS US and EU stacks, a limited service disruption will take place on March 21st and 22nd.

  • On Tuesday March 21st at 12:00 pm UTC, the disruption will begin for AWS EU, and
  • on Wednesday March 22nd at 10:00 am UTC, it will begin for AWS US.

We anticipate that the limited service disruption will take approximately 15 minutes, but it should not exceed 60 minutes. Hopefully, this will be resolved before you return from your lunch or coffee break.
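For planning purposes, the windows above can be converted to local time. The following is a minimal sketch, not an official tool: it assumes the year 2023 (the announcement gives only the day and month), and the names `WINDOW_STARTS_UTC` and `disruption_window` are hypothetical. It converts to CET (UTC+1), the other time zone used in these notices.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+1 offset; CET is the other time zone used in these notices.
CET = timezone(timedelta(hours=1), "CET")

# Start times from the announcement. The year 2023 is an assumption.
WINDOW_STARTS_UTC = {
    "AWS EU": datetime(2023, 3, 21, 12, 0, tzinfo=timezone.utc),
    "AWS US": datetime(2023, 3, 22, 10, 0, tzinfo=timezone.utc),
}
EXPECTED_DURATION = timedelta(minutes=15)  # anticipated length
MAXIMUM_DURATION = timedelta(minutes=60)   # stated upper bound


def disruption_window(stack, tz=CET):
    """Return (start, expected end, latest end) for a stack in the given time zone."""
    start = WINDOW_STARTS_UTC[stack].astimezone(tz)
    return start, start + EXPECTED_DURATION, start + MAXIMUM_DURATION


for stack in WINDOW_STARTS_UTC:
    start, expected_end, latest_end = disruption_window(stack)
    print(f"{stack}: starts {start:%Y-%m-%d %H:%M %Z}, "
          f"expected end {expected_end:%H:%M}, latest end {latest_end:%H:%M}")
# AWS EU: starts 2023-03-21 13:00 CET, expected end 13:15, latest end 14:00
# AWS US: starts 2023-03-22 11:00 CET, expected end 11:15, latest end 12:00
```

Passing a different `tzinfo` object as `tz` gives the same windows in another time zone.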

During this period, Storage jobs, Queue v1 and Orchestration (in projects with Queue v1) processing will stop, and new jobs will be delayed until the upgrade is completed. All running jobs will be cancelled, but will resume after the upgrade.

All APIs and other unaffected services, such as Workspaces and Queue v2 jobs, will remain operational, though their operations may be delayed due to the Storage job delays.

We apologize for any inconvenience caused and thank you for your understanding.

Hidden Transformations v2 configurations in UI

2023-03-01 16:00 CET - We are investigating hidden "Transformation v2" configurations in the UI. Next update in 15 minutes or when more information is available.

2023-03-01 16:10 CET - We have identified the root cause and prepared a fix, which will be deployed within 10 minutes.

(Resolved) 2023-03-01 16:24 CET - The fix has been deployed and "Transformation v2" configurations are no longer hidden in the UI. We advise users to reload their browsers, as this was a UI issue.

Job failures in AWS EU stack

2023-02-20 15:20 UTC - Between Feb 19 15:10 UTC and Feb 20 14:00 UTC, an underlying node failure caused a small number of jobs on the stack to end either by timeout or with the error message "Component terminated. Possibly due to out of memory error". We're actively investigating the cause and taking measures to prevent this from happening again.

2023-02-20 15:56 UTC - The incident has been resolved. The last occurrence of the error was on Feb 20 at 14:35 UTC. We are continuing to monitor the situation closely to prevent any recurrence.

Failing jobs on all stacks

2023-02-10 09:20 UTC - We are currently investigating failing jobs on all stacks, a problem that first occurred on 2023-02-09 08:48 UTC. Affected jobs fail with the error message "K8S request has failed: events is forbidden: User "system:serviceaccount:job-queue-jobs:daemon-service-account" cannot list resource "events" in API group "" in the namespace "job-queue-jobs"".
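The error message states that a service account was denied permission to list "events" in a namespace; the empty API group denotes the Kubernetes core API group. As a rough illustration of how the message breaks down, the sketch below extracts its parts with a regular expression. The pattern and the names `FORBIDDEN` and `parse_forbidden` are hypothetical and written against this one message only, not against any official message format.

```python
import re

# Error text quoted in the incident report.
MESSAGE = (
    'K8S request has failed: events is forbidden: User '
    '"system:serviceaccount:job-queue-jobs:daemon-service-account" '
    'cannot list resource "events" in API group "" '
    'in the namespace "job-queue-jobs"'
)

# Hypothetical pattern matching the shape of the message above.
FORBIDDEN = re.compile(
    r'User "system:serviceaccount:(?P<sa_namespace>[^:"]+):(?P<service_account>[^"]+)" '
    r'cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<api_group>[^"]*)" '
    r'in the namespace "(?P<namespace>[^"]+)"'
)


def parse_forbidden(message):
    """Return the denied subject, verb, resource and namespace, or None if no match."""
    match = FORBIDDEN.search(message)
    return match.groupdict() if match else None


details = parse_forbidden(MESSAGE)
print(details["service_account"], details["verb"], details["resource"])
# daemon-service-account list events
```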

UPDATE 09:41 UTC: We have identified the problem and rolled back to the previous version of our service. All services are now operating normally.

UPDATE 10:35 UTC: After further investigation, we found that this problem affected only a small fraction of jobs.

We apologize for the inconvenience.

Storage jobs restarts

2023-02-09 10:07 - We are currently investigating storage job restarts that occurred on 2023-02-09 07:35 UTC and 2023-02-07 08:04 UTC. These restarts have caused longer job run times or errors such as "table already exists" during transformation executions. We will provide another update when new information is available.

2023-02-09 10:57 - We have identified the root cause. We will deploy a fix within two hours, which might cause another occurrence of these restarts for some jobs.

2023-02-09 13:53 - We deployed a fix at 13:20 UTC, and the deployment caused the last occurrences of restarts. The issue is now resolved and you should not experience any more job restarts.

Templates & Keboola CLI errors

10:50 UTC Due to recent changes in the Storage API, the Templates API and Keboola CLI have been returning errors in multiple situations since approximately 9:00 UTC. As a result, you might see unexpected errors when working with the Keboola CLI or when trying to apply templates. We're working on a fix, which we expect to release today (ETA 15:00 UTC).

13:05 UTC The issue in the Storage API has been fixed. All services are now operating normally. We apologize for any inconvenience this may have caused.