MSSQL extractor errors

Yesterday we released a new version of the MSSQL extractor that contained a bug causing failed jobs: the extracted data did not line up with the headers.

UPDATE 04:12 UTC: We have rolled back to the previous version. All affected configurations should be working again.

We sincerely apologize for the errors. A postmortem report will follow with further details.

UPDATE 10:47 UTC: We'll be contacting all customers possibly affected by this error through support.
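If you would like to check whether a previously extracted table was affected, a minimal sketch along these lines can flag rows whose column count doesn't match the header. It assumes the extractor output is a plain CSV file; the file name is only an illustration.

import csv

# Hypothetical check: list CSV rows whose column count differs from the header row.
# Assumes the extracted table is a plain CSV file; adjust the path and dialect to your setup.
def find_misaligned_rows(path):
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        expected = len(next(reader))  # width of the header row
        return [line for line, row in enumerate(reader, start=2) if len(row) != expected]

print(find_misaligned_rows("extracted_table.csv"))  # e.g. [] when all rows match the header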

Job delays in AWS EU region

Apr 14 21:30 UTC: The backlog has been cleared and all operations are back to normal. The incident is now resolved.

Apr 14 20:15 UTC: We have mitigated the issue; the backlog should be cleared within 20 minutes. We continue to monitor the situation. Next update in one hour or when new information becomes available.

Apr 14 19:06 UTC: We've identified the problem and added more capacity for faster backlog processing. Next update in one hour or when new information becomes available.

Apr 14 18:30 UTC: We are investigating job delays in the AWS EU region. We are working on resolving the situation and will keep you posted.

Job errors in Azure North Europe region

Apr 14 18:55 UTC: We no longer see any connectivity failures, so we consider this resolved. To sum up, the failing connections occurred between 18:00 and 18:15 UTC and caused higher job error rates and job delays in the Azure North Europe region.

Apr 14 18:25 UTC: The system seems to be back to normal; however, we continue to investigate the root cause. Preliminary investigation shows there was a temporary network connection failure to one of our metadata databases that caused the increased job error rate.

Apr 14 18:15 UTC: We are investigating a higher job error rate and job delays in the Azure North Europe region. We are working on resolving the situation and will keep you posted.

Recurring jobs processing delays in EU stack in morning hours

The problem with delayed and prolonged job processing reported yesterday occurred again this morning. Apparently, the actions we took were not sufficient. We are investigating the situation and will let you know about the resolution.

UPDATE 14:15 CET: We analyzed the load of the workers processing component jobs and, because of some suspicious activity, restarted their instances to ensure they behave correctly. We will monitor the situation tomorrow morning to ensure that job performance is not affected.

UPDATE April 9, 8:30 CET: We closely monitored the situation in the morning and the platform seems to be back to normal. Please let us know if you still experience any unexpected effects.

Jobs processing delays in EU stack

Yesterday morning (2021-04-06, approximately between 5:00 and 9:00 CET) we noticed an overall increased number of jobs in the eu-central region, which resulted in prolonged orchestrations or long waits for jobs to start. The situation repeated today, and we decided to add more workers to compensate.

Yesterday there was also unusual congestion in transformation jobs in the same region, but that appears to have been an anomaly that had not occurred before and did not repeat today.

We will monitor the situation over the next few days and take further measures if necessary.

Python Workspaces Fail to Run in North Europe stack

Since 2021-03-25 12:30 UTC we have been experiencing failures when starting Python and Python MLflow workspaces in the North Europe stack. We are working on resolving the situation and will keep you posted.

UPDATE 13:30 UTC: We identified the source of the problem as a lack of hardware resources in the North Europe Azure datacenter. We are working on switching to another type of computing instance that is not affected by this shortage.

UPDATE 13:55 UTC: We switched the infrastructure to another type of computing instance, and the workspaces are running again. We are still monitoring the situation.

Jobs performance degradation in EU and US stack

We are investigating slower job performance since Sunday night (2021-03-21) in the EU AWS and US AWS stacks. We are seeing the issue only in particular projects and for a limited number of jobs; most jobs are not affected.

We are still investigating the causes. If you are experiencing a substantial slowdown of job processing in your project, please contact us via support in the project.

UPDATE 20:28 UTC: We have restarted all job worker nodes in the US AWS and EU AWS stacks and are monitoring the results. This should have no impact on currently running jobs.

UPDATE March 23, 2021, 10:00 UTC: After the restart, all systems are operating normally and we don't see any unusual delays in running jobs. We continue to monitor the situation, but unless it escalates we consider this incident resolved.

Authentication errors across multiple Microsoft services

March 15, 2021, 20:16 UTC: We are seeing authentication errors across multiple Microsoft and Azure related services. This affects all stacks and all regions. In AWS regions only some component jobs are affected (mainly the PowerBI writer, OneDrive Excel Sheets writer, and OneDrive Excel Sheets extractor); in Azure regions more services may be affected.

You may see application errors or increased job running times.

See https://status.azure.com/en-us/status for more details.

Next update in 60 minutes.

UPDATE March 15, 2021, 21:44 UTC - Engineers at Azure are currently rolling out mitigation worldwide. Full mitigation expected within 60 minutes.

UPDATE March 16, 2021, 0:16 UTC - Engineers at Azure have rolled out a fix for Azure AD to all affected regions. Internal telemetry and customer reports suggest that the error rate for dependent services is rapidly decreasing. Microsoft services are in the process of recovering; some services may recover at varying times following the underlying fix. The next update will be provided at 8:00 UTC.

UPDATE March 16, 2021, 7:52 UTC - The issue related to Azure Active Directory has been resolved. Azure now reports problems with authentication to the Storage Service, but we don't see any impact on Keboola Connection. The next update will be provided at 12:00 UTC.

UPDATE March 16, 2021, 11:23 UTC - The issues with the Azure services have been fixed. The incident is now resolved.

Orchestration Notifications incident

2021-01-28 at 17:03 UTC We're investigating an issue with Orchestration notifications in the EU and US AWS stacks. You may not be receiving notifications for failed or long-running orchestrations.

2021-01-28 at 17:39 UTC Notifications are being sent again.

2021-01-28 at 18:30 UTC The issue is resolved. There were a total of 11 affected orchestrations. We sent tickets to all affected projects (unless the given orchestration was failing regularly).