Delayed processing of jobs in AWS EU stack

2022-01-12 8:40 UTC We are seeing a higher number of jobs in the waiting state than usual. We are investigating the issue.

2022-01-12 9:05 UTC There was a sudden increase in traffic on our EU Snowflake warehouse, so we upgraded it to a larger instance, and the queued jobs were immediately processed. The delays should be resolved soon. We are still monitoring the warehouse until the traffic settles down.

connection.eu-central-1.keboola.com maintenance

UPDATE 12:05 UTC - The previously announced maintenance of connection.eu-central-1.keboola.com will start in one hour, at 13:00 UTC. During the maintenance, you can't access your data and projects. All network connections will be terminated with an "HTTP 503 - down for maintenance" status message.

UPDATE 13:00 UTC - EU stack (connection.eu-central-1.keboola.com) maintenance started.

UPDATE 14:53 UTC - EU stack (connection.eu-central-1.keboola.com) maintenance is finished. The platform should be stable; we continue to monitor it.

connection.keboola.com maintenance

UPDATE 06:54 UTC - The previously announced maintenance of connection.keboola.com will start in one hour. During the maintenance, you can't access your data and projects. All network connections will be terminated with an "HTTP 503 - down for maintenance" status message.

UPDATE 08:00 UTC - US stack (connection.keboola.com) maintenance started.

UPDATE 10:43 UTC - US stack (connection.keboola.com) maintenance is finished. The platform should be stable; we continue to monitor it.

High API error rate in AWS service

The AWS service in the US-EAST-1 region, where we operate the AWS US stack, is disrupted by network connectivity issues affecting some instances in one availability zone. So far our service does not seem to be directly affected, but the disruption may eventually reach some of its parts. We are monitoring the situation and will let you know about its progress in an hour.

UPDATE: The Keboola Academy site is down too, most probably due to this outage.

UPDATE 14:30 CET: The AWS team has identified the problem (a power outage in one data center in the USE1-AZ4 availability zone) and is already restoring power and recovering from it. Some other services, such as Slack and SolarWinds Papertrail, appear to have been affected as well, but Connection seems to be unaffected except for some short job-processing delays. We are still monitoring the situation and will post another update in an hour.

UPDATE 15:30 CET: Power to all affected instances and network devices has been restored, and the majority of EC2 instances and EBS volumes within the affected Availability Zone are recovering. The impact on the Connection stack should be close to zero by now.

AWS Stacks Maintenance Announcement

Maintenance of the AWS stacks of Keboola Connection will take place on Saturday, Jan 8th, 2022, and should take less than three hours.

During the maintenance, you can't access your data and projects. All network connections will be terminated with an "HTTP 503 - down for maintenance" status message.

We will monitor all running tasks and restart them in case of any interruption. Orchestrations and running transformations will generally be delayed, but not interrupted. However, feel free to reschedule your Saturday orchestrations to avoid this maintenance window.
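For API clients, the maintenance window simply looks like HTTP 503 on every request. A minimal sketch of waiting it out, assuming the standard Python requests library (the stack URL and polling interval below are placeholders, not a prescribed client behavior):

```python
import time

import requests

# Placeholder: use the stack that hosts your project.
STACK_URL = "https://connection.keboola.com/v2/storage"

def wait_for_maintenance_end(url: str, poll_seconds: int = 60) -> None:
    """Poll the stack until it stops responding with HTTP 503."""
    while True:
        response = requests.get(url, timeout=10)
        if response.status_code != 503:
            return  # maintenance is over; other statuses need their own handling
        time.sleep(poll_seconds)

wait_for_maintenance_end(STACK_URL)
```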

Delayed processing of jobs in Azure North Europe stack

2021-12-16 17:40 UTC We are seeing a higher number of jobs in the waiting state than usual. We continue to investigate the issue.

2021-12-16 18:45 UTC The issue has been resolved; everything is working as expected.

2021-12-16 19:10 UTC Further investigation revealed that parallel execution of configuration rows might have been affected, leaving some jobs stuck. Please review your jobs that run as a configuration in parallel, terminate any that appear to be stuck, and run them again.
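If you have many such jobs, reviewing and terminating them through the API may be faster than the UI. A rough sketch in Python (the Queue API base URL, the job-listing filter, and the /jobs/{id}/kill endpoint are assumptions here; verify them against the Jobs Queue API documentation for your stack):

```python
import requests

# Assumptions: your stack's Queue API URL and a Storage API token for the project.
QUEUE_URL = "https://queue.north-europe.azure.keboola.com"
HEADERS = {"X-StorageApi-Token": "your-storage-api-token"}

# List jobs that are still processing (filter name is an assumption).
jobs = requests.get(
    f"{QUEUE_URL}/jobs",
    headers=HEADERS,
    params={"status": "processing"},
    timeout=30,
).json()

for job in jobs:
    print(job["id"], job.get("status"), job.get("createdTime"))
    # After confirming a job is stuck, terminate it (assumed endpoint):
    # requests.post(f"{QUEUE_URL}/jobs/{job['id']}/kill", headers=HEADERS, timeout=30)
```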

Log4j zero-day vulnerability update

Regarding the security issue with the Log4j zero-day vulnerability (CVE-2021-44228), we have completed all the steps necessary to investigate whether our systems had been compromised.

After a deep investigation, we can say that there were no security issues or breaches in our systems. We do not use Log4j in our main services.

We also checked all the 3rd-party services we use. Thanks to our very strict security standards, those services are not publicly accessible, they run in a separate environment (disconnected from customer data), and they cannot be used as an attack vector. We also haven't received any reports of security issues from our SaaS partners.

We take the security of your data very seriously, so we have applied additional threat detection for the Log4j security issue.

Please reach out if you have any questions.

Column, Table, and Bucket metadata overwritten – repair

We found a way to repair the overwritten column, table, and bucket user metadata caused by the incident reported here: Column, table or bucket metadata possibly overwritten

The incident affected column, table, and bucket metadata that had two (or more) metadata entries with the same key but different providers. When metadata was updated for one provider, the values were changed for all of them. This could have rewritten user-defined metadata such as column type, length, or any other key.

This metadata is used for input mapping. Existing mappings were not affected, but you may run into a problem when you create a new input mapping that uses a table with affected metadata, even though that table works in existing mappings. As a temporary workaround, you can manually reset the affected user-defined metadata (for example, a column's data type) to the correct value.
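For illustration, resetting a metadata value through the Storage API could look roughly like this (the endpoint shape, the form-encoded payload, and the table ID and key below are assumptions and placeholders; check the Storage API documentation for your stack before relying on it):

```python
import requests

# Placeholders: your stack's Storage API and a token with write access.
STORAGE_URL = "https://connection.keboola.com/v2/storage"
HEADERS = {"X-StorageApi-Token": "your-storage-api-token"}

table_id = "in.c-my-bucket.my-table"  # hypothetical table
payload = {
    "provider": "user",
    "metadata[0][key]": "KBC.datatype.basetype",  # example key; use the affected one
    "metadata[0][value]": "INTEGER",              # the correct value for your table
}

response = requests.post(
    f"{STORAGE_URL}/tables/{table_id}/metadata",
    headers=HEADERS,
    data=payload,
    timeout=30,
)
response.raise_for_status()
```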

We will find all affected metadata and obtain the correct values by "replaying" the metadata update events from Storage. For every piece of user metadata we fix, we will also update its timestamp. While repairing the metadata, we will disable the project for a short time (we expect seconds, or a few minutes at most), during which you will be unable to use it. We apologize for any inconvenience. In the following days, we will add a message (shown on the project dashboard) to the affected projects with the expected date when the repair of the corrupted metadata will start.
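Conceptually, the replay is a fold over the event log in chronological order, keeping the last written value per object, provider, and key. A minimal sketch of the idea (the event shape here is our illustration, not the actual Storage event schema):

```python
from typing import Iterable

def replay_metadata_events(events: Iterable[dict]) -> dict:
    """Rebuild the correct value per (object, provider, key) by applying
    metadata update events in chronological order."""
    state: dict = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        state[(event["object_id"], event["provider"], event["key"])] = event["value"]
    return state
```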

Any changes made to the metadata after the issue was fixed (December 3, 9:03 UTC) will also be taken into account and will not be lost.

Corrupted telemetry data

Dec 9 2021, 11:38 UTC We are currently investigating an issue with corrupted data obtained via our Telemetry Data component (keboola.ex-telemetry-data).

Next update in 60 minutes.

Dec 9 2021, 13:07 UTC We have identified the issue in our telemetry data and fixed it. The issue could cause jobs with no existing configuration not to be assigned to their actual project in the telemetry data.

We have modified the component so that it now loads data using full loads only. To ensure that you have the correct telemetry data, all you need to do is run the extractor (or wait for your pipeline to run it). We will re-implement incremental fetching in the following months.

We are very sorry for any inconvenience caused. 

High error rate in Developer Portal

The service disruption in AWS US is also causing problems in the Developer Portal (apps-api.keboola.com & components.keboola.com). You may see intermittent 5XX errors; refreshing the page can help.

AWS has acknowledged the service disruption and is actively working towards recovery. See https://status.aws.amazon.com/ for more details. Once the AWS service disruption is over, our services should start running smoothly again. Next update in 60 minutes or when new information is available.

UPDATE 19:10 UTC The service disruption in the AWS US region persists. We continue to monitor the situation. Next update in 2 hours.

UPDATE 21:40 UTC The service disruption in the AWS US region is subsiding. Our affected services are showing significant improvement. Next update in 12 hours or as new information is available.

UPDATE Dec 8th, 07:12 UTC Most services in the affected AWS region have already recovered. Our services are operating normally. Next update in 4 hours or as new information is available.

UPDATE Dec 8th, 15:13 UTC We're sorry for the late update. AWS services have already recovered. Everything should be running without any issues now.