Degraded Snowflake Performance (EU region) - March 16, 2020

We are investigating decreased Snowflake performance in the EU region, which unfortunately recurred after the previous resolution. We are in touch with Snowflake support. Next update in three hours.

UPDATE Mar 16, 11:58 CET - Snowflake confirmed the issue and will fix it by the end of the business day today. Meanwhile, the backlog of jobs has cleared, but we still see performance degradation. Next update in 6 hours or when Snowflake resolves the issue.

UPDATE Mar 16, 18:21 CET - Snowflake is performing an update which should resolve the issue. We already see performance improvements and there are no jobs in the backlog. Next update tomorrow morning.

UPDATE Mar 17, 08:08 CET - All operations are back to normal; all jobs were processed during the night without any delays. We are working together with Snowflake to prevent these issues from happening again.

Degraded Snowflake Performance (EU region) - March 2, 2020

We're experiencing degraded Snowflake performance affecting all operations in the EU region.

This problem can affect our users in the following ways:

  • Job execution time may be longer than usual.
  • Jobs are randomly failing with the error message "Result not found".

This is similar to the issue we observed last week, which was caused by an overloaded Snowflake Cloud Service Layer in the EU region.

We have reported it to Snowflake support and will keep you posted as soon as we have an update.

UPDATE Mar 3, 10:43 CET - Snowflake performance has improved for now. We're still working with Snowflake support on mitigating this issue permanently.

UPDATE Mar 4, 19:50 CET - We were hit again today with degraded performance of Snowflake's Cloud Service Layer. Between 3:30 PM and 5 PM CET we observed a large increase in the compilation time of Snowflake queries. After 5 PM it returned to normal. We have shared our latest observations with Snowflake and hope for a quick resolution. We will post any further information here within the next 12 hours.

UPDATE Mar 5, 06:35 CET - We hit Snowflake performance degradation again in the last six hours. Snowflake is working on mitigating the issue at the highest severity. We will provide an update in three hours.

UPDATE Mar 5, 09:49 CET - Performance degradation peaks still occur. We are in touch with Snowflake about a resolution. We will provide an update in three hours.

UPDATE Mar 5, 13:43 CET - Performance has not changed over the past few hours and the degradation persists. We will provide an update in three hours.

UPDATE Mar 5, 16:18 CET - We've been working with Snowflake support on mitigating the performance degradation. Our monitoring suggests that the Keboola platform in the EU region has stabilised since around 14:00. We will provide the next update in three hours.

UPDATE Mar 5, 16:43 CET - The issue is a highest-priority case with Snowflake support. We're closely collaborating on finding a permanent solution to the performance degradation. Status update from Snowflake:

Snowflake acknowledges the performance degradation issue reported by Keboola Czech SRO since February 24, 2020 and tracks it under the case #00097987 as a critical incident.

Snowflake Support together with Snowflake Engineering are continuously collaborating and conducting an in-depth investigation to identify the root cause and provide a viable solution to you on priority.

At this time, additional resources have been allocated in the Snowflake Services Layer which should show improved performance values.

We will provide the next update in three hours.

UPDATE Mar 5, 20:55 CET - The performance is now stabilised and jobs are running as expected. We are monitoring the situation and will provide the next update in 12 hours.

UPDATE Mar 6, 09:23 CET - The performance worsened slightly during the night job peak. We are monitoring the situation and will provide the next update in 3 hours.

UPDATE Mar 6, 12:55 CET - The Snowflake Engineering team made some optimizations of the Cloud Service Layer for our account. Since 7:40 AM CET there have been no "Result not found" errors, and job performance should return to normal.

We will continue monitoring the situation and will provide another update in 6 hours.

UPDATE Mar 6, 21:45 CET - All operations are back to normal. Thanks for your patience and understanding.

Degraded Snowflake Performance (EU region)

We're experiencing degraded Snowflake performance affecting all operations in the EU region.

We are investigating this issue and will keep you posted as soon as we have an update.


UPDATE Feb 24, 17:07 CET - A more detailed investigation shows a rapid increase in the SQL compilation time of queries on the Snowflake side. We have reported it to Snowflake Support and are waiting for their response.

UPDATE Feb 25, 9:00 AM CET - Snowflake support has confirmed a high workload in their Cloud Service Layer yesterday. Their engineering team has taken steps to mitigate the issue.

UPDATE Feb 25, 2:05 PM CET - After a detailed investigation we found that job performance was impacted between 8 AM and 6 PM CET yesterday. We are still monitoring the situation. Please contact our support if your jobs still have any performance problems.


Week in Review — February 19, 2020

New Components

  • Mailgun v2 — Mailgun is an email automation service. It offers a complete cloud-based email service for sending, receiving and tracking email sent through your websites and applications.
  • Drive CX — Downloads Drive CX data about locations, purchase details, employees, and surveys.
  • AzureML Model Deployment — Allows you to deploy your trained model to Azure Container Instances and query it via an API.
  • MS SharePoint Lists writer — SharePoint empowers teamwork with dynamic and productive team sites. A list in SharePoint is a collection of data that gives you and your co-workers a flexible way to organize information.

New Features

  • Orchestrator supports setting the timezone over the API (support in Keboola Connection's UI will follow soon).
  • MongoDB extractor supports the SRV connection format (the so-called seed list) for specifying a connection.
  • Snowflake Data Warehouse Manager now creates schemas with "MANAGED ACCESS" and grants privileges "ON FUTURE" on all objects in the schema. This means that recreating tables (drop and load) no longer requires re-granting access to all previously granted roles/users.
  • ThoughtSpot writer now checks if a database and a schema exist.
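To illustrate why future grants remove the need for re-granting, here is a minimal Python sketch that only builds the Snowflake SQL statements involved. The database, schema and role names and the privilege list are hypothetical; the post does not say exactly which privileges the Snowflake Data Warehouse Manager grants.

```python
from typing import List, Sequence


def managed_schema_statements(database: str, schema: str, roles: Sequence[str],
                              privileges: Sequence[str] = ("SELECT",)) -> List[str]:
    """Build SQL for a managed-access schema plus future-table grants per role.

    Hypothetical helper: names and privileges are illustrative assumptions,
    not the actual Snowflake Data Warehouse Manager implementation.
    """
    fq_schema = f"{database}.{schema}"
    statements = [f"CREATE SCHEMA IF NOT EXISTS {fq_schema} WITH MANAGED ACCESS;"]
    for role in roles:
        for privilege in privileges:
            # A future grant is attached to the schema, so tables created later
            # (e.g. after a drop and load) receive the privilege automatically.
            statements.append(
                f"GRANT {privilege} ON FUTURE TABLES IN SCHEMA {fq_schema} TO ROLE {role};"
            )
    return statements


if __name__ == "__main__":
    for sql in managed_schema_statements("MY_DB", "MY_SCHEMA", ["ANALYST"]):
        print(sql)
```

Because the `ON FUTURE` grant belongs to the schema rather than to an individual table, a recreated table inherits the same privileges. In a managed-access schema, only the schema owner (or a role with the MANAGE GRANTS privilege) can grant privileges on objects in it, rather than each object's owner.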

UI Improvements

  • When creating a new input mapping in a Snowflake transformation, the Clone load type is used if the source table is cloneable and larger than 100 megabytes.
  • JSON editor now shows all fields, not only the required ones.
  • Transformation detail page warns about unsaved queries/scripts if you attempt to run a transformation or leave/close a page.

  • Orchestration detail page supports loading older orchestration jobs.

  • In the transformation detail, you can see the basic input mapping settings without opening the detail modal.

  • Create buttons (table, column) in the Storage explorer were moved to the top right corner.

  • Transformation Output Mapping modal is now more compact.

  • Organization selector shows the Maintainer name (e.g., "KBC Internal").
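The default load type rule for new Snowflake input mappings can be summarised as a small function. This is a hypothetical sketch, not the UI's code; it assumes "copy" is the fallback load type and that 100 megabytes means 100 × 1024 × 1024 bytes.

```python
# Assumption: 100 megabytes interpreted as 100 * 1024 * 1024 bytes.
CLONE_THRESHOLD_BYTES = 100 * 1024 * 1024


def default_load_type(source_is_cloneable: bool, source_size_bytes: int) -> str:
    """Pick the default load type for a new Snowflake input mapping (hypothetical helper)."""
    # Clone only when the table supports it and is strictly larger than the threshold.
    if source_is_cloneable and source_size_bytes > CLONE_THRESHOLD_BYTES:
        return "clone"
    # Assumed fallback for small or non-cloneable tables.
    return "copy"
```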

  

Minor Improvements

  • The Python environment is now 3.8.1 and includes Java and H2O.
  • Jupyter notebooks now save notebook files to Keboola Storage — both on a manual and auto save.
  • Input mappings that use CSV files (e.g., Python/R transformations) now have a global limit of 100 GB per table.
  • GoodData extractor uses more verbose logging.
  • Most emails from Keboola Connection have a new design.

Broken Loads from 2020-01-28 to 2020-01-29 [post-mortem]

Summary

On 2020-01-28 09:00 UTC, we deployed a version of Keboola Connection containing a bug. As a result, loads from transformations to Storage were missing our internal _timestamp value. The issue was hard to detect and persisted until 2020-01-29 08:00 UTC. A backfill completed at 2020-01-31 16:30 UTC set all missing _timestamp fields to 2020-01-29 00:00:00 UTC. Because the affected tables lacked _timestamp values, jobs that used them for incremental loading had no reference point for the newest data.

What Happened?

There was an error in our upgrade of the library responsible for loads: an incorrectly set parameter resulted in timestamps not being set during loads. This scenario was not covered by our tests, and it was not caught during our peer review process. As soon as the problem was identified, we redeployed the previous working version of Keboola Connection, which took about 15 minutes. Because the issue affected some customers' data, the backfill was carefully discussed and tested. Unfortunately, we were also impacted by an incident in the third-party build system we use, which prevented us from performing the backfill of the missing timestamps on January 30. Finally, all impacted projects were backfilled between 2020-01-31 09:30 UTC and 2020-01-31 16:30 UTC.

Timetable

  • 2020-01-28 09:00 UTC - Version containing the bug deployed
  • 2020-01-29 08:00 UTC - Rollback
  • 2020-01-29 - Investigation of the issue, impact assessment
  • 2020-01-30 - Testing of the backfill
  • 2020-01-31 09:30 UTC - Start of the data backfill
  • 2020-01-31 16:30 UTC - Data backfill done

What Are We Doing About It?

We're extending our software tests to cover more scenarios, including a test of _timestamp presence for all load types. We're also working on improving our public incident response to post more frequent updates.

Original status post for the issue: https://status.keboola.com/investigating-problems-with-incremental-lods

If your data was affected by this issue, our backfill does not cover your specific case, and you are not yet in contact with our support, feel free to get in touch. Our professional services team will provide all necessary help.

Week in Review — January 30, 2020

New Features and UI Improvements

  • There is a new UI picker that lets you easily switch between projects and their organizations.

  • The transformation bucket detail was tweaked: the button for adding transformations is easier to reach, phases and backends are better arranged, and the design of the transformation list was unified with similar lists in other components.

  • Sidebars across the whole application were unified and polished.

  • Editing descriptions across the whole application was polished too.

  • Input mappings in a transformation detail can be reordered using drag & drop.

Updated Components

  • Asana Extractor has a new template for getting projects and tasks in detail.
  • Typeform Extractor was updated to download data from the Responses API instead of the Data API.

New Components

  • Microsoft Dynamics 365 Extractor and Writer by our KDS team allow downloading and sending data to this CRM tool.
  • Microsoft SharePoint Lists Extractor by the KDS team allows downloading collections of data from SharePoint lists.
  • Azure Blob Storage Extractor and Writer by the KDS team allow reading and writing data to Azure Storage, similarly to our S3 Writer.
  • OneDrive Writer by Jakub Bartel uploads files from Storage to a OneDrive or SharePoint account.

Investigating problems with incremental loads

We are investigating problems in Storage which could lead to incorrect processing of incremental loads and related functionality. We will keep you updated.

UPDATE Jan 29 9:45 CET - The problem is that incremental loads from transformations are not setting the _timestamp column. We have identified the release causing the trouble and finished reverting it. Unfortunately, table rows already created with null _timestamp values still won't be processed by incremental loading. You can work around this manually by running a full load. We are now running tests to isolate the root cause and preparing a global correction of its side effects.
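A simplified model of why rows with a null _timestamp were skipped: assuming an incremental load selects only rows whose _timestamp is at or after a cut-off, rows without a timestamp never match, while a full load takes every row. The row shape and the function names are hypothetical; this is not Keboola Storage code.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional

# Hypothetical row shape: only the _timestamp field matters for this illustration.
Row = Dict[str, object]


def incremental_rows(rows: List[Row], changed_since: datetime) -> List[Row]:
    """Rows an incremental load would pick up: _timestamp set and not older than the cut-off."""
    selected = []
    for row in rows:
        ts: Optional[datetime] = row.get("_timestamp")  # type: ignore[assignment]
        # A missing (None) timestamp can never satisfy the cut-off, so the row is skipped.
        if ts is not None and ts >= changed_since:
            selected.append(row)
    return selected


def full_load_rows(rows: List[Row]) -> List[Row]:
    """A full load ignores _timestamp and takes every row."""
    return list(rows)
```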

UPDATE Jan 29 14:00 CET - The _timestamp values have been written correctly for all newly added table rows since this morning's release revert. The problem now persists only for rows added earlier (approximately during yesterday and last night), which still contain empty timestamp values and therefore won't be processed, e.g., by writers configured for incremental loads. (You can work around this temporarily by switching them to full load.) We are now working on backfilling the missing timestamp values, which will fix incremental functionality for the affected data.

UPDATE Jan 29 18:00 CET - The backfill of missing timestamps is being prepared and should be ready tomorrow.

UPDATE Jan 30 15:00 CET - The backfill mechanism has been released. We are now performing final checks before applying it to all affected data.

UPDATE Jan 30 16:30 CET - Unfortunately, we have to postpone the backfill until tomorrow. An incident in Travis CI has been slowing down our work all day and is still not resolved. We decided not to risk running the backfill hastily overnight, so that we can support it properly if anything goes wrong.

UPDATE Jan 31 10:30 CET - All preparations are done and preliminary tests are finished. We are now starting the backfill.

UPDATE Jan 31 12:00 CET - The backfill is running and ca. 10 % of affected data is fixed.

UPDATE Jan 31 14:00 CET - The backfill is running and ca. 20 % of affected data is fixed.

UPDATE Jan 31 15:30 CET - The backfill is still running and ca. 25 % of the affected data in the US region and ca. 40 % in the EU region are fixed.

UPDATE Jan 31 17:00 CET - The backfill is still running and ca. 70 % of the affected data in the US region and ca. 100 % in the EU region are fixed.

UPDATE Jan 31 17:30 CET - The backfill is finally finished. All affected data are fixed and have the _timestamp column set to "2020-01-29 00:00:00" (UTC). Therefore, if your incremental loads are set to a window shorter than 3 days, you should update your settings so that the backfilled rows are picked up.
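The 3-day threshold follows from simple arithmetic: the backfill finished at 2020-01-31 16:30 UTC, which is 2 days and 16.5 hours after the backfilled value of 2020-01-29 00:00:00 UTC. The sketch below assumes an incremental window measured backwards from the time a job runs; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Value written to all previously missing _timestamp fields (from the post-mortem).
BACKFILL_TIMESTAMP = datetime(2020, 1, 29, 0, 0, 0, tzinfo=timezone.utc)


def window_includes_backfill(run_time: datetime, window: timedelta) -> bool:
    """True if an incremental window ending at run_time reaches back to the backfilled value.

    Hypothetical helper: assumes the window is "rows changed within the last `window`".
    """
    return run_time - window <= BACKFILL_TIMESTAMP


if __name__ == "__main__":
    backfill_done = datetime(2020, 1, 31, 16, 30, tzinfo=timezone.utc)
    print(backfill_done - BACKFILL_TIMESTAMP)  # 2 days, 16:30:00
    print(window_includes_backfill(backfill_done, timedelta(days=3)))  # True
    print(window_includes_backfill(backfill_done, timedelta(days=2)))  # False
```

Any job running later than the backfill completion needs an even wider window, which is why a value of at least 3 days is the safe minimum right after the backfill.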

SQL Validation Outage

The SQLdep API suddenly stopped working; as a result, transformation validation and SQL analysis are not working properly. (You will probably get false negatives for all validation requests.)

We are investigating the issue and will let you know once we know more.

UPDATE: We've resolved the issue and the integration with the service is working again. The cause was a change in the SQLdep API address, which we had unfortunately missed.

Week in Review — January 24, 2020

Updated Components

Tableau TDE writer

  • will now retry failed requests. This should greatly improve reliability, as a network glitch or temporary unavailability of the remote server will no longer fail the job; instead, the writer waits briefly and retries the request.
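A retry with a growing wait between attempts can be sketched as follows. The post does not describe the writer's actual policy, so the number of attempts, the exponential backoff with jitter, and the choice of ConnectionError and TimeoutError as transient errors are all assumptions; `call_with_retries` is a hypothetical helper.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(request: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `request`, retrying transient errors with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return request()
        except (ConnectionError, TimeoutError):  # assumed transient failures
            if attempt >= max_attempts:
                raise  # give up and surface the last error, so the job fails visibly
            # Wait 1x, 2x, 4x ... base_delay, plus up to 50 % random jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
            attempt += 1
```

The `sleep` parameter is injectable so that tests can record the waits instead of actually sleeping.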

New Features

  • Every new Keboola Connection user will now get Guide mode, regardless of whether they were invited by an existing user or registered themselves. It guides them hands-on through the basics of using Keboola Connection.

Minor Improvements

  • In Transformation detail, Backend and Phase can be changed from the sidebar

Server errors in US region

There was an increased number of server errors in the US region, which caused issues in the Keboola Connection UI. No jobs were affected.

Jan 22 15:50 UTC - increased number of server errors

Jan 22 16:08 UTC - the issue is mostly resolved

Jan 22 16:19 UTC - the issue is completely resolved

Jan 22 18:28 UTC - the post was amended; the original version mistakenly said the EU region was affected, while this was actually an issue in the US region.