Increase In Configuration Errors

2021-05-10 13:25 CET

We have noticed a slight increase in job failures for some components since this morning's release of the job runner.
We are investigating the root cause of the issue and will update when more information becomes available.

2021-05-10 14:35 CET

All systems are back to fully operational status. We are continuing to monitor for further instances of this error, and we are working on preventive measures to reduce the impact of this type of incident in the future.

Failing Facebook Ads extractor

Since 2021-04-28 01:00 UTC we have been experiencing Facebook Ads extractor failures with the error "Please reduce the amount of data you're asking for, then retry your request".

It is caused by a bug in the Facebook API that has been reported and is currently being investigated by the Facebook backend team.

Link to the Facebook bug ticket: https://developers.facebook.com/support/bugs/503443564145524/

We continue to watch the Facebook bug ticket and will update here once we know more.

What can be done now

If you have access to the Facebook bug report, you can raise its importance/severity by leaving a comment there.

One possible workaround is to retrieve data with the smallest window possible, i.e. adding the .date_preset(yesterday) parameter to the query, e.g.:

insights.action_attribution_windows(28d_click).action_breakdowns(action_type).level(adset).date_preset(yesterday).time_increment(1)
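
For illustration only, here is a rough PHP sketch of what an equivalent request against the Graph API /insights endpoint could look like with the reduced window; the API version, ad account ID and access token are placeholders, not values from any real configuration:

<?php
// Hypothetical direct Graph API call using the smallest possible window (date_preset=yesterday).
// Replace <AD_ACCOUNT_ID> and <ACCESS_TOKEN> with real values; the API version is illustrative.
$params = http_build_query([
    'level'                      => 'adset',
    'action_breakdowns'          => 'action_type',
    'action_attribution_windows' => '["28d_click"]',
    'date_preset'                => 'yesterday',   // the workaround: request only yesterday's data
    'time_increment'             => 1,
    'access_token'               => '<ACCESS_TOKEN>',
]);
$url = 'https://graph.facebook.com/v10.0/act_<AD_ACCOUNT_ID>/insights?' . $params;
$insights = json_decode(file_get_contents($url), true);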

Post-mortem: MSSQL extractor errors

This is a post-mortem of the MSSQL extractor errors incident.

We found the root cause: PHP's sorting functions are not guaranteed to be stable. This is fixed in PHP 8.0 (https://wiki.php.net/rfc/stable_sorting), but we used PHP 7.4 in the extractor (which is also still supported).

We have learned that in older versions of PHP, a sort function can arbitrarily swap elements with the same value if there are more than 16 of them. As the error does not occur with a lower number of items, our tests did not catch it.
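
For illustration, a minimal sketch of the behaviour (assuming PHP 7.4, which falls back to an unstable sort for arrays with more than 16 elements; the column names are made up):

<?php
// Build more than 16 "columns" that all compare as equal under the sort callback.
$columns = [];
for ($i = 1; $i <= 20; $i++) {
    $columns[] = ['ord' => 0, 'name' => "col_$i"];   // same sort key for every element
}

// On PHP < 8.0 sorting is not guaranteed to be stable, so elements that compare
// as equal may end up in a different relative order after the sort.
usort($columns, fn ($a, $b) => $a['ord'] <=> $b['ord']);

// On PHP 8.0+ this always prints col_1 ... col_20 in the original order;
// on PHP 7.4 some columns may come out swapped.
echo implode(',', array_column($columns, 'name')), PHP_EOL;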

We've fixed the bug and added tests for sorting more than 16 items.

Snowflake transformation errors in Azure North Europe

Since 2021-04-21 10:00 UTC we have been seeing an increased error rate in Snowflake connections. Users may experience failed Snowflake transformation jobs. We're investigating the root cause.

UPDATE 18:00 UTC: The last error occurred on 2021-04-22 11:24 UTC and we haven't seen any further failures since then. All operations are back to normal. We're in touch with Snowflake support to find the root cause and prevent this from happening in the future.

We're sorry for the inconvenience and thank you for your understanding.

Increased API error rate in Azure North Europe

Since 2021-04-20 18:30 UTC we have been experiencing an increased error rate on all APIs in Azure North Europe. Our engineering team is working to identify the root cause. Next update in 1 hour.

UPDATE 08:45 UTC: We have restarted a faulty container and the situation seems to have stabilised. Next update in 1 hour.

UPDATE 09:00 UTC: The increased error rate might have caused delays in job processing. 

UPDATE 09:50 UTC: The container does not show any further symptoms of the failure; all operations are back to normal.

Snowflake Extractor Errors

We have discovered that the Snowflake Extractor may be suffering from a problem similar to the MSSQL extractor issue reported earlier.

The issue affects only configurations using a custom query. If the extracted table contains more than 16 columns, the data may have been swapped between the table columns. The issue was live between 2021-04-19 14:51 and 2021-04-20 11:17 UTC.

The extraction itself does not manifest any error; the problem can only be discovered later in the pipeline. We're working on finding all affected tables and will contact you through our support system with possible repair options in case your project is affected.

We sincerely apologize for this problem. A postmortem with the root cause will be published once we resolve the data issues.


April 27, 2021, 06:50 UTC We found out that no job met the conditions for the error to take effect (even though the error was present in the code, and we have already fixed it).

Data loading errors when loading data to workspaces

April 20, 2021, 07:32 UTC We received reports about errors when loading data to workspaces. We're investigating. Next update in 15 minutes or sooner. 

April 20, 2021, 07:48 UTC Only AWS multi-tenant stacks are affected by the loading errors. We're still investigating. Next update in 15 minutes or sooner.

April 20, 2021, 08:03 UTC Only manual data loads from UI are affected, orchestrations and jobs work normally. Next update in 15 minutes or sooner.

April 20, 2021, 08:30 UTC We are working on a fix. Next update in 1 hour or sooner. 

April 20, 2021, 09:11 UTC The issue is resolved. You need to reload your browser window for the fix to load. 

MSSQL extractor errors

Yesterday we released a new version of the MSSQL extractor in which a bug was present that caused failed jobs - the data does not end up lining up with the headers.

UPDATE 04:12 UTC: We have rolled back to the previous version. All affected configurations should be working.

We sincerely apologize for the errors. A postmortem report will follow with further details.

UPDATE 10:47 UTC: We'll be contacting all customers possibly affected by this error through support.

Job delays in AWS EU region

Apr 14 21:30 UTC: Backlog is cleared and all operations are back to normal. The incident is now resolved.

Apr 14 20:15 UTC: We mitigated the issue; the backlog should be cleared in 20 minutes. We continue to monitor the situation. Next update in one hour or when new information becomes available.

Apr 14 19:06 UTC: We've identified the problem and added more capacity for faster backlog processing. Next update in one hour or when new information becomes available.

Apr 14 18:30 UTC: We are investigating job delays in the AWS EU region. We are working on resolving the situation and will keep you posted.

Job errors in Azure North Europe region

Apr 14 18:55 UTC We don't see any connectivity failures now, hence we expect this to be resolved. To sum it up, the failing connections happened between 18:00 and 18:15 UTC and caused higher job error rates and job delays in the Azure North Europe region.

Apr 14 18:25 UTC: The system seems to be back to normal; however, we continue to investigate the root cause. Preliminary investigation shows there was a temporary network connection failure to one of our metadata databases that caused the increased job error rate.

Apr 14 18:15 UTC: We are investigating a higher job error rate and job delays in the Azure North Europe region. We are working on resolving the situation and will keep you posted.