Week in Review -- December 12, 2016

Howdy everybody, 

Here are some highlights from the past week:

  • The new parallel uploading technique we mentioned last week saw a significant performance improvement.  We're now able to load 60 GB from a foreign S3 bucket (outside of KBC) into a Snowflake Storage bucket in approximately 33 minutes (roughly 30 MB/s).  And the only direction it's going is faster.
  • There are newly available Storage API endpoints for key/value metadata; a short sketch of calling them follows this list.  Feel free to check out the API documentation for further info.
  • There's a new post on our blog: Querying JSON ad-hoc in Snowflake
  • We've updated the JSON parser Chrome extension for Papertrail, and it's available in the Chrome Web Store.
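
To give a concrete idea of how the new metadata endpoints can be used, here is a minimal sketch in Python. It assumes the table-metadata endpoint path and the form-encoded payload style described in the API documentation; the token, table ID and metadata values are placeholders, so treat the docs as the authoritative reference.

    import requests

    TOKEN = "your-storage-api-token"        # placeholder
    TABLE_ID = "in.c-main.customers"        # placeholder
    BASE = "https://connection.keboola.com/v2/storage"
    HEADERS = {"X-StorageApi-Token": TOKEN}

    # Attach a key/value metadata entry to a table.
    resp = requests.post(
        "{}/tables/{}/metadata".format(BASE, TABLE_ID),
        headers=HEADERS,
        data={
            "provider": "user",
            "metadata[0][key]": "data_owner",
            "metadata[0][value]": "analytics team",
        },
    )
    resp.raise_for_status()

    # Read the metadata back.
    print(requests.get("{}/tables/{}/metadata".format(BASE, TABLE_ID), headers=HEADERS).json())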

Be sure to tune in again next week for more updates!


 

Week In Review -- December 5, 2016

Here's a list of the most important changes we made last week:

- Docker runner supports headless CSV: If columns are specified in the output mapping (in the manifest file or in the configuration object), then the corresponding CSV file is considered to have no header (a combined sketch of this and sliced files follows this list). More details

- Docker runner supports sliced output CSV files: More than one CSV file can be mapped to a single output table, and all such files are uploaded to Storage in parallel. This way, files bigger than 5 GB can be uploaded to Storage. More details

- A little cherry on top: adding a new table in the Database writer UI now maps the Storage table name to the database table name instead of the table ID.
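
Below is a minimal sketch of how the two Docker runner features above fit together on the output side of a component: a sliced table written as several header-less CSV parts into a directory, plus a manifest that lists the columns. It is written in Python; the table, file and column names are made up, and the linked documentation remains the authoritative description of the manifest format.

    import csv
    import json
    import os

    # A sliced table is a directory of CSV parts under /data/out/tables/.
    out_dir = "/data/out/tables/events.csv"
    os.makedirs(out_dir, exist_ok=True)

    rows = [("1", "signup"), ("2", "purchase")]

    # Write two header-less slices; the runner uploads them to Storage in parallel.
    for i, chunk in enumerate([rows[:1], rows[1:]]):
        with open(os.path.join(out_dir, "part-{}.csv".format(i)), "w") as f:
            csv.writer(f).writerows(chunk)

    # Because the manifest declares the columns, the CSV parts are treated as headless.
    manifest = {
        "destination": "in.c-main.events",
        "columns": ["id", "event"],
    }
    with open(out_dir + ".manifest", "w") as f:
        json.dump(manifest, f)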

Happy Monday, and have a great week ahead!


Generic Extractor Failures

We're encountering a series of "Found orphaned table manifest" errors in Generic Extractor. We have identified the root cause and are reverting the latest changes to restore it to a fully working state. We will restart all affected orchestrations.

We'll update this post when the fix is deployed.

We're sorry for the inconvenience.

UPDATE 7:25pm CEST: The fix has been deployed to production, and we're restarting all failed orchestrations.

Job failures

There were job failures between 10:30 AM and 12:50 PM caused by low disk space on one of the job workers.

We're sorry for the inconvenience.

YouTube Reporting API - Extractor update (v2)

It's my great pleasure to announce another major update to one of the first components I ever built: the YouTube Reporting API extractor.

The YouTube Reporting API offers a very simple way to download daily reports that belong to the Content Owner of a YouTube channel (in other words, your channel account must be authorised by Google if you want to download data from this API). These reports are generated by defined jobs, and all you need to do is download the results. That is exactly what the extractor was built for.

Because the general process is very simple, the first version of this extractor was completed in a very short time. However, while using the extractor in production, we found that Google occasionally triggers background actions that generate additional reports, which broke the original logic and produced incorrect results (caused by the logic around merge operations). For that reason, the first version of this extractor was not very useful for production deployment.

Based on that experience, I really wanted to fix the problematic parts of the original version and turn this extractor into a project that is fun to use. I believe I made it, and I am extremely proud of what I achieved in this update.

You can read the full description in the documentation. In a nutshell, this extractor downloads reports generated by jobs, and it comes with lots of extra features that help you manage these downloads in a very convenient way. For example, the configuration requirements of the first version have been reduced significantly, and several options for creating a backup (S3) have been added. Most importantly, all data should now be downloaded correctly.
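
To illustrate the flow the extractor automates (list the reporting jobs, list the reports each job has produced, then download each report file), here is a rough sketch using Google's Python API client. The credential setup is omitted, the content owner ID is a placeholder, and the real extractor of course does considerably more (state handling, merging, backups).

    import io

    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseDownload

    # `credentials` must belong to an account authorised for the Content Owner.
    credentials = ...  # obtain via OAuth 2.0, omitted here

    reporting = build("youtubereporting", "v1", credentials=credentials)

    # List the reporting jobs defined for the content owner.
    jobs = reporting.jobs().list(onBehalfOfContentOwner="CONTENT_OWNER_ID").execute()

    for job in jobs.get("jobs", []):
        # Each job periodically produces reports; list what is currently available.
        reports = reporting.jobs().reports().list(
            jobId=job["id"], onBehalfOfContentOwner="CONTENT_OWNER_ID"
        ).execute()
        for report in reports.get("reports", []):
            # Download the report media (a CSV file) into an in-memory buffer.
            request = reporting.media().download(resourceName=" ")
            request.uri = report["downloadUrl"]
            buf = io.BytesIO()
            downloader = MediaIoBaseDownload(buf, request)
            done = False
            while not done:
                _, done = downloader.next_chunk()
            # ... hand `buf` over to whatever stores the CSV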

This extractor is developed independently by Blue Sky Media. For more information on how to use it, please refer to the documentation. If you run into any issues or have further questions, please contact me directly.

New Segment.io S3 extractor

Imagine you are building a new web app. You want to measure all the events in your app. Maybe you use Segment.io as a tool for sending those events to many destinations like Google Analytics, etc. Then you realise you want to have all the data in one place (= Keboola Connection). How do you send events from Segment.io to KBC?

The solution is simple: just turn on the Segment.io S3 integration and all your events will be pushed to your own S3 bucket. Since the Segment S3 integration uses a specific structure in S3 (each day has its own “folder”, and logs are written approximately every hour into separate files), we have developed a custom Segment S3 extractor that will save you time; after a simple configuration, you can get all your events into KBC instantly.
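
As a rough illustration of what the extractor handles for you, the sketch below lists one day's worth of Segment log files from a bucket and parses the gzipped, newline-delimited JSON events. The bucket name and the exact prefix layout are assumptions here; check your own bucket (and the Segment docs) for the real structure.

    import gzip
    import json

    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-segment-logs"                            # made-up bucket name
    PREFIX = "segment-logs/MY_SOURCE_ID/1481500800000/"   # assumed per-day "folder"

    events = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            # Segment writes one JSON event per line, gzip-compressed.
            for line in gzip.decompress(body).splitlines():
                events.append(json.loads(line))

    print("{} events downloaded".format(len(events)))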

The data is downloaded in JSON format, which might be a bit tricky if you don't use the Snowflake backend. If you do use Snowflake, though, processing is super easy: you can extract all the data from JSON into columnar format and use it in your ETL.
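
For example, assuming the raw event JSON ends up in a string column called DATA in a table called SEGMENT_EVENTS (both made-up names), flattening a few fields into typed columns in Snowflake can look roughly like this, run here through the Snowflake Python connector:

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="your_account",
        user="your_user",
        password="your_password",
        warehouse="your_warehouse",
        database="your_database",
        schema="your_schema",
    )

    # Cast selected JSON fields to typed columns; field names are illustrative.
    query = """
        SELECT
            PARSE_JSON(data):"messageId"::string        AS message_id,
            PARSE_JSON(data):"event"::string            AS event,
            PARSE_JSON(data):"timestamp"::timestamp_ntz AS event_time
        FROM segment_events
    """

    for row in conn.cursor().execute(query):
        print(row)

    conn.close()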

If you have any questions, contact support@bizztreat.com.

Revision of Database Writers, new Impala Writer

We have released new versions of these database writers:

  • MySQL
  • Microsoft SQL Server
  • Redshift
  • Oracle

We are also introducing a new Cloudera Impala database writer.

All of these writers run on the container-based architecture and support SSH tunnels.

The Database writer and the old version of the MSSQL writer are now marked as deprecated. We will continue to support them for at least 3 months. After this period, we will migrate any remaining old configurations to the new versions.

We are now preparing a migration tool to help you migrate your existing configurations to the new versions.

If you have any questions or need help, please contact us at support@keboola.com.