How To Set Up Snowplow Analytics on AWS



Snowplow is an open-source event data collection platform for teams that manage the collection and warehousing of data (e.g. data teams) across all of their channels and platforms in real time, powered by a data pipeline. Snowplow can collect event data or raw clickstream data from the front end of an application (configured with trackers and webhooks), and it can track and record events on cloud infrastructure.

Snowplow Analytics provides an event analytics platform that enables clients to collect granular, customer-level and event-level data from multiple platforms, including mobile and web, and load that data into structured data stores (e.g. Redshift) to support advanced data analytics.

Why Use Snowplow?

The Snowplow data pipeline uses a modular architecture that lets you choose which parts to implement. The central object is an event, which is not limited to clickstream data or web-page views but can be pretty much anything. Every event is processed through multiple stages of the pipeline.

    The architecture of a Snowplow data pipeline is built from the components explained below:

    • TRACKERS: These are libraries written in Python, Unity, Objective-C, JavaScript, and other languages that allow you to send events to a Snowplow collector with one line of code.

    • WEBHOOKS: Using the webhooks of third-party software, users can collect events. Webhooks allow third-party software to send their own internal event streams to Snowplow collectors to be captured. Webhooks are sometimes also referred to as “HTTP response APIs” or “streaming APIs”.

    • COLLECTOR: The collector receives event-level data from the trackers configured on applications and stores it for processing. If the pipeline is real-time, the data is then passed on to Enrich.

    • ENRICH: This stage processes the data stored by the collector at regular time intervals (or in real time).

    • STORAGE: The enriched data is loaded into storage, e.g. Redshift or S3.

    • DATA MODELING: Once your data is stored in S3 and Redshift, you have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to analyze this event stream directly, data modeling is the process of joining and aggregating event-level data into smaller data sets that are easier to work with.

    Advantages of Snowplow

    • Cost-effective: Snowplow is an open-source tool, so users only pay for the AWS infrastructure cost.
    • Highly customizable: You can easily add custom trackers, webhooks, enrichments, data models, or metrics that suit your business.
    • Ownership of data: With Snowplow you have complete ownership of your data, as it runs on your own stack under your own AWS account.


    Prerequisites

    • An AWS account.
    • A server that is up and running.
    • A Git client – not necessary, but helpful for cloning the Snowplow repo.
    • A domain name whose DNS records you can modify.

    Steps to Set Up Snowplow Analytics on AWS

    Follow the steps below to set up a data pipeline for Snowplow Analytics:

    1. Create and set up the AWS environment
    2. Install and configure a Snowplow collector, e.g. the Scala Stream Collector
    3. Set up trackers or webhooks
    4. Set up Enrich
    5. Set up data storage, e.g. Redshift
    6. Data modeling in Redshift
    7. Analyzing the data

    Step-1: Create and set up the AWS environment

    Create a free AWS account, which gives you access to the free tier of services; some of these will help keep costs down for this pipeline too.

    After this, the first step in creating the pipeline is to create an Identity and Access Management (IAM) user.

    • In Services, select IAM.
    • Select Groups in the left-hand menu and create a new group.
    • Name your new group (e.g. zehncloud-snowplow-group).
    • Create a new policy and paste in a JSON policy document granting the permissions this setup needs.
    • Name this policy, e.g. zehncloud-snowplow-setup-infrastructure-policy.
    • Attach the above policy to the newly created group.
    • Create a new user, e.g. zehncloud-snowplow-setup.
    • Check the box next to Programmatic access.
    • Copy the Access Key ID and Secret Access Key and save them somewhere safe.
    • Add this user to the above group.
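    The policy JSON itself is not reproduced in this article. As an illustration only, a deliberately broad policy covering the AWS services this guide touches (EC2, S3, Kinesis, EMR, Redshift) might look like the following; the Sid and the action list are assumptions, and you should scope permissions down for production use:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "SnowplowSetupInfrastructure",
      "Effect": "Allow",
      "Action": [
        "ec2:*",
        "s3:*",
        "kinesis:*",
        "elasticmapreduce:*",
        "redshift:*",
        "iam:PassRole"
      ],
      "Resource": "*"
    }
  ]
}
```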

    Step-2: Install and configure Snowplow Collector

    To install and run the collector, you need to launch an EC2 instance. Once the instance is launched, install Java and a Git client on it to proceed with the collector installation.

    There are three types of collectors:

    • Scala Stream Collector
    • Clojure Collector (deprecated)
    • CloudFront Collector (deprecated)

    As the other two are deprecated, we will install the Scala Stream Collector.

    Install all the dependencies

    Installing the Scala Stream Collector

    There are two ways of installing the Scala Stream Collector:

    1. Using Java
    2. Using Docker

    We will use the first way, i.e. using Java. For this, we need the jar file: you can either download the jar file directly from this URL or compile it from source.

    Alternatively, you can build it from the source files. To do so, you need to install Scala and sbt; follow the steps below:

    • Install Java
    • Install Scala
    • Install sbt
    • Compile the Scala Stream Collector from source using sbt and build an assembled jar file with all the dependencies

    The build is run per targeted platform (for recent versions, something like sbt "project stdout" assembly), where the targeted platform can be: kinesis, google-pubsub, kafka, nsq, or stdout. I have set up the collector on AWS infrastructure, hence the targeted platform is stdout in my case.

    The jar file will be saved as snowplow-scala-collector-[targeted platform]-[version].jar in the [targeted platform]/target/scala-2.11 subdirectory – it is now ready to be deployed.


    Download the template configuration file config.hocon.sample from GitHub and create a .conf file from it. Follow this link to configure the different options available for the Scala Stream Collector.
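    As a rough sketch of what such a .conf file contains (option names differ between collector versions, so the template from GitHub is the authoritative reference; the stream names below are placeholders):

```hocon
# Abridged, illustrative collector configuration.
collector {
  interface = "0.0.0.0"      # listen on all network interfaces
  port = 8080                # port the collector serves on

  streams {
    good = "good-events"     # placeholder name for the raw-events stream
    bad  = "bad-events"      # placeholder name for the bad-rows stream

    sink {
      enabled = "stdout"     # matches the stdout build used in this guide
    }
  }
}
```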

    Run Collector

    The Scala Stream Collector is a jar file. Run it with java and pass the configuration file as a parameter, along the lines of: java -jar snowplow-scala-collector-[targeted platform]-[version].jar --config collector.conf

    The collector should now be running and listening for events on the public IP address of the EC2 instance. Pinging the collector on the /health path should return a 200 OK response.
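    To illustrate the health check without a live deployment, the sketch below stands up a local stub for the collector's /health endpoint and pings it; a real check would hit the EC2 instance's public IP and port instead of the stub's address.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stub standing in for a running collector: it answers 200 on /health
# and 404 elsewhere, mimicking the behavior described above.
class StubCollector(BaseHTTPRequestHandler):
    def do_GET(self):
        status = 200 if self.path == "/health" else 404
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"OK" if status == 200 else b"")

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), StubCollector)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
resp = urllib.request.urlopen(f"http://127.0.0.1:{port}/health")
print(resp.status)  # a healthy collector returns 200
server.shutdown()
```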

    Step-3: Set up a Tracker or Webhooks

    Snowplow collectors receive data from Snowplow trackers. Trackers generate event data and send that data to Snowplow collectors to be captured. The most popular Snowplow tracker to date is the JavaScript Tracker, which is integrated into websites (either directly or via a tag management solution) in the same way as any other web analytics tracker (e.g. Google Analytics or Omniture tags).

    Snowplow allows you to collect events via the webhooks of supported third-party software.

    Webhooks allow this third-party software to send their own internal event streams to Snowplow collectors to be captured. Webhooks are sometimes referred to as “streaming APIs” or “HTTP response APIs”.

    Note: once you have set up a collector and a tracker or webhook, you can pause and perform the remainder of the setup steps later, because your data is already being successfully generated and logged. When you eventually set up Enrich, you will be able to process all the data you have logged since setup.

    Set up your first tracker, e.g. the PHP Tracker, Python Tracker, Java Tracker, etc.
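    In production you would use one of the official tracker libraries, but to make the tracker-to-collector contract concrete, the sketch below hand-builds the kind of GET request a tracker issues for a page view, using field names from the Snowplow tracker protocol (e, url, page, aid, p); the collector host is a placeholder.

```python
from urllib.parse import urlencode

COLLECTOR = "http://collector.example.com"  # placeholder collector host

def page_view_url(page_url, page_title, app_id="my-app"):
    """Build the pixel-style GET URL a tracker would request."""
    params = {
        "e": "pv",           # event type: page view
        "url": page_url,     # URL of the page being viewed
        "page": page_title,  # page title
        "aid": app_id,       # application identifier
        "p": "web",          # platform
    }
    return f"{COLLECTOR}/i?{urlencode(params)}"

print(page_view_url("https://example.com/home", "Home"))
```

Requesting such a URL against a running collector is all a tracker fundamentally does; the libraries add batching, cookies, and richer event types on top.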

    Step-4: Set up Enrich

    The Snowplow enrichment process processes raw events from a collector and:

    1. Cleans up the data into a format that is easier to parse/analyze
    2. Enriches the data (e.g. infers the location of the visitor from his/her IP address and infers the search engine keywords from the query string)
    3. Stores the cleaned, enriched data

    Once you have set up Enrich, the process of taking the raw data generated by the collector and cleaning and enriching it will be automated.
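    As a toy illustration of the enrichment idea (not Snowplow's actual implementation), the sketch below mimics one enrichment mentioned above: pulling search keywords out of a referrer URL's query string. The engine-to-parameter mapping is a simplified assumption.

```python
from urllib.parse import urlparse, parse_qs

# Simplified mapping from search-engine host to its keyword parameter.
SEARCH_PARAMS = {"google.com": "q", "bing.com": "q", "search.yahoo.com": "p"}

def referrer_keywords(referrer_url):
    """Return the search keywords in a referrer URL, or None."""
    parsed = urlparse(referrer_url)
    host = parsed.netloc.removeprefix("www.")
    param = SEARCH_PARAMS.get(host)
    if param is None:
        return None  # not a recognized search engine
    values = parse_qs(parsed.query).get(param)
    return values[0] if values else None

print(referrer_keywords("https://www.google.com/search?q=snowplow+analytics"))
```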

    Step-5: Set up data storage

    Most Snowplow users store their web event data in at least two places: S3 for processing in Spark (e.g. to enable machine learning via MLlib) and a database (e.g. Redshift) for more traditional OLAP analysis.

    The RDB Loader is an EMR step that regularly transfers data from S3 into other databases, e.g. Redshift. If you only wish to process your data using Spark on EMR, you do not need to set up the RDB Loader. However, if you would find it convenient to have your data in another data store (e.g. Redshift), you can set that up at this stage.

    Step-6: Data modeling in Redshift

    Once your data is stored in S3 and Redshift, the basic setup is complete. You now have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to:

    1. Join event-level data with other data sets (e.g. customer data)
    2. Aggregate event-level data into smaller data sets (e.g. sessions)
    3. Apply business logic (e.g. user segmentation)

    We call this process data modeling.
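    In practice data modeling is usually done in SQL against Redshift, but a small Python sketch makes the aggregation step concrete: rolling an event stream up into per-user sessions using an assumed 30-minute inactivity window (the users and timestamps below are invented).

```python
from datetime import datetime, timedelta

# Invented event-level data: (user id, event timestamp).
events = [
    ("user-1", datetime(2023, 1, 1, 9, 0)),
    ("user-1", datetime(2023, 1, 1, 9, 10)),
    ("user-1", datetime(2023, 1, 1, 11, 0)),  # >30 min gap: new session
    ("user-2", datetime(2023, 1, 1, 9, 5)),
]

def sessionize(events, gap=timedelta(minutes=30)):
    """Count sessions per user, starting a new session after `gap` of inactivity."""
    sessions, last_seen = {}, {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            sessions[user] = sessions.get(user, 0) + 1
        last_seen[user] = ts
    return sessions

print(sessionize(events))  # sessions per user
```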

    Step-7: Analyzing the data

    Now that your data is stored in S3, and potentially also Redshift, you are in a position to start analyzing the event stream, or data from the derived tables in Redshift if a data model has been built. As part of the setup guide, we run through the steps necessary to perform some initial analysis and plug in a couple of analytics tools to get you started.
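    A first analysis on the event stream can be as simple as counting page views per URL. The sketch below does this over a handful of invented events; against Redshift the same question would be a GROUP BY query.

```python
from collections import Counter

# Invented event-stream sample; "e" follows the tracker protocol's
# event-type field ("pv" = page view, "se" = structured event).
events = [
    {"e": "pv", "url": "/home"},
    {"e": "pv", "url": "/pricing"},
    {"e": "pv", "url": "/home"},
    {"e": "se", "url": "/home"},  # not a page view, so excluded below
]

page_views = Counter(ev["url"] for ev in events if ev["e"] == "pv")
print(page_views.most_common(1))  # most-viewed page with its count
```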


    Conclusion

    This article helps you to collect, process, store, and analyze your event data.

    Snowplow is technically designed to:

    • Give you ownership, access, and control of your own web analytics data (no lock-in)
    • Be loosely coupled and extensible, so that it is easy to add, for example, new trackers to capture data from new platforms (e.g. mobile, TV) and put the data to new uses.

    You can collect event data using collectors and webhooks, process this data using Enrich, and store it in data stores like S3. You can also model your data in Redshift and analyze it very efficiently.