Snowplow is an open-source event data collection platform for teams that manage the collection and warehousing of data (e.g. data teams) across all of their channels and platforms, in real time via a data pipeline. Snowplow can collect event data or raw clickstream data from the front end of an application, which is instrumented with trackers and webhooks, and it can track and record events at scale on cloud infrastructure.
Snowplow Analytics provides an event analytics platform that enables clients to collect granular, customer-level and event-level data from multiple platforms, including mobile and web, and load that data into structured data stores (e.g. Redshift) to support advanced analytics.
The Snowplow data pipeline uses a modular architecture that lets you choose which parts to implement. The central object is the event, which is not limited to clickstream data or web-page views; it can be pretty much anything. Every event passes through multiple stages of the pipeline.
The architecture below shows every component that can be used to build the Snowplow data pipeline:
The components used to create the Snowplow Analytics data pipeline are explained below:
Follow the steps below to set up a data pipeline for Snowplow Analytics.
Create a free AWS account that gives you access to the free tier of services, some of which will help keep costs down for this pipeline too.
After this, the first step in creating the pipeline is to create an Identity and Access Management (IAM) user and attach a policy like the following:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "acm:*",
        "autoscaling:*",
        "aws-marketplace:ViewSubscriptions",
        "aws-marketplace:Subscribe",
        "aws-marketplace:Unsubscribe",
        "cloudformation:*",
        "cloudfront:*",
        "cloudwatch:*",
        "ec2:*",
        "elasticbeanstalk:*",
        "elasticloadbalancing:*",
        "elasticmapreduce:*",
        "es:*",
        "iam:*",
        "rds:*",
        "redshift:*",
        "s3:*",
        "sns:*"
      ],
      "Resource": "*"
    }
  ]
}
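If you prefer the AWS CLI, a user with this policy can be created along the following lines. This is just a sketch: the user name, policy name, and the snowplow-policy.json file name are illustrative, not part of the official setup.
$ aws iam create-user --user-name snowplow-setup
$ aws iam put-user-policy --user-name snowplow-setup --policy-name snowplow-setup-policy --policy-document file://snowplow-policy.json
$ aws iam create-access-key --user-name snowplow-setup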
To install and run the collector, you need to launch an EC2 instance. Once the instance is running, install Java and a Git client on it to proceed with the collector installation.
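For example, on an Ubuntu or Debian based instance (an assumption; use the appropriate package manager for other AMIs), the installation could look like this:
$ sudo apt-get update
$ sudo apt-get install -y default-jdk git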
There are three types of collectors, listed below:
As the other two are deprecated, we will install the Scala Stream Collector.
Install all the dependencies
There are two ways of installing the Scala Stream Collector:
We will use the first way, i.e. using Java, which requires the collector jar file. You can either download the jar file directly from this URL or compile it from source.
Alternatively, you can build it from the source files. To do so, you need to install Scala and sbt; follow the steps below:
# Install the JDK
$ apt-get update && apt-get install default-jdk
# Install Scala and verify the installation
$ wget www.scala-lang.org/files/archive/scala-2.11.8.deb
$ dpkg -i scala-2.11.8.deb
$ scala -version
# Add the sbt repository and install sbt
$ echo "deb https://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list
$ apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ apt-get -y update && apt-get install -y sbt
# Clone the collector repository and build the jar for the targeted platform
$ git clone https://github.com/snowplow/stream-collector.git
$ cd stream-collector
$ sbt "project *targeted platform*" assembly
where the targeted platform can be kinesis, google-pubsub, kafka, nsq, or stdout. I have set up the collector on AWS infrastructure, and the targeted platform is stdout in my case.
The jar file will be saved as snowplow-stream-collector-[targeted platform]-[version].jar in the [targeted platform]/target/scala-2.11 subdirectory; it is now ready to be deployed.
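For instance, with stdout as the targeted platform, the build command would roughly be (assuming the sbt sub-project is named after the platform, as above):
$ sbt "project stdout" assembly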
Download the template configuration file config.hocon.sample from GitHub and create a .conf file from it. Follow this link to configure the different options available for the Scala Stream Collector.
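As a minimal sketch (the exact keys can vary between collector versions, so treat the sample file and the documentation as the reference), the essential settings are the listening interface and port:
$ cp config.hocon.sample collector.conf
# In collector.conf, set at least the HTTP interface and port, for example:
#   collector.interface = "0.0.0.0"
#   collector.port = 8080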
$ java -jar snowplow-stream-collector-[targeted platform]-[version].jar --config [config-file].conf
Now the collector should be running and listening for events on the public IP address of the EC2 instance, on the port configured above. Pinging the collector on the /health path should return a 200 OK response:
$ curl http://[server-ip-address]:[port]/health
The Snowplow collector receives data from Snowplow trackers, which generate event data and send it to the collector to be captured. The most popular Snowplow tracker to date is the JavaScript Tracker, which is integrated into websites (either directly or via a tag management solution) in the same way as any other web analytics tracker (e.g. Google Analytics or Omniture tags).
Snowplow allows you to collect events via the webhooks of supported third-party software.
Webhooks allow this third-party software to send their own internal event streams to Snowplow collectors to be captured. Webhooks are sometimes referred to as “streaming APIs” or “HTTP response APIs”.
Note: once you have set up a collector and a tracker or webhook, you can pause and perform the remainder of the setup steps later, because your data is already being generated and logged. When you eventually set up Enrich, you will be able to process all the data you have logged since then.
Set up your first tracker, e.g. the PHP Tracker, Python Tracker, Java Tracker, etc.
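Before wiring up a full tracker, you can sanity-check the collector with a hand-rolled request. The sketch below uses the Snowplow tracker protocol's GET pixel endpoint (/i); the app ID and URL values are made up for illustration:
$ curl -i "http://[server-ip-address]:[port]/i?e=pv&p=web&tv=curl-0.1&aid=test-app&url=https%3A%2F%2Fexample.com%2Fhome"
A 200 response indicates that the collector accepted the page-view event.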
The Snowplow enrichment process takes raw events from a collector, validates them, and enriches them with additional data points (e.g. geolocation and user-agent details) before they are written out for storage.
Once you have set up Enrich, the process of taking the raw data generated by the collector and cleaning and enriching it is automated.
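As an illustrative sketch for the streaming setup (the exact jar name and options depend on your Enrich version and targeted platform, so check the Snowplow documentation for your release), Stream Enrich is typically launched with a configuration file, an Iglu resolver, and an enrichments directory:
$ java -jar snowplow-stream-enrich-[targeted platform]-[version].jar --config enrich.conf --resolver file:iglu_resolver.json --enrichments file:enrichments/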
Most Snowplow users store their web event data in at least two places: S3 for processing in Spark (e.g. to enable machine learning via MLlib) and a database (e.g. Redshift) for more traditional OLAP analysis.
The RDB Loader is an EMR step that regularly transfers data from S3 into other databases, e.g. Redshift. If you only wish to process your data using Spark on EMR, you do not need to set up the RDB Loader. However, if you would find it convenient to have your data in another data store (e.g. Redshift), you can set this up at this stage.
Once your data is stored in S3 and Redshift, the basic setup is complete. You now have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to first aggregate the event-level data into smaller, easier-to-query derived tables (e.g. per user, session, or page). We call this process data modeling.
Now that data is stored in S3, and potentially also Redshift, you are in a position to start analyzing the event stream, or the data in the derived tables in Redshift if a data model has been built. As part of the setup guide, we run through the steps necessary to perform some initial analysis and plug in a couple of analytics tools to get you started.
This article helps you to collect, process, store and analyze your event data.
Snowplow is technically designed so that you can collect event data using collectors and webhooks, process this data using Enrich, and store it in data stores like S3. You can also load and model your data in Redshift and analyze it very efficiently.