Snowplow is an open-source event data collection platform for teams that manage the collection and warehousing of data (e.g. data teams) across all of their channels and platforms, in real time, via a data pipeline. Snowplow can collect event data or raw clickstream data from the front end of an application that has been instrumented with trackers and webhooks, and it tracks and records those events on cloud infrastructure.
Snowplow Analytics provides an event analytics platform that enables clients to collect granular, customer-level and event-level data from multiple platforms, including mobile and web, and to load that data into structured data stores (e.g. Redshift) to support advanced analytics.
Follow the steps below to set up a data pipeline for Snowplow Analytics, starting with the collector.
# Install Java, Scala, and sbt (build prerequisites for the collector)
$ apt-get update && apt-get install default-jdk
$ wget https://www.scala-lang.org/files/archive/scala-2.11.8.deb
$ dpkg -i scala-2.11.8.deb
$ scala -version
$ echo "deb https://dl.bintray.com/sbt/debian /" | tee -a /etc/apt/sources.list.d/sbt.list
$ apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
$ apt-get -y update && apt-get install -y sbt
# Clone the collector source and build the jar for your targeted platform
# (e.g. kinesis, kafka, pubsub, or stdout)
$ git clone https://github.com/snowplow/stream-collector.git
$ sbt "project [targeted platform]" assembly
# Run the collector with your configuration file and check that it responds as healthy
$ java -jar snowplow-stream-collector-[targeted platform]-[version].jar --config [config-file].conf
$ curl [server-ip-address]/health
Snowplow allows you to collect events via the webhooks of supported third-party software.
Webhooks allow this third-party software to send its own internal event streams to Snowplow collectors to be captured. Webhooks are sometimes referred to as “streaming APIs” or “HTTP response APIs”.
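Concretely, each supported webhook integration maps to a vendor-specific path on your collector, and you configure the third-party service to send its events to that path. As an illustrative sketch only (the adapter paths available depend on your Snowplow version, and the schema and parameters below are made up for the example), Snowplow's generic Iglu webhook adapter accepts events like this:
# Sketch: send a custom event to the collector's Iglu webhook adapter path
$ curl "http://[server-ip-address]/com.snowplowanalytics.iglu/v1?schema=iglu:com.acme/checkout/jsonschema/1-0-0&sku=123&price=20"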
Note: once you have set up a collector and a tracker or webhook, you can pause and complete the remaining setup steps later, because your data is already being generated and logged. When you eventually set up Enrich, you will be able to process all the data you have logged since setup.
Set up your first tracker, e.g. the PHP Tracker, Python Tracker, Java Tracker, etc.
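Whichever tracker you choose, it ultimately sends HTTP requests that follow the Snowplow tracker protocol to your collector. As a quick sanity check before wiring up a full tracker SDK, you can emulate a page-view hit with curl; this is only a sketch, and the exact parameter set should be checked against the tracker protocol documentation for your collector version:
# Sketch: emulate a page-view hit against the collector's pixel endpoint
$ curl "http://[server-ip-address]/i?e=pv&url=http%3A%2F%2Fexample.com%2F&page=Homepage&aid=my-app&p=web&tv=curl-0.1"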
The Snowplow enrichment process takes the raw events emitted by a collector, validates them, and enriches them with additional data points (e.g. geolocation from the IP address, useragent parsing). Once you have set up Enrich, cleaning and enriching the raw data generated by the collector is automated.
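For the streaming pipeline, Enrich ships as a runnable jar similar to the collector. The command below is a minimal sketch of running Stream Enrich; the asset name and flags vary between releases, so treat them as assumptions and consult --help for your version:
# Sketch: run Stream Enrich against the collector's raw stream
$ java -jar snowplow-stream-enrich-[targeted platform]-[version].jar --config [enrich-config].conf --resolver file:[resolver].json --enrichments file:[enrichments-directory]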
Most Snowplow users store their web event data in at least two places: S3 for processing in Spark (e.g. to enable machine learning via MLlib) and a database (e.g. Redshift) for more traditional OLAP analysis.
The RDB Loader is an EMR step that regularly transfers data from S3 into other databases, e.g. Redshift. If you only wish to process your data using Spark on EMR, you do not need to set up the RDB Loader. However, if you would find it convenient to have your data in another data store such as Redshift, you can set that up at this stage.
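In the batch pipeline this step is typically orchestrated by EmrEtlRunner, which adds the shred and load steps to the EMR job and points the RDB Loader at your Redshift target. The command below is only a sketch, assuming EmrEtlRunner with a pipeline config file, an Iglu resolver, and a directory of storage target definitions; flag names may differ between versions:
# Sketch: run the batch pipeline, including the RDB load into Redshift
$ ./snowplow-emr-etl-runner run --config config.yml --resolver resolver.json --targets targets/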
Once your data is stored in S3 and Redshift, the basic setup is complete. You now have access to the event stream: a long list of packets of data, where each packet represents a single event. While it is possible to do analysis directly on this event stream, it is common to first aggregate the event-level data into smaller, easier-to-query derived tables (e.g. sessions or users). We call this process data modeling.
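As a minimal sketch of a data modeling step, assuming the standard atomic.events table in Redshift and a pre-created derived schema, you could roll page views up into a daily summary table:
# Sketch: build a simple derived table from the event stream
$ psql -h [redshift-endpoint] -U [user] -d snowplow -c "CREATE TABLE derived.page_views_by_day AS SELECT DATE_TRUNC('day', collector_tstamp) AS day, COUNT(*) AS page_views FROM atomic.events WHERE event = 'page_view' GROUP BY 1;"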
Now that your data is stored in S3, and potentially also in Redshift, you are in a position to start analyzing the event stream, or the derived tables in Redshift if a data model has been built. As part of the setup guide, we run through the steps necessary to perform some initial analysis and plug in a couple of analytics tools to get you started.
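For example, a first query against the event stream might count unique visitors per application over the last week (again a sketch, assuming the standard atomic.events columns):
# Sketch: query the event stream directly in Redshift
$ psql -h [redshift-endpoint] -U [user] -d snowplow -c "SELECT app_id, COUNT(DISTINCT domain_userid) AS unique_visitors FROM atomic.events WHERE collector_tstamp > CURRENT_DATE - 7 GROUP BY 1 ORDER BY 2 DESC;"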
This article has walked you through collecting, processing, storing, and analyzing your event data. In summary, Snowplow is designed so that you can collect event data using collectors, trackers, and webhooks, process that data using Enrich, and store it in data stores like S3. You can then model your data in Redshift and analyze it very efficiently.