What were we trying to achieve?

We needed a simple, scalable way to get data from the front end of a client’s website, in to their data warehouse with minimal lag, without any extra infrastructure and using as little developer time as possible. This article outlines the approach we took and hopefully it’ll help you avoid a few of the dead ends we came up against. This isn’t a user manual for the AWS services we used, I’ve assumed a basic familiarity with AWS and Google Tag Manager.

Google Tag Manager makes it incredibly easy to set up custom event tracking in Google Analytics. You can pull any data you like from the DOM, a data layer or global JavaScript and send it to Google Analytics as part of an event, custom variable or dimension.

It turns out that it’s not much more difficult to push data to Amazon Redshift using Tag Manager. From there you can do whatever custom reporting or data analytics you like with your favourite data visualisation or data mining tools.

Advantages of this approach

Before I get to how, it’s worth pointing out the advantages of this approach compared to logging to a database on your server and then using a traditional ETL process to get the data to Redshift.

  • It can be implemented without developer support.
  • It decouples your data capture from your code base and infrastructure
  • You can get much closer to real time data.

Google Tag Manager Steps

The outline of how this works is pretty simple. First, grab whatever data is required from the DOM, data layer or global JavaScript using GTM and fire it as a JSON string to Amazon Kinesis Firehose.

Firehose buffers the data for us (up to 5000 objects per second) and when the buffer is full or it reaches a preset time, it dumps the data to a file in S3 and runs the copy command to get it in to Redshift. This neatly sidesteps the painfully slow INSERT/UPDATE speed of Redshift.

I’m not going to go into detail about setting up Redshift clusters or Firehose streams, you’ll have to RTFM here and here. It’s well worth going through the stock ticker example and trying out the Firehose/Redshift combination before you try with your own data. This is all pretty straight forward, the hardest part is fighting the be-tentacled monster that is Amazon’s IAM permissions system.

Make sure you do all your config and set up in the same AWS region. You can theoretically get the different services to work cross region but you’ll have far fewer headaches if you keep to a single region.

In order to use the AWS JavaScript SDK you’ll need to set up a federated identity pool in AWS Cognito, this allows you to authenticate with the SDK and its services without exposing your AWS credentials. Details are here. Once you’ve set up the identity pool, you can select the type of SDK you want to use, JavasSript in this case, and Cognito will generate a snippet to load and authenticate the SDK.

You’ll need two tags in GTM, one to initiate the Firehose service and one to send the data. The first tag loads the SDK, authenticates it and creates an instance of the Firehose service. You can load this tag site wide or just on pages you want to capture events on using a trigger. The first tag needs to load before you can push any data, so if you’re looking to send data on page load then you can use tag firing priority to make sure it loads first.

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.154.0.min.js"> //load the DSK</script>
<script>
  // Initialize the Amazon Cognito credentials provider and authenticate
AWS.config.region = 'eu-west-1'; // Region
AWS.config.credentials = new AWS.CognitoIdentityCredentials({
    IdentityPoolId: 'eu-west-1:yourawspoolid',
});

var firehose = new AWS.Firehose(); //initialise the Firehose service
</script>

The second tag sends the data using the putRecord , it will look a bit like this:

<script
var params = {
  DeliveryStreamName: 'your_firehose_stream', 
  Record: {
    Data: JSON.stringify({
	"partner_name":"{{partner}}",
	"click_date":"{{datestamp}}",
	"provider_name":"{{provider}}",
	"provider_id":{{provider_id}},
	"product_name":"{{product}}",
	"product_id":{{product_id}},
	"product _cost":{{price}}
	})
  }
};
firehose.putRecord(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else     console.log(data);           // successful response
});
</script>

Change the JSON string to whatever you want. The variable names should match your column names in Redshift so that the automatic JSON option in the COPY command can put everything in the right place.

I found it very useful to make a version of this script with some static dummy data in it and run it in the console to debug. If Firehose is receiving the data you’ll get a record ID logged to console; if not you’ll get a surprisingly useful error message which helped me get the IAM permissions right.

When you put your own JavaScript into Tag Manager, you can use the double curly brace notation above to use Tag Manager variables. If you’re already sending custom data elsewhere you can re-use variables, or create new ones from DOM objects or data layer.

You’ll probably want a timestamp on your data, Tag Manager doesn’t provide a timestamp tool but this bit of JavaScript by Simo Ahava does the job nicely.

<script
function() {
// Get local time as ISO string with offset at the end
var now = new Date();
var tzo = -now.getTimezoneOffset();
var dif = tzo >= 0 ? '+' : '-';
var pad = function(num) {
var norm = Math.abs(Math.floor(num));
return (norm < 10 ? '0' : '') + norm;
};
return now.getFullYear()
+ '-' + pad(now.getMonth()+1)
+ '-' + pad(now.getDate())
+ ' ' + pad(now.getHours())
+ ':' + pad(now.getMinutes())
+ ':' + pad(now.getSeconds())
// + '.' + pad(now.getMilliseconds())
//+ dif + pad(tzo / 60)
// + ':' + pad(tzo % 60);
}
</script>

Just create a new custom JavaScript Tag Manager variable called timestamp and paste the code above. You’ll see that I’ve commented out a couple of lines so that the date comes out in the default format for Redshift.

Once you’ve got data going into Redshift, you can start using it for whatever reports you like in your choice of data vis, analytics or data mining software. Google Data Studio works with Redshift now and lets you throw dashboards together in no time, it’s free too. If you want a bit more power and slickness in the reports then Tableau is a good bet. Redshift works really well with most BI and data vis tools.

Need something like this doing but don’t want to do it yourself? Give us a shout.