Analysis of the website browsing behavior of prospects and customers can provide Marketers with a goldmine of information about the needs, wants, and preferences of their target audience. To perform this analysis, however, Marketers need to move beyond reporting to advanced analytics that leverages user-level clickstream data to build predictive models for specific business scenarios.

The technology challenge in enabling this advanced clickstream analytics lies in consolidating a very large number of browsing signals from across multiple sessions (and even devices) into consolidated, user-level profiles that can be used for modeling. Getting access to this raw data is a significant challenge with most contemporary web analytics tools: they aggregate data into predefined data models, making them better suited to aggregate campaign/channel-level reporting than to advanced business analytics. Given the sheer volume, velocity, and variety of data generated by web browsing, this technology challenge becomes an interesting use case for practical applications of big data technologies in digital marketing optimization.

In this post, we provide a conceptual overview of how tech-savvy Marketers can quickly build their own web analytics tools as a commercially viable alternative to off-the-shelf web analytics software. We outline a modular solution development approach, along with a discussion of a wide range of underlying big data technologies, which Marketers can combine to build purpose-built, cloud-hosted, and fully compliant web analytics solutions for turbo-charging their customer intelligence initiatives.

The solution outlined here is abstract and conceptual in nature, and even though we reference Amazon Cloud components in our discussion, clients are free to apply the concepts to other cloud platforms such as Azure, Rackspace, and Google Cloud. It is also entirely possible to implement all the components below in-house on internal hardware; this is the recommended approach when security and compliance obligations require tighter control of data.

Identifying the solution building blocks

In line with enterprise architecture best practices, we begin by identifying the various ‘building blocks’ that make up our conceptual solution. These building blocks are abstract technology components that define only the business functionality they are required to implement, leaving the choice of physical technologies to client-specific considerations. The five building blocks are –

Log Server – Clients (the browsers of site visitors) send data to the log server by requesting an image ‘pixel’ hosted on this server. Every time this pixel is downloaded, a log entry is created on the server, and the information contained in the request URL (built by the client-side tracker described below) can be used to assemble user-level data sets with appropriate data processing. A key technical capability of a log server is the ability to service a very large number of pixel requests from client browsers. Content Delivery Networks (CDNs) can readily provide the technical platform for these log servers; for example, developers can quickly set up Amazon CloudFront to service pixel requests without having to launch their own cluster of web servers.
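On the processing side, turning a logged pixel request back into a structured record is a one-step decode of its query string. The sketch below assumes hypothetical field names (uid, url, ev, ts); in practice these must match whatever the client-side tracker encodes.

```javascript
// Decode the query string of a logged pixel request into a hit record.
// The field names (uid, url, ev, ts) are illustrative assumptions.
function parseHit(requestUrl) {
  const query = requestUrl.split('?')[1] || '';
  // URLSearchParams handles percent-decoding of each key/value pair.
  return Object.fromEntries(new URLSearchParams(query));
}

const hit = parseHit('/pixel.gif?uid=42&url=%2Fpricing&ev=pageview&ts=1700000000');
// hit -> { uid: '42', url: '/pricing', ev: 'pageview', ts: '1700000000' }
```

A real log line (e.g. from CloudFront) would carry the request URL alongside other fields such as timestamp, client IP, and user agent, which the transformer can fold into the same record.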

Data collection engine – While the log servers provide transient storage of raw log data, the data collection engine provides a persistent storage layer where raw log entries are processed into user-level datasets. In the Amazon Cloud stack, the collection engine could be implemented using Amazon S3, which in principle provides unlimited storage capacity, meaning developers do not have to worry about the collection engine running out of physical disk space as the number of ‘website hits’ grows. Alternative implementation choices for the data collection engine could be Hadoop HDFS or Spark, but these would require technologies such as Apache Flume and Kafka (or similar variants) to transfer the data from the log server into the collection engine.
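Whatever the physical store, it helps to lay raw log batches out in a predictable, date-partitioned structure so downstream jobs can process one day at a time. The key scheme below (a `raw-hits/` prefix with Hive-style `year=/month=/day=` partitions) is an illustrative assumption, not a requirement of any particular store.

```javascript
// Build a date-partitioned object key for a raw log batch, e.g. for an
// S3 bucket or HDFS directory. Prefix and partition layout are
// illustrative assumptions.
function logBatchKey(date, batchId) {
  const pad = (n) => String(n).padStart(2, '0');
  return [
    'raw-hits',
    `year=${date.getUTCFullYear()}`,
    `month=${pad(date.getUTCMonth() + 1)}`,
    `day=${pad(date.getUTCDate())}`,
    `batch-${batchId}.log`,
  ].join('/');
}

const key = logBatchKey(new Date(Date.UTC(2024, 0, 15)), '0007');
// key -> 'raw-hits/year=2024/month=01/day=15/batch-0007.log'
```

Hive-style partitioning has the side benefit that query engines which understand it can prune whole days of data without scanning them.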

Transformer – This component performs two functions. First, it continually fetches data from the log server into the data collection engine using custom processing rules, allowing the log server to run seamlessly without exhausting disk space or memory. Second, the transformer processes the raw log data in the collection engine into final user-level data sets using custom business rules. For example, clients may define rules for how raw log data is sessionized, which log items to discard (e.g. bot traffic), and other processing rules such as those for organizing raw events into chronological order. Multiple technology options exist for implementing the transformer. For example, Apache Pig provides a scalable ETL platform that is ideal for processing files if the collection engine is implemented using HDFS. Other open-source ETL technologies such as Talend and Pentaho Kettle could be used in scenarios where the collection engine is implemented using ordinary file systems or databases.
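To make the sessionization rule concrete, the sketch below groups a user's hits into one session until a 30-minute gap of inactivity. The timeout value and the hit shape ({ uid, ts }) are assumptions for illustration; real transformer rules (bot filtering, cross-device stitching) would layer on top of this.

```javascript
// Group hits into sessions using a 30-minute inactivity timeout.
// The timeout and hit shape ({ uid, ts }) are illustrative assumptions.
const SESSION_GAP_MS = 30 * 60 * 1000;

function sessionize(hits) {
  const sessions = [];
  let current = null;
  // Order events per user, then chronologically, before grouping.
  const sorted = [...hits].sort(
    (a, b) => a.uid.localeCompare(b.uid) || a.ts - b.ts
  );
  for (const hit of sorted) {
    const startNew =
      !current ||
      current.uid !== hit.uid ||
      hit.ts - current.end > SESSION_GAP_MS;
    if (startNew) {
      current = { uid: hit.uid, start: hit.ts, end: hit.ts, hits: [] };
      sessions.push(current);
    }
    current.end = hit.ts;
    current.hits.push(hit);
  }
  return sessions;
}

const sessions = sessionize([
  { uid: 'a', ts: 0 },
  { uid: 'a', ts: 10 * 60 * 1000 }, // 10 min later -> same session
  { uid: 'a', ts: 50 * 60 * 1000 }, // 40 min gap   -> new session
]);
// sessions.length -> 2
```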

Storage engine – The data processed by the transformer is finally transferred to the storage engine, which is custom-built for specific analysis needs. RDBMS engines, data warehouses, and columnar databases are all possible technology options for implementing the storage engine. Within the Amazon stack, the RDS service can be used as a scalable RDBMS storage engine for both analytics and reporting use cases. Redshift provides options for implementing data warehouse schemas when the requirement is primarily drill-down reporting on large volumes of historical data. Other options include Amazon DynamoDB or NoSQL databases such as Cassandra and MongoDB.
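The rows landing in the storage engine are typically session records rolled up per user. As a sketch, the function below reduces sessions into user-level profile rows ready to load into an RDBMS or warehouse table; the specific metrics (session count, hit count, total duration) are illustrative assumptions, chosen to suit the modeling use case.

```javascript
// Roll per-user sessions up into user-level profile rows for the
// storage engine. Metrics chosen here are illustrative assumptions.
function buildUserProfiles(sessions) {
  const profiles = new Map();
  for (const s of sessions) {
    const p =
      profiles.get(s.uid) ||
      { uid: s.uid, sessions: 0, hits: 0, totalDurationMs: 0 };
    p.sessions += 1;
    p.hits += s.hits.length;
    p.totalDurationMs += s.end - s.start;
    profiles.set(s.uid, p);
  }
  return [...profiles.values()];
}

const profiles = buildUserProfiles([
  { uid: 'a', start: 0, end: 600000, hits: [{}, {}] },
  { uid: 'a', start: 3000000, end: 3000000, hits: [{}] },
  { uid: 'b', start: 0, end: 0, hits: [{}] },
]);
// profiles -> [ { uid: 'a', sessions: 2, hits: 3, totalDurationMs: 600000 },
//               { uid: 'b', sessions: 1, hits: 1, totalDurationMs: 0 } ]
```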

Client-side tracker – The final building block of our custom web analytics solution is the client-side tracker. This is a simple piece of JavaScript code, downloaded to every visitor’s browser, that records clickstream data. The tracker converts the clickstream data into a query string, appends it to a transparent image pixel request, and sends it to the log server. When this pixel is downloaded from the log server, a ‘hit entry’ is created on the server and is eventually transferred to the collection engine by the transformer.
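A minimal tracker of this kind fits in a few lines. The sketch below builds the pixel URL from a set of clickstream fields; the log server hostname, pixel path, and field names are all illustrative assumptions and would need to match the server-side processing.

```javascript
// Serialize clickstream fields into a pixel request URL. Endpoint,
// pixel path, and field names are illustrative assumptions.
function buildPixelUrl(logServer, fields) {
  const query = new URLSearchParams(fields).toString();
  return `${logServer}/pixel.gif?${query}`;
}

// In the browser, firing the pixel is then a single line, e.g.:
//   new Image(1, 1).src = buildPixelUrl('https://logs.example.com', {
//     uid: visitorId, url: location.pathname, ts: String(Date.now()),
//   });

const url = buildPixelUrl('https://logs.example.com', { uid: '42', url: '/pricing' });
// url -> 'https://logs.example.com/pixel.gif?uid=42&url=%2Fpricing'
```

Using an image request (rather than XHR/fetch) keeps the tracker dependency-free and sidesteps cross-origin restrictions, which is why the pixel pattern remains common for log collection.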


The conceptual building blocks described above can be easily assembled into powerful visitor analytics solutions that significantly enhance a company’s ability to analyze website behavioral data using user-level data sets. Using this modular approach, clients can build advanced big data applications that deliver bottom-line business benefits within the overall constraints of time, resource availability, security, and legal compliance.

Article purpose

A short capability statement outlining why and how Marketers can go about building in-house web analytics capabilities using common big data technologies. The main purpose is to reach Enterprise Marketers who may benefit from in-house development of a web analytics engine rather than investing substantial amounts (both capex and opex) in off-the-shelf tools.

About Client

A UK-based data analytics consultancy with niche offerings for the digital marketing industry.