The importance of applying advanced statistical modeling techniques to clickstream data seems undisputed, yet even some of the most advanced digital marketers struggle to develop a repeatable, top-down approach to web content mining. Ad-hoc use cases (typically borrowed from other verticals) are identified and implemented using basic analytics capabilities (data profiling through visual exploration, descriptive statistics etc.), with modest to no impact on the marketing bottom line and a large amount of wasted analytics expense. Senior managers (the budget holders) by and large continue to live in a past where web analytics was synonymous with Google Analytics or Omniture, and where the typical job of a web analyst was to pull reports of aggregated data of highly questionable utility.

The reason web clickstream data remains largely untapped as a source of advanced digital intelligence lies in the mindset and skills of traditional web analysts, who largely equate web analytics with reporting and/or multi-dimensional analysis. Without proper grounding in applied statistics, a large majority of the problems that are perfectly solvable using web data mining techniques are simply dismissed as impossible. Worse, such problems are often not even identified, leaving many old-school web analytics departments looking bloated and making inefficient use of their headcount budgets.

We argue that systematically operationalising web content mining techniques as part of a strategic capability requires, as a first step, the explicit identification of use cases relevant to a particular business environment. To this end, we recommend that clients follow a time-tested, patterns-based approach to identifying use case ‘instances’ before building an in-house web content mining capability. We identify six patterns in which clients can almost certainly find compelling use case instances that would not be possible to implement without advanced statistical modeling. These are:

Understanding drivers of behavioral actions – A website produces multiple pieces of marketing collateral, distributed through various marketing channels (each with its own cost and operational effort profile), to generate leads and trial account sign-ups. How can we use behavioral data (acquisition tactics, onsite content consumption, consumption via third-party sites etc.) to better understand which tactics have the most significant impact on conversions? Can we use lower-cost or less complex product/channel combinations to achieve the same results? While this ‘use case instance’ may be more relevant to a B2B marketing environment, it should not be hard to find instances of the same pattern on eCommerce or content publishing sites. From a data mining perspective, techniques such as regression analysis and decision tree modeling can be used to great effect in solutions for this pattern.
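To make the decision tree idea concrete, the sketch below ranks tactics by information gain, which is the criterion a decision tree typically uses to choose its first split. The visit records, the tactic names (webinar, whitepaper, ppc) and the `converted` flag are hypothetical toy data invented for illustration, not results from any real site.

```python
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 outcomes."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))

def information_gain(rows, feature, target="converted"):
    """Reduction in entropy of `target` after splitting rows on a binary feature."""
    all_labels = [r[target] for r in rows]
    exposed = [r[target] for r in rows if r[feature]]
    unexposed = [r[target] for r in rows if not r[feature]]
    n = len(rows)
    after_split = (len(exposed) / n) * entropy(exposed) \
        + (len(unexposed) / n) * entropy(unexposed)
    return entropy(all_labels) - after_split

def rank_tactics(rows, features, target="converted"):
    """Order tactics from most to least informative about the target."""
    return sorted(features,
                  key=lambda f: information_gain(rows, f, target),
                  reverse=True)

# Hypothetical visitor-level data: 1 = exposed to the tactic / converted.
visits = [
    {"webinar": 1, "whitepaper": 1, "ppc": 0, "converted": 1},
    {"webinar": 1, "whitepaper": 0, "ppc": 1, "converted": 1},
    {"webinar": 1, "whitepaper": 1, "ppc": 1, "converted": 1},
    {"webinar": 1, "whitepaper": 0, "ppc": 0, "converted": 1},
    {"webinar": 0, "whitepaper": 1, "ppc": 1, "converted": 0},
    {"webinar": 0, "whitepaper": 0, "ppc": 0, "converted": 0},
    {"webinar": 0, "whitepaper": 1, "ppc": 0, "converted": 0},
    {"webinar": 0, "whitepaper": 0, "ppc": 1, "converted": 0},
]

tactics = ["webinar", "whitepaper", "ppc"]
ranking = rank_tactics(visits, tactics)
for tactic in ranking:
    print(tactic, round(information_gain(visits, tactic), 3))
```

On real data the gains would rarely be this clear-cut, and a full tree (or a regression with interaction terms) would be needed to see how tactics combine. The ranking is only a first screen for which tactics deserve closer modeling.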

Establishing a cause and effect relationship – Establishing a cause and effect relationship with statistical validity is a common use case for data science techniques in web analytics. Consider the following simple scenarios:

  • Did a drastic spike in sales have anything to do with the new website design? Was it because of the launch of a new marketing campaign? If so, was this the only factor, or did another important variable account for it?
  • A new DIY furnishing tool was launched. What was the incremental impact (if any) of this tool on site engagement and traffic?
  • A product receives a lot of likes and positive reviews. Could this be a cause of an increase in conversions?

The usual approach to establishing a cause and effect relationship is split testing, but the process can be long and unwieldy given the complexity of setting up and running experiments in low-traffic environments. Split testing is also simply not possible in every scenario. In such situations, data science techniques (predominantly Bayesian statistical methods and dependency analysis) can be an effective alternative in the majority of cases, even with small sample sizes.
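As one illustration of the Bayesian approach, the sketch below compares conversion rates before and after a change using a Beta-Binomial model. It assumes uniform Beta(1, 1) priors and uses Monte Carlo sampling. The function name and the visit counts are hypothetical. The result is the probability that the new rate is higher given the data, and it does not by itself rule out confounding factors such as a campaign launched in the same period.

```python
import random

def prob_variant_beats_control(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A).

    Assumes independent Beta(1, 1) priors on both conversion rates, so each
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws

# Hypothetical low-traffic example: 8 sales from 200 visits on the old
# design versus 30 sales from 200 visits on the new design.
p_new_better = prob_variant_beats_control(8, 200, 30, 200)
print("P(new design converts better):", p_new_better)

# Identical data on both sides should give a probability close to 0.5.
p_no_difference = prob_variant_beats_control(10, 100, 10, 100)
print("P(better) when the data are identical:", p_no_difference)
```

Because the output is a probability rather than a pass/fail significance verdict, it remains usable with the small samples typical of low-traffic sites, where the posterior is simply wider.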

KPI Modeling & Forecasting – Generating models to predict the values of important KPIs from behavioral (individual-level) or aggregate data is another common pattern of data mining application. Data mining allows us not just to predict values but also to assess the relative importance of each input variable in a regression equation. Take the example of predicting the conversion rate for a particular high-value subscription product on an eCommerce site. A good web analytics tool should be able to produce simple trend lines that allow values to be forecast at a future point in time. What is almost certainly missing is insight into how the trend line was arrived at, and the ability to assess the incremental contribution of each input variable. So, to get a unit increase in conversion rate, what are we best off modifying, and by how much? Acquisition spend on PPC? Retargeting spend? Time spent on site? A promotional offer? As a marketer, my obvious preference would be the tactic that delivers the increase at the lowest cost. Which one should I choose? A properly constructed regression analysis gives marketers far more flexibility in improving such KPIs.
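The question of what to modify, and by how much, can be sketched with a small multiple regression. The example below fits ordinary least squares via the normal equations in plain Python. The weekly figures are hypothetical and were constructed so that the true coefficients are known (conversion rate = 1.0 + 0.2 × PPC spend + 0.5 × retargeting spend). Real data would be noisy and would call for diagnostics that this sketch omits.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows without an intercept column; the function adds one.
    Returns [intercept, coef_1, coef_2, ...].
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("inputs are collinear; coefficients are not unique")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Hypothetical weekly data: (PPC spend, retargeting spend) in thousands,
# and the conversion rate in percent.
spend = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 4)]
conversion_rate = [1.7, 1.9, 2.6, 3.3, 4.5, 4.2]

intercept, ppc_coef, retargeting_coef = fit_ols(spend, conversion_rate)

# Extra spend (in thousands) needed for a one-point rise in conversion rate,
# holding the other input fixed.
spend_needed = {"ppc": 1.0 / ppc_coef, "retargeting": 1.0 / retargeting_coef}
print(intercept, ppc_coef, retargeting_coef)
print(spend_needed)
```

In this toy data, retargeting is the cheaper lever because it needs less additional spend per point of conversion rate. The coefficients describe the relationship within the observed range of spend only and should not be extrapolated far beyond it.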

Segment profiling – Consider an eCommerce business that sells ‘product bundles’, or combinations of core and upsell products, as part of the same transaction (e.g. selling a car rental as an upsell alongside an air ticket as the core product). Some customers show exceptionally high upsell take rates compared to others. What do we know about these customers? Particular data points of interest include:

  • Compared to other segments, were they exposed to specific media campaigns?
  • Were they first-time purchasers?
  • If they had been to the site earlier, did email marketing play any role in their purchase choice?
  • If they were first-time purchasers but repeat visitors, how long on average (and how many visits) did it take to reach a sale?
  • Which product bundle configurations (combinations of core/upsell products) produce the highest average order value from this segment? In other words, is the purchase behavior of this segment the same across different core/upsell combinations?
  • Do such visitors already intend to buy, or does something on the website, or some other client-controlled marketing stimulus, prompt them to do so?

It is certainly true that, of all the data science patterns, profiling is the one a good web analyst can most effectively implement without any statistical modeling techniques. However, the time and skill required to do this level of analysis manually, with statistical validity, for multiple visitor segments and on a recurring basis, prevent the majority of web analytics departments from undertaking such ambitious projects in-house.
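As an example of what ‘with statistical validity’ can mean for a single profiling question, the sketch below runs a two-proportion z-test on whether the high-upsell segment contains a different share of first-time purchasers than everyone else. The counts are hypothetical, and the test is one reasonable choice among several. It assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns (z statistic, p-value) using the pooled standard error.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 120 of 400 high-upsell customers were first-time
# purchasers, against 450 of 2,600 customers in all other segments.
z_stat, p_val = two_proportion_z_test(120, 400, 450, 2600)
print("z =", round(z_stat, 2), "p-value =", p_val)
```

Repeating such a test across every attribute and every segment is exactly the manual workload described above. Running many tests also calls for a multiple-comparison correction, which this sketch leaves out.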

Automated segment identification – Assigning website sessions to one of several pre-defined journey types is a regular feature of advanced web analytics implementations and is critical for identifying meaningful optimization opportunities. If a business banking user comes to the site to set up a new vendor recipient, it hardly makes sense to judge that visit by whether a business account was opened. The trick here is to use clickstream data to assign a journey type to each visit. From a data mining perspective, this problem is handled using classification techniques: given a pre-defined profile of each journey type, every new session is assigned a ‘class’, or journey type, using behavioral signals from the current session.
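One simple way to implement this is a nearest-centroid classifier, sketched below: the ‘profile’ of each journey type is the average feature vector of sessions already labelled with that type, and a new session takes the type of the closest profile. The three features, the journey names and the labelled sessions are all hypothetical. In practice, features on very different scales would need to be normalized first.

```python
from math import dist

def build_centroids(labelled_sessions):
    """Average the feature vectors of each journey type to form its profile."""
    sums, counts = {}, {}
    for features, journey in labelled_sessions:
        acc = sums.setdefault(journey, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[journey] = counts.get(journey, 0) + 1
    return {j: [v / counts[j] for v in acc] for j, acc in sums.items()}

def classify(session, centroids):
    """Assign the journey type whose profile is closest (Euclidean distance)."""
    return min(centroids, key=lambda j: dist(session, centroids[j]))

# Hypothetical features per session:
# (product page views, payment page views, logged in: 1 or 0)
labelled = [
    ([5, 0, 0], "account_opening"),
    ([4, 0, 0], "account_opening"),
    ([6, 1, 0], "account_opening"),
    ([0, 3, 1], "account_servicing"),
    ([1, 4, 1], "account_servicing"),
    ([0, 5, 1], "account_servicing"),
]

profiles = build_centroids(labelled)

# A logged-in visitor who only touched payment pages, e.g. adding a recipient.
print(classify([0, 2, 1], profiles))
# An anonymous visitor browsing product pages.
print(classify([3, 0, 0], profiles))
```

Once each session carries a journey type, its outcome can be judged against the goal that fits that journey rather than against a single site-wide conversion.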

Personalization – Of all the data mining patterns applicable to web analytics, this is probably the only one for which a large number of vendors provide out-of-the-box solutions that can be quickly customized for specific client scenarios. Yet numerous examples exist where clients feel that the total cost of ownership of implementing and customizing a third-party product would be far higher than that of building a solution in-house. Various machine learning techniques exist that can be implemented using pre-built frameworks such as Apache Mahout. From a data mining perspective, the core technique behind this use case is association analysis, which identifies affinities between data items (i.e., items or events that frequently occur together).
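The affinity idea can be illustrated with the three standard association measures: support, confidence and lift. The sketch below computes them for item pairs in plain Python rather than through Mahout. The baskets and item names are hypothetical, and a production system would use a scalable algorithm in place of this exhaustive pair count.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.2):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs.

    support    = share of baskets containing both items
    confidence = share of baskets with lhs that also contain rhs
    lift       = confidence divided by the overall share of baskets with rhs
    """
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            lift = confidence / (item_counts[rhs] / n)
            rules.append((lhs, rhs, support, confidence, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Hypothetical transactions from a travel site.
baskets = [
    {"flight", "car_rental"},
    {"flight", "car_rental", "insurance"},
    {"flight", "hotel"},
    {"hotel", "insurance"},
    {"flight", "car_rental"},
]

rules = pair_rules(baskets, min_support=0.4)
for lhs, rhs, support, confidence, lift in rules:
    print(lhs, "->", rhs, round(support, 2), round(confidence, 2), round(lift, 2))
```

A lift above 1 indicates that the two items occur together more often than they would if they were independent, which is the signal a recommendation such as ‘customers who booked this also added that’ relies on.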


Identifying specific use cases to address within these high-level patterns should be the first step in building a web data mining capability. With a full view of the business challenges to be addressed, clients can then look at the people (statistics, data engineering, domain skills), process (roles and responsibilities) and technology (big data, statistical modeling tools etc.) changes that need to be put in place for an enterprise-wide roll-out of this strategic capability.

Article purpose

A short-form, ‘best-practices’ themed article setting out a unique approach to identifying statistical data analysis use cases for clickstream data.

About the Client

A UK-based boutique data analysis consultancy focused on providing marketing analytics and insights services for the eCommerce and retail sectors.