Leveraging Web Data Mining to Improve Retargeting ROI

Using behavioral signals to estimate the likelihood of website visitors performing certain actions of interest is a classic use case for web data mining. This technique is used in multiple scenarios including customer contact optimization, content personalization, promotions optimization, demand forecasting, channel attribution, and many more. This article discusses how visitor-level clickstream data mining techniques can be used to optimize retargeting campaigns, which form the staple of most direct response marketing strategies.

Business challenge

Retargeting is a much sought after technique that (apparently) offers significantly higher conversion rates than regular prospecting campaigns. The idea is simple. The fact that someone has already been to a company’s website and demonstrated purchase intent by performing certain actions (e.g. adding a product to cart) is taken as a proxy for an ‘in-market’ visitor. Using specialist retargeting tools, companies can then retarget these users on third-party sites using personalized creatives.

Notwithstanding the simplicity of this technique, critics argue (and rightly so) that claimed benefits do not take into account the fact that many of the users who convert from retargeting would have eventually converted anyway. While the criticism is true to a large extent, the fault lies in how retargeting campaigns are typically executed and not in the technology itself.

To identify the incremental lift from a retargeting campaign, Marketers typically implement split tests by randomly selecting a control group that is shown no ads and a default group whose users do see retargeting ads. The incremental conversion rate/ROI of the default group over the control group is then taken to represent the lift from retargeting.

While this is certainly an improvement over blindly accepting retargeting conversion rates as absolute truth, there is plenty of room for further optimization. For example, simple logic reveals that split test results will inevitably be muddled unless high probability converters are removed from both the default and control groups. Yet, the analytics sophistication required for isolating the test audience means that only a select few Marketers are able to truly optimize retargeting spend and get accurate estimates of incremental lift. In large campaigns with significant display advertising allocations, this can easily translate to thousands of dollars in wasted media spend.

The data mining challenge then is to identify and isolate visitors who already demonstrate high conversion propensity. The sections below provide a brief overview of how this can be achieved by applying predictive analytics techniques on clickstream signals.

Data mining approach

The sections below provide a conceptual overview of the various steps involved in addressing the challenge outlined above. The actual methodology that we use for deploying data mining solutions in client environments is fairly prescriptive and structured (e.g. CRISP-DM), but hopefully the discussion below will give readers an overall flavor of the scaled-up version.

Step 1-Clarify the business objective

While the business challenge outlined above certainly provides a good context for analysis, it is not specific enough in terms of expected outcomes. As a best practice, the first step we undertake in any analytics assignment is to develop an explicit consensus on expected business outcomes. For this example, it was decided that we needed to establish the true cost of user acquisition from retargeting after accounting for users who were likely to convert on their own without any retargeting stimulus. This would help the Marketing team decide whether alternative channels might be better suited to meeting acquisition targets from an ROI perspective.

Step 2-Identify the data mining technique

The technique deployed for modeling has a huge impact on the overall data strategy for any data mining project. Some techniques require only numeric variables, some work poorly on small sample sizes, some provide highly variable estimates in scenarios where many independent variables cannot be realistically captured and so on. Deciding on the technique upfront provides an opportunity to make early assessments of data complexity and avoid failed lift-offs.

For this example, it was decided to use a special form of logistic regression, a common technique in clickstream analytics for estimating the probability of a user converting (the dependent variable) as a function of various independent variables. ROC curves for the resulting models can then be used to select thresholds for converting continuous regression outputs into binary flags (likely to convert or not). Variations of this technique such as bagged logistic regression are better suited to scenarios that involve a large number of unknown independent variables causing wild fluctuations in model predictions. This applies especially to retargeting, since a user typically engages with multiple channels (web, mobile, offline ads) before making a purchase decision. In the absence of a holistic set of independent variables representing all of these diverse touchpoints (we focus solely on web interactions from a single browser), a bagged logistic regression technique was deployed for estimating conversion probabilities. Effectively, this meant dividing the training dataset into multiple partitions and building a logistic regression model for each; the final model then used an average of the coefficients from the individual models. This is a fairly common technique for building stable models whose predictions do not fluctuate wildly across varying samples.
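The coefficient-averaging idea can be sketched as follows. This is a minimal, illustrative Python example (the actual client work used R packages); the function names, toy data, and gradient descent fitting routine are all hypothetical stand-ins for a production implementation.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a simple logistic regression via per-sample gradient descent.
    Returns a coefficient vector with the intercept first."""
    w = [0.0] * (len(X[0]) + 1)           # w[0] is the intercept
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = yi - p
            w[0] += lr * err
            for j, xj in enumerate(xi):
                w[j + 1] += lr * err * xj
    return w

def bagged_logistic(X, y, n_partitions=5, seed=42):
    """Split the training set into partitions, fit one logistic model
    per partition, and average the coefficients (bagging)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::n_partitions] for i in range(n_partitions)]
    models = [fit_logistic([X[i] for i in p], [y[i] for i in p])
              for p in parts]
    # Final model: element-wise mean of the partition coefficients
    return [sum(ws) / len(ws) for ws in zip(*models)]

def predict_proba(w, xi):
    """Conversion probability for one visitor's feature vector."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: conversion likelihood rises with an engagement score
X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]] * 10
y = [0, 0, 0, 1, 1, 1] * 10
w = bagged_logistic(X, y)
```

Because the final coefficients are an average over partitions, a single unusual sample influences only one of the component models, which is what damps the prediction variance across samples.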

Step 3-Model conceptual design

While Step 2 identified the statistical modeling technique/algorithms to use for analysis, the model design step involves a secondary set of activities (all of which depend upon the modeling technique chosen) including (but not limited to)

  • Define conversion-This must be done in technical terms and at the data element level. A conversion is any event of significance and does not necessarily have to be a financial transaction. In the example above (for a financial services client), we defined conversion as a free trial signup.
  • Identify non-converters-When doing regression, it is important to include non-converters in the training dataset in order to reduce prediction bias and provide grounds for comparison. Non-converters were identified as all visitors who failed to convert within 30 days of first arriving on site.
  • Identify evaluation window-For this example, the evaluation window was fixed at 30 days from the end of a campaign in order to capture data from visitors who first visited towards the end of a campaign
  • Develop test plan-It was decided to test the models using cookies with no prior exposure to any form of retargeting. An overall test sample (A) representing about 100,000 unique cookies (containing both converters and non-converters) was identified for the purpose.
  • Number of models-A standard practice in our modeling work is to develop at least 2 models when working with classifiers in scenarios that involve diverse input variables. Since the proportion of converters to non-converters in a live environment can have large variations, it was decided to build 2 models using different target proportions.
  • Identify partitioning strategy-Since both converters and non-converters are required for building the model, it was decided to work with two datasets (A1, A2)-one containing a 1:4 ratio of converters to non-converters (A1) and the other using a ratio of 1:10 (A2). A1 and A2 would each be further divided (A11, A12…A15, A21, A22…A25) into smaller datasets for building a total of 10 models (5 in each set), ensuring that the target proportions remained the same in each dataset. 2 bagged regression models would be developed-one each for A1 and A2.
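The partitioning described above can be sketched as follows. This is an illustrative Python sketch under assumed cookie counts; the helper names and the round-robin stratification are hypothetical, not the exact procedure used in the project.

```python
import random

def build_ratio_dataset(converters, non_converters, ratio, seed=0):
    """Downsample non-converters so the dataset has a converters to
    non-converters ratio of 1:ratio (e.g. ratio=4 yields the A1 set)."""
    rng = random.Random(seed)
    n_non = min(len(non_converters), len(converters) * ratio)
    return converters, rng.sample(non_converters, n_non)

def stratified_partitions(converters, non_converters, k=5):
    """Split each class round-robin into k partitions so that every
    partition keeps the same converter:non-converter proportion."""
    return [(converters[i::k], non_converters[i::k]) for i in range(k)]

conv = [f"c{i}" for i in range(100)]      # 100 converter cookies (assumed)
non = [f"n{i}" for i in range(2000)]      # 2000 non-converter cookies (assumed)

a1_conv, a1_non = build_ratio_dataset(conv, non, ratio=4)   # 1:4 -> A1
parts = stratified_partitions(a1_conv, a1_non, k=5)         # A11..A15
```

The same two calls with `ratio=10` would produce A2 and its five partitions A21..A25.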

Step 4-Independent variable selection

With the modeling technique decided and the conceptual design laid out, the next step is to identify the independent variables for our model.

Variable selection is the single most important factor affecting the quality of a model. Various factors come into play when selecting variables, including bivariate collinearity, correlation with the dependent variable, frequency distributions, missing values, technical capabilities of the web analytics tool used, expertise in clickstream ETL and so on. We will skip the exact method used to identify variables for this particular example; in summary, the following initial set of variables was deployed for model building:

  • Recency-A derived metric representing the time lag between a visitor's last visit before the conversion event and the conversion event itself. For non-converters, this is simply the lag between the last visit and the end of the evaluation window.
  • Frequency-A categorical variable (derived from individual visit timestamps) capturing how frequently the visitor visited the site during the evaluation window.
  • Visit value-An index defined to score each visit based on actions performed during the visit. For example, a visit where a user downloaded a platform features guide was given a higher index than a visit with only one pageview.
  • New vs. returning-An indicator showing whether the visitor was new or returning (based on an all-encompassing time frame and not just the evaluation period).
  • Engagement index-A composite index that included interaction data (pageviews, time spent metrics, interaction with educational videos) for key sections of the site deemed relevant to the target campaign audience.
  • Video interactions-The client had made significant investments in the production of educational videos covering topics ranging from investment capital risk management and risk/reward assessment to investment psychology and technical platform usage. Because of the financial costs involved in their production, it was decided to include a video interactions index as a free-standing variable rather than clubbing it with the Engagement index. The visit-level index was a derived metric ranking a visit on a scale of 1-10 based on analytics data; custom action script code was deployed to record events that tied the interaction data to each visitor ID at the cookie level. The visitor-level index was then calculated using a weighted average method that gave more weight to visits nearer to the time of conversion/end of the evaluation window.
  • Acquisition source-Web analytics data had indicated that visitors from particular sources consistently demonstrated a much higher than average conversion rate. As such, acquisition source was deemed a critical variable in predicting conversion probability.


It must be noted that the variables above represented the final set that was arrived at after extensive iterations and factor analysis on more than 25 different variables that were originally shortlisted for model building.
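A few of the derived variables above (recency, frequency, and a recency-weighted engagement index) can be computed from visit-level data roughly as follows. This Python sketch is illustrative only; the function name, the weighting scheme, and the sample timestamps are assumptions, not the client's exact formulas.

```python
from datetime import datetime

def visitor_features(visits, window_end, conversion_time=None):
    """Derive visitor-level features from (timestamp, visit_score)
    tuples. For converters, recency is measured against the conversion
    time; for non-converters, against the end of the evaluation window."""
    visits = sorted(visits)                      # chronological order
    ref = conversion_time or window_end
    recency_days = (ref - visits[-1][0]).days    # lag since last visit
    frequency = len(visits)
    # Weighted engagement: visits closer to the reference time get
    # higher weight, mirroring the weighted average described above
    weights = [1.0 / (1 + (ref - ts).days) for ts, _ in visits]
    engagement = sum(w * s for w, (_, s) in zip(weights, visits)) / sum(weights)
    return {"recency_days": recency_days,
            "frequency": frequency,
            "engagement_index": engagement}

end = datetime(2024, 1, 31)
visits = [(datetime(2024, 1, 1), 2.0),    # early, low-value visit
          (datetime(2024, 1, 28), 8.0)]   # recent, high-value visit
feats = visitor_features(visits, window_end=end)
```

Note how the recent high-value visit dominates the engagement index, which is the intent of the recency weighting.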

Step 5-Data collection

A fundamental prerequisite to customer-centric web data mining is the ability to assemble visitor-level data from across multiple sessions. Traditional web analytics tools that record clickstream data are unlikely to be suitable for this form of analysis without extensive hacking, and will most likely ruin the cost/benefit equation for the entire mining effort. Specialized tools exist on the market (e.g. Celebrus, Candii, Segment) that can record visitor-level data organized hierarchically and chronologically into sessions and events per page. Alternatively, tech-savvy Marketers can quickly assemble their own web analytics solution using open source big data technologies (e.g. Amazon S3, EMR, and Pig). A common practice here is to sessionize hit-level log data from pixels into user-level JSON files using Pig. Further ETL can then be used to organize the JSON data into RDBMS tables where data is organized in rows at the visitor level. The latter form is better suited to data mining with tools such as R and SAS.
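The sessionization step can be illustrated as follows. This is a hedged Python sketch of the standard 30-minute inactivity rule (the article's pipeline used Pig for this); the function name, hit tuple layout, and sample hits are assumptions for illustration.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that close a session

def sessionize(hits):
    """Group (visitor_id, unix_ts, page) hits into per-visitor sessions,
    starting a new session whenever the gap between consecutive hits
    exceeds the inactivity timeout."""
    by_visitor = defaultdict(list)
    for vid, ts, page in sorted(hits, key=lambda h: (h[0], h[1])):
        by_visitor[vid].append((ts, page))
    result = {}
    for vid, events in by_visitor.items():
        sessions, current = [], [events[0]]
        for prev, ev in zip(events, events[1:]):
            if ev[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append(current)   # close the stale session
                current = []
            current.append(ev)
        sessions.append(current)
        result[vid] = sessions
    return result

hits = [("cookie1", 0, "/home"),
        ("cookie1", 600, "/pricing"),
        ("cookie1", 10_000, "/home"),   # >30 min gap -> new session
        ("cookie2", 50, "/signup")]
sessions = sessionize(hits)
```

The resulting per-visitor session lists map naturally onto the hierarchical visitor/session/event JSON structure described above, and can then be flattened into one-row-per-visitor tables for modeling.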

For our example, all cookie level data was captured from a popular enterprise customer analytics platform that was already deployed at the customer site. In line with client IT requirements, JavaScript code deployment was kept to a minimum and most of the data manipulation was done using ETL on analytics warehouse data.

Step 6-Model development

This step represents the most technical stage of statistical data analysis. In our example, extensive use was made of standard R packages, in addition to a number of proprietary packages that we routinely use to increase productivity (custom packages for automating tasks such as identifying variables with high bivariate collinearity or low correlation with the output variable, outlier detection, normal distribution checks, missing values, duplicates and many more).

The outputs of this phase were two separate models-one each for the datasets containing 1:4 and 1:10 ratios of converters to non-converters.

Step 7-Model evaluation

In step 6, we developed 2 regression models to predict the conversion probability of users as a function of certain independent variables. ROC curves and Area Under the Curve (AUC) metrics were then used for comparing the two classifiers. Each model produced a numeric conversion probability per user, and overall 17 different probability thresholds were used to construct the ROC curves for each model. Perhaps not surprisingly, the model using the 1:4 proportion of converters to non-converters performed slightly better on the AUC metric.

When deployed, our model would implicitly bar a user from being shown retargeting ads once deemed to have a high probability of converting anyway. This implies that we must retain the maximum number of users who are unlikely to convert on their own, or equivalently, minimize the number of visitors wrongly classified as likely to convert when they actually would not. In technical terms, this meant minimizing false positives. A probability threshold was selected in line with this requirement, and all users scoring above it were deemed likely to convert on their own.
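The ROC construction and false-positive-aware threshold selection can be sketched as below. This Python illustration uses made-up scores, a reduced threshold grid, and a hypothetical false-positive-rate cap; the project's actual 17 thresholds and selection rule are not reproduced here.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, fpr, tpr) per threshold, where a score >=
    threshold classifies a user as 'likely to convert anyway'."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points

def pick_threshold(points, max_fpr=0.05):
    """Choose the lowest threshold whose false positive rate stays under
    the cap: capture as many organic converters as possible while
    limiting non-converters wrongly excluded from retargeting."""
    ok = [p for p in points if p[1] <= max_fpr]
    return min(ok, key=lambda p: p[0])[0] if ok else None

scores = [0.9, 0.8, 0.75, 0.4, 0.3, 0.2, 0.1, 0.05]   # model outputs
labels = [1,   1,   0,    1,   0,   0,   0,   0]      # 1 = converted
pts = roc_points(scores, labels, thresholds=[i / 10 for i in range(1, 10)])
t = pick_threshold(pts, max_fpr=0.25)
```

Plotting the (fpr, tpr) pairs gives the ROC curve itself; the AUC comparison between the two models follows from the same points.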

Step 8-Model deployment

Records for all visitors (cookies with unique IDs) who were scored were flagged with a 0/1 for conversion propensity in the analytics data warehouse. Writing the flag back from the data warehouse to the visitor cookie was done using simple server-side scripting that checked for the presence of a retargeting flag cookie and set one if none existed. The tag management solution deployed at the customer site simply checked for the presence of this cookie before firing the retargeting tag. On the offline side, regular monitoring of model prediction results was set up to track classification errors over time.
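The write-back and tag-firing decision amount to a few lines of server-side logic, sketched here in Python. The cookie name, flag encoding, and function are hypothetical; the real deployment was split between server-side scripting and the tag management tool.

```python
def retargeting_decision(cookies, propensity_flags, visitor_id):
    """Write the warehouse propensity flag (1 = likely to convert
    anyway) into the visitor's cookie jar if absent, then decide
    whether the retargeting tag should fire. The tag fires only for
    visitors NOT flagged as likely organic converters."""
    if "rt_flag" not in cookies:
        # Write-back: unknown visitors default to 0 (eligible for ads)
        cookies["rt_flag"] = str(propensity_flags.get(visitor_id, 0))
    return cookies["rt_flag"] == "0"

# Hypothetical warehouse export: cookie ID -> propensity flag
flags = {"cookie_abc": 1, "cookie_xyz": 0}

jar = {}                                             # first visit, no cookie yet
fire = retargeting_decision(jar, flags, "cookie_abc")
```

On subsequent requests the cookie already exists, so the decision is made without another warehouse lookup, which keeps the check cheap at serving time.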

Business Outcomes

Before running the split tests (without excluding high probability converters from both default and control groups), the Marketing team had calculated retargeting campaign conversion rates at about 2.5%. With an average retargeting CPC cost of $8, this translated into a cost per acquisition of $320. After excluding high probability converters, the true incremental conversion rate from retargeting fell to 1.2%, with the cost per acquisition ballooning to $667! With other channels such as native ads and sponsored content (highly popular videos, based on web analytics data) providing far lower acquisition costs, it was immediately decided to review the entire budget allocation approach given the sheer volume of media spend at risk.


The article above describes an isolated case of how a carefully planned and executed web data mining project can provide tangible benefits not just to marketing but to the entire corporate bottom line. The technical sophistication required for implementing advanced analytics solutions in the digital world may be much higher than that for traditional database marketing (e.g. the need for a deep understanding of clickstream data models, cookies, JavaScript, specialized analytics data warehouses, big data volumes, applied statistics techniques, etc.) but so is the return. For direct response marketing involving large digital acquisition budgets, not taking this leap is simply not an option.

Article purpose

To outline the conceptual approach for a specific use case of using advanced analytics with clickstream data.

Target audience

Customer insights Managers and data analysis professionals in large marketing departments looking for specific guidance on how to assemble, analyze and interpret results from complex, multi-touch advertising campaigns.

About the Client

A boutique Analytics Consultancy working specifically with Enterprise E-commerce customers running multi-million dollar digital marketing campaigns.