10 pitfalls to avoid in commercial data mining

Why data mining projects fail, and some key pitfalls to avoid

Dheeraj Saxena, July 10, 2020

Data mining is a business process that requires business and domain knowledge. These are far more important than knowledge of algorithms and data technologies. Reducing the problem to a choice of mining technique or data collection technology is one of the most prominent drivers of failed data mining projects.

Introduction

A typical data mining project is full of risks and pitfalls which must be identified upfront. A suitable resolution must be defined for each risk in order to avoid wasting your investment in data mining. This article outlines the risks, debunks myths, and attempts to provide some protective “hard hats” for data miners.

The underlying message is that data mining is a business process to deliver a business solution. Embarking on a data mining journey without being able to frame the business problem appropriately leads to getting sidetracked by technical details or attempting delivery without appropriate governance.

Data Mining projects must be careful to avoid the following pitfalls:

Pitfall #1-Inability to convert the business problem into a data problem

Let us look at a typical business challenge involving data: The need to identify de-duplicated conversions and attribute them to individual channels in a large online marketing campaign. What does this mean from a data perspective? Here are potential ways that this could be interpreted from a data point of view:

  1. Identify the impact of pre-acquisition customer touchpoints (e.g. display ads) in driving visitors down the conversion funnel. The proportion of spend on display ads is large and we do not have visibility into how this impacts conversions. The data challenge here is to collect cookie-level information about various impressions and then use this information along with converter/non-converter data to isolate the impact of display ads.
  2. We do not care so much about pre-acquisition touchpoints that do not actually involve clicks since we only pay for clicks. What we need is insight into de-duplicated conversions from paid search, paid social, native ads, and affiliates, which are our main channels of acquisition. The data challenge here is to collect the cookie-level touchpoints (across domain/device/browser) at an individual channel level and then build some form of regression analysis that provides insight into the incremental impact of each channel on conversion.
  3. Most of our conversions happen on a third-party site (e.g. Amazon), but we drive traffic to our various microsites where we first provide our prospects with educational material. Our attribution challenge is to identify which campaigns drive the most sales on Amazon. The data challenge here is to connect user journeys across domains at the campaign level and then develop a model that identifies the isolated impact of each campaign on final conversions.

Without an explicit idea of the specific data challenge that you are looking to address, the project will most likely get off to a false start.
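To show what "framing it as a data problem" can look like, here is a minimal sketch that de-duplicates conversions by order ID and then splits each conversion evenly across the channels that touched the cookie beforehand. This is a deliberately simple stand-in for the regression analysis mentioned above, not a replacement for it. The record layouts, channel names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical touchpoint log: (cookie_id, timestamp, channel).
touchpoints = [
    ("c1", 1, "paid_search"),
    ("c1", 2, "affiliate"),
    ("c2", 1, "paid_social"),
    ("c3", 1, "native_ads"),
    ("c3", 3, "paid_search"),
]

# Hypothetical conversion log: (order_id, cookie_id, timestamp).
# Order "o1" appears twice, as can happen when two channels both report it.
conversions = [
    ("o1", "c1", 5),
    ("o1", "c1", 5),
    ("o2", "c3", 4),
]


def deduplicate(conversions):
    """Keep one record per order_id, preserving first-seen order."""
    seen = {}
    for order_id, cookie_id, ts in conversions:
        seen.setdefault(order_id, (order_id, cookie_id, ts))
    return list(seen.values())


def even_split_attribution(touchpoints, conversions):
    """Split each de-duplicated conversion equally across the distinct
    channels that touched the cookie before the conversion."""
    by_cookie = defaultdict(list)
    for cookie_id, ts, channel in touchpoints:
        by_cookie[cookie_id].append((ts, channel))

    credit = defaultdict(float)
    for _order_id, cookie_id, conv_ts in deduplicate(conversions):
        channels = sorted({ch for ts, ch in by_cookie[cookie_id] if ts < conv_ts})
        for ch in channels:
            credit[ch] += 1.0 / len(channels)
    return dict(credit)


credit = even_split_attribution(touchpoints, conversions)
```

Even a toy version like this forces the questions that matter: what identifies a unique conversion, which touchpoints count, and how credit is shared. Those are business decisions, not algorithmic ones.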

Pitfall #2-Attempting to collect large volumes of data

Big data has almost become a fad when it comes to data mining. Novice analysts and data miners spend far too much time thinking about collecting large volumes of data in order to not miss out on granular details. WRONG!

There is something called sampling, and it takes a lot of skill to get right. Failing to invest time in acquiring these skills, or in people who are experts in them, leaves businesses spending inordinate amounts on hardware and software resources just to collect data. Not only does this bump up costs, it also significantly delays time to insights.

Note, though, that while you cannot simply assume that a good model requires large volumes of data, this scenario cannot be completely ruled out either. For example, if individual data rows vary across many characteristics, a large dataset may well be needed to cover all possible factors. This, though, is usually the exception rather than the norm.

Data sampling, just like the wider data mining practice, is much more of an art than a science. The underlying objective is to draw representative samples that are both deep and wide enough to build good models. Probability-based (e.g. cluster, stratified, simple random) and non-probability (e.g. quota, snowball) sampling techniques should both be assessed in order to avoid brute-force data collection.
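As an illustration of one of the probability-based techniques, here is a minimal sketch of stratified sampling using only the Python standard library. The customer table, the plan names, and the `stratified_sample` helper are assumptions made up for the example.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum, keeping at least one row
    per stratum so that small groups are not lost."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical customer table: 900 prepaid, 90 postpaid, 10 enterprise.
customers = (
    [{"id": i, "plan": "prepaid"} for i in range(900)]
    + [{"id": i, "plan": "postpaid"} for i in range(900, 990)]
    + [{"id": i, "plan": "enterprise"} for i in range(990, 1000)]
)

sample = stratified_sample(customers, key=lambda r: r["plan"], fraction=0.1)

counts = defaultdict(int)
for row in sample:
    counts[row["plan"]] += 1
```

A simple random sample of 100 rows from this table could easily contain no enterprise customers at all. Stratifying by plan guarantees that every plan is represented, which is the "wide" part of "deep and wide".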

Pitfall #3-Insufficient business knowledge

Data mining is a business process, driven by business needs and constraints. High-level knowledge of the domain and business is critical, in addition to a very good grasp of operational details. For example, an ad operations analyst who spends the bulk of his time doing campaign configuration is unlikely to have strategic insight into high-level business drivers and company direction.

Similarly, a department head who typically has to deal with all sorts of extraneous issues (people management, delivering projects, departmental politics!) is unlikely to have the time or even the functional knowledge needed to act as a bridge between the business and technical teams.

The overwhelming need here is that of a senior domain specialist who brings in deep business knowledge (not just domain knowledge) and who can provide project steer upfront.

Pitfall #4-The disappearing terabyte

Consider a scenario wherein a telecom provider wishes to identify the customers most likely to churn and offer them promotions. Offering promotions to customers who are likely to stay anyway, or to ones who are going to leave regardless, is wasted expense and must be avoided.

The company offers multiple plans and wishes to do this analysis for customers on each plan. The data planners start to think in advance and spend almost a year collecting all sorts of behavioral and transactional data about their customers.

A deeper analysis reveals that their churn volume is actually quite low already, and that the customers who do churn do so because of specific issues with customer service and availability of connectivity. Suddenly, all those terabytes of data collected, and all the time and resources invested in building up the software/hardware infrastructure, begin to look like a wasted investment.
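The kind of early sanity check that could have surfaced this is small. The sketch below computes churn rate per plan and tallies the stated reasons for leaving from a modest sample of existing account records. The plans, field names, and figures are invented for illustration and are not taken from any real provider.

```python
from collections import Counter


def account(plan, churned=False, reason=None):
    """Build one hypothetical account record."""
    return {"plan": plan, "churned": churned, "reason": reason}


# Hypothetical sample: 50 accounts per plan, one churner in each.
accounts = (
    [account("basic") for _ in range(49)]
    + [account("basic", churned=True, reason="customer_service")]
    + [account("premium") for _ in range(49)]
    + [account("premium", churned=True, reason="connectivity")]
)


def churn_rate_by_plan(accounts):
    """Return {plan: churned / total} for the sample."""
    totals, churned = Counter(), Counter()
    for acc in accounts:
        totals[acc["plan"]] += 1
        churned[acc["plan"]] += acc["churned"]  # True counts as 1
    return {plan: churned[plan] / totals[plan] for plan in totals}


def churn_reasons(accounts):
    """Count stated reasons among churned customers, most common first."""
    return Counter(a["reason"] for a in accounts if a["churned"]).most_common()


rates = churn_rate_by_plan(accounts)
reasons = churn_reasons(accounts)
```

If the rates come back low and the reasons cluster around service and connectivity, that is a strong hint to fix those issues before funding a year of data collection.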

Pitfall #5-Insufficient data knowledge

Operational staff (end users) who are responsible for the day-to-day execution of business processes have the most detailed knowledge about the data. These people must be made part of the project from a very early stage, rather than relying on documentation or self-study of client business processes and document formats.

This problem is significantly exacerbated when the database or some part of the process is outsourced to third-party agencies, who have no incentive to collaborate with data miners, especially if the miner belongs to another agency. Another scenario to consider is when the people who originally designed the metadata have left the company and newer members do not know the data in enough detail.

Appropriate time and budget must be allocated to ensure that there is appropriate knowledge transfer in this case.

Pitfall #6-Erroneous assumptions

Business and data experts are crucial resources to the project, but this does not mean that the data miner should unquestioningly accept every statement they make. It is entirely likely that they do not have insight into every individual data case, and in many cases simply accepting their statements can prevent miners from discovering things that were not already known.

Typical examples of erroneous or misleading assumptions might include:

  • No customer can hold accounts of multiple types
  • No case can include more than one event of this type
  • Only the following values can be present in a data element

The data miner should always seek to confirm the validity of experts’ statements using exploratory data analysis of random samples. Data profiling and quality assessment tools do a great job of identifying such anomalies.
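Checks of this kind are cheap to write. The following is a minimal sketch, assuming a small random sample of account rows, that tests two of the example assumptions above directly against the data. The field names, allowed values, and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample of account rows drawn at random from the source system.
rows = [
    {"customer_id": 1, "account_type": "savings", "status": "active"},
    {"customer_id": 1, "account_type": "current", "status": "active"},
    {"customer_id": 2, "account_type": "savings", "status": "closed"},
    {"customer_id": 3, "account_type": "savings", "status": "DORMANT"},
]

# Values the experts said are the only ones possible (an assumption to test).
ALLOWED_STATUS = {"active", "closed"}


def customers_with_multiple_types(rows):
    """Customers that contradict 'no customer can hold accounts of multiple types'."""
    types = defaultdict(set)
    for row in rows:
        types[row["customer_id"]].add(row["account_type"])
    return sorted(cid for cid, t in types.items() if len(t) > 1)


def unexpected_values(rows, field, allowed):
    """Values that contradict 'only the following values can be present'."""
    return sorted({row[field] for row in rows} - allowed)


multi_type = customers_with_multiple_types(rows)
bad_status = unexpected_values(rows, "status", ALLOWED_STATUS)
```

An empty result supports the expert's statement for the sample at hand. A non-empty one is a conversation starter, and often the first real discovery of the project.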

Pitfall #7-Ignoring data

Data profiling typically throws up all sorts of data anomalies like missing values, outliers, abnormal distributions, and so on. Since many modeling techniques work best on well-behaved, roughly normal distributions, novice analysts have an overwhelming tendency to discard inconvenient data in an attempt to create artificially normal distributions. This practice is full of peril and must be avoided at all costs.
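One way to avoid silently dropping records is to flag anomalies and carry the flags forward. The sketch below marks missing values and outliers using the common 1.5 × IQR rule of thumb, without removing anything. The spend figures and the `profile` helper are hypothetical, and the threshold is a convention rather than a rule.

```python
import statistics


def profile(values):
    """Flag missing values and IQR outliers without removing any record."""
    present = [v for v in values if v is not None]
    q1, _median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [
        {
            "value": v,
            "missing": v is None,
            "outlier": v is not None and not (low <= v <= high),
        }
        for v in values
    ]


# Hypothetical monthly spend per customer; None marks a missing value.
spend = [20, 22, 19, 21, None, 23, 20, 400, 18, 22]
flags = profile(spend)
outliers = [f["value"] for f in flags if f["outlier"]]
```

Every input record is still present in `flags`, so the decision about what to do with the 400 (data entry error, or the most valuable customer?) stays with someone who knows the business.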

Secondly, not all relevant data already sits in nicely designed MySQL tables. A lot of data in a typical organization actually sits outside of the database, e.g. in emails, chat logs, phone calls, letters, reviews, and so on. A good model would benefit immensely from building data aggregations across these diverse sources, rather than just ignoring them.

Pitfall #8-Incorrect tool selection or data jail-houses

Enterprise data mining software often tends to use proprietary data formats that require a substantial amount of data transformation and prep work before any modeling can be done.

This is a gross distraction that must be avoided in order not to add time, expense, and operational complexity to the project. Not only is the transformed data not reusable should you decide to move to another tool, but the lock-in also prevents you from comparing the modeling output with that of another tool.

Pitfall #9-Not knowing how to interpret models

It is incredibly hard to actually operationalize modeling recommendations without adequate massaging. Consider an attribution model for a marketing campaign that suggests a very low contribution from specific display ad campaigns and inventory partners. Purely going by modeling results, one would be tempted to strike off these channels from future campaigns. But in all likelihood, the company would have at least some contract with the inventory partners which would mean that replacing them outright may not be operationally possible.

The key here is to go beyond mathematical coefficients and indicators of model precision/accuracy and consider how the results can best be operationalized. In many cases, this may not even be possible, but making that judgment itself requires deep business knowledge, which is often missing.

Pitfall #10-Lack of governance

This is one of the biggest pitfalls of a typical enterprise data mining project. Despite the best of intentions and resource allocations, data mining is often implemented in an ad hoc manner, with no clear goals and no idea of how the results will be used to deliver actual business outcomes. In order to produce useful results, it is critical to have clearly defined business objectives, data mining goals, and deployment plans, all formulated early on in the project. The urge to jump into modeling technique selection or data assembly considerations must be resisted until thorough contextual oversight is in place.

A simple way of ensuring this is to use a standard process such as the Cross-Industry Standard Process for Data Mining (CRISP-DM). Such a process provides a framework built from the collaborative input of a large number of industry leaders, yet it is flexible enough to be adapted to an individual business context.

Summary

Succeeding in data mining involves avoiding common pitfalls. What is needed is a combination of strong governance, business knowledge, technical expertise, and a very strong strategic oversight that provides constant steer throughout the project lifecycle. The process is highly iterative, but knowing where things typically go wrong and identifying fixes for these early on in the initiative can go a long way in realizing the promise of data mining.
