For data mining to be adopted within the wider business community and beyond a select set of academics and analytics practitioners, it is imperative that common myths about the technique be debunked. Failing this, the science is likely to remain confined to academic journals and scholarly articles rather than find its utility in addressing actual corporate challenges.

In this article, we discuss five common myths about data mining and the facts that counter them. Our aim is to build a case for treating data mining as a comprehensive business process, as opposed to common simplifications that focus on its technical aspects in isolation.

Myth #1: Data mining can only be performed by skilled statisticians and mathematicians or by technologists and tool experts (e.g. R specialists, SPSS technicians, SAS engineers).

Quite the opposite is true. Data mining is a business process in which intricate knowledge of the business and the underlying data is of paramount importance. When performed without business knowledge, data mining may well produce nonsensical results with no business relevance. In fact, engineers, statisticians and other tool experts play a limited role, confined to specific stages of the data mining lifecycle (mostly data preparation and modeling). Framing the business problem in analytical terms and interpreting the modeling results are far more important than tool-specific knowledge or mathematical skills.

Myth #2: Data mining is all about technical algorithms. The quality of predictions from a model will be directly linked to the sophistication of the underlying algorithm used.

This myth is common among inexperienced analysts who are just starting out on their data mining journey. Technical indicators of model fit and accuracy are rarely a reliable barometer of the business outcome of a data mining endeavor. Even the best algorithms can produce meaningless or inconsequential results if the business problem is not framed correctly or if the underlying dataset is not prepared properly. Of course, this is not to trivialize the importance of new or improved data mining algorithms. The problem occurs when data miners focus too much on the algorithms and other technical aspects while largely ignoring the other 90–95 percent of the data mining process, which involves framing the analytics problem appropriately and pre-processing the data accordingly.

Myth #3: Data mining is all about predictive accuracy, and the model that provides the best accuracy is the one best suited for production deployment.

It is a common misconception that the main criterion for evaluating the success of a data mining project is the predictive accuracy of the models it generates. This view, however, almost completely misrepresents the role of algorithms in the data mining process. While models obviously need to demonstrate a high degree of predictive accuracy, the model finally deployed is often not the one with the highest predictive accuracy but the one that offers the best balance of accuracy and practical utility.

Consider, for example, a model that predicts the likelihood of a visitor making a website purchase. If the driving factors for this model are not clear and cannot be turned into practical customer journey optimization tactics, then its utility will be very limited regardless of how accurately it predicts conversion propensity. No practical changes can be made to any aspect of the customer journey without understanding why certain visitors convert while others don’t.
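To make the trade-off concrete, here is a minimal Python sketch of one way such a choice could be expressed. Everything in it is an assumption made for illustration: the model names, the accuracy figures, the `explainable` flag and the 3-point tolerance are invented and are not taken from any real project.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float      # holdout accuracy between 0 and 1 (invented figures)
    explainable: bool    # can the business act on the model's driving factors?


def pick_for_deployment(candidates, tolerance=0.03):
    """Prefer an explainable model whose accuracy is within `tolerance` of the best."""
    best = max(c.accuracy for c in candidates)
    shortlist = [
        c for c in candidates
        if c.explainable and best - c.accuracy <= tolerance
    ]
    # Fall back to all candidates if no explainable model is close enough.
    pool = shortlist or candidates
    return max(pool, key=lambda c: c.accuracy)


# Hypothetical candidates for the website purchase example.
candidates = [
    Candidate("black_box_ensemble", 0.91, False),
    Candidate("decision_tree", 0.89, True),
    Candidate("single_rule", 0.74, True),
]

chosen = pick_for_deployment(candidates)
print(chosen.name)
```

With these invented figures the sketch selects the decision tree: it gives up two points of accuracy but exposes driving factors the business can act on. How much accuracy is worth trading away is a business decision, not a technical one.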

Myth #4: Data mining requires a data warehouse and other complex underlying data infrastructure.

While there may be some truth to this widely held viewpoint, a properly designed data warehouse or other complex data infrastructure is more of a nice-to-have than a must-have for successful data mining projects. A data warehouse is more commonly used to store large amounts of historical data that can be aggregated into purpose-built operational data stores, largely for reporting purposes. The underlying data model of a data warehouse or an operational data store is better suited to reporting than to entity-level predictive analytics. As such, in most practical deployments, the data used for mining is picked up from simple relational databases that can easily be populated from raw data using ETL techniques.

Myth #5: Data mining is only beneficial when there is access to copious amounts of raw data.

Large volumes of data are not a prerequisite. Data mining models are generally built on samples of data, and the trick lies in ensuring that the sample is largely representative of the actual population. As long as this prerequisite is met, the models generated can work with a high degree of accuracy and precision when deployed on production data. In any case, data mining is a largely iterative process, and models need to be refined continually as new data is generated.
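One simple way to sanity-check representativeness is to compare how a key attribute is distributed in the sample and in the population. The Python sketch below illustrates the idea under stated assumptions: the visitor segments and counts are invented, and the helper names `proportions` and `max_proportion_gap` are hypothetical, not part of any library.

```python
from collections import Counter


def proportions(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def max_proportion_gap(population, sample):
    """Largest absolute difference in category share between population and sample."""
    pop = proportions(population)
    smp = proportions(sample)
    categories = set(pop) | set(smp)
    return max(abs(pop.get(c, 0.0) - smp.get(c, 0.0)) for c in categories)


# Invented data: 70% new visitors and 30% returning visitors in the population.
population = ["new"] * 700 + ["returning"] * 300
good_sample = ["new"] * 68 + ["returning"] * 32
skewed_sample = ["new"] * 30 + ["returning"] * 70

good_gap = max_proportion_gap(population, good_sample)
skewed_gap = max_proportion_gap(population, skewed_sample)
print(f"good sample gap:   {good_gap:.2f}")
print(f"skewed sample gap: {skewed_gap:.2f}")
```

A small gap suggests the sample mirrors the population on that attribute, while a large gap is a warning that a model built on the sample may not carry over to production data. A real project would check several attributes and decide what gap is acceptable for the business problem at hand.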

Summary

In almost any business context, data mining is a comprehensive business practice that rests on a number of pillars, including but not limited to processes, skilled resources, technologies, and governance. Tools and technologies do play a key role in determining the outcome of data mining investments. However, overemphasizing these technical aspects, or tackling them in isolation from the business context, is likely to produce only failed investments and overall dissatisfaction with data mining, a strategic capability that has immense potential to generate tangible competitive advantage.

About this article

A short-form article that describes common myths about data mining and how these prevent broader adoption of this technique across the enterprise.

Target audience

A non-technical audience that does not yet see data science as anything beyond statistics and complex math.

Article purpose

Articulate the various elements of data mining and drive home the point that knowledge of the business and the underlying data is far more important than the choice of data mining tool or modeling algorithm.