Diving Into Data Mining

October 28, 2005

by Dr Fern Halper, Partner

What is Data Mining?

According to Berry and Linoff (Mastering Data Mining, John Wiley & Sons Inc. 2000), "Data mining is the process of exploration and analysis of large amounts of data in order to discover meaningful patterns and rules."  This is often achieved by automatic and semi-automatic means.  Data mining came out of the fields of statistics and Artificial Intelligence (AI), with a bit of database management thrown into the mix.  The term predictive analytics, while a bit of a misnomer, is a loose term used to describe advanced modeling.  People now use the two terms interchangeably, or sometimes refer to simple data profiling as data mining – a historical misnomer, but now seemingly acceptable.

The goal of a data mining analysis is either classification or prediction. In classification, the idea is to predict into which classes (i.e. discrete groups) data might fall. For example, a marketer might be interested in who will respond vs. who won't respond to a promotion.  These are two classes.  In prediction, the idea is to predict the value of a continuous variable.

Common algorithms used in data mining include:

  • Classification Trees. This is a popular data mining technique that is used to classify a dependent categorical variable based on measurements of one or more predictor variables.  The result is a tree with nodes and links between the nodes that can be read as a set of if-then rules.
  • Logistic Regression. This is a statistical technique that is a variant of standard regression but extends the concept to deal with classification.  It produces a formula that predicts the probability of the occurrence as a function of the independent variables.
  • Neural Networks. This is a generic software algorithm that is modeled after the parallel architecture of animal brains.  The network consists of input nodes, hidden layers, and output nodes.  Each connection between units is assigned a weight.  Data is given to the input nodes, and by a system of trial and error the algorithm adjusts the weights until it meets a certain stopping criterion.  Some people have likened this to a black box approach.
  • Clustering and Nearest-Neighbor Techniques. Clustering identifies groups of similar records. A related distance-based technique is K-nearest neighbors, which calculates the distances between a new record and the points in the historical (training) data.  It then assigns the record to the class of its nearest neighbor in the dataset or, when K is greater than 1, to the most common class among its K nearest neighbors.
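
As a rough illustration of the K-nearest neighbor idea in the list above, here is a minimal Python sketch. The function name, the two attributes (years of service and monthly spend), and the sample records are all assumptions made up for this example; they are not from any real data set or vendor tool. With k=1 it reduces to the "class of its nearest neighbor" rule described above.

```python
import math
from collections import Counter

def knn_classify(training, record, k=3):
    """Assign `record` the most common class among its k nearest training points."""
    # training is a list of (features, label) pairs; features are numeric tuples.
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], record))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical historical data: (years of service, monthly spend) -> class
training = [
    ((12, 40), "stayer"),
    ((15, 55), "stayer"),
    ((11, 35), "stayer"),
    ((1, 20), "flight risk"),
    ((2, 25), "flight risk"),
    ((1, 60), "flight risk"),
]

print(knn_classify(training, (13, 45)))  # stayer
print(knn_classify(training, (2, 30)))   # flight risk
```

In practice the attributes would usually be put on a common scale first, since an attribute with a large numeric range would otherwise dominate the distance calculation.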

An Example

Let's examine a classification tree example.  Consider the case of a telephone company that wants to determine which residential customers are likely to disconnect their service.  They have on hand information consisting of the following attributes:

  • How long the person has had the service
  • How much they spend on the service
  • Whether they have had problems with the service
  • Whether they have the best calling plan for their needs
  • Where they live
  • How old they are
  • Whether they have other services bundled together with their calling plan
  • Competitive information concerning other carriers' plans
  • And whether they still have the service or have disconnected it

Of course, there can be many more attributes than this.  The last attribute is the outcome variable; this is what the software will use to classify the customers into one of the two groups, perhaps called stayers and flight risks.

The data set is split into a training set and a test set.  The training data consists of observations, each described by a set of attributes, and an outcome variable (binary in the case of this classification model): in this case, stayer or flight risk.  The algorithm is run over the training data and comes up with a tree that can be read like a series of rules.  For example,

If the customer has been with the company for more than 10 years and is over 55 years old, then the customer is a stayer
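
Written as code, a rule like this is just a conditional. The sketch below is illustrative only: the function and parameter names are hypothetical, and because the article gives only this one branch, everything that does not match the rule is labelled a flight risk purely as a placeholder. A real tree would have many more branches.

```python
def classify_customer(tenure_years, age):
    """Apply the single example rule read off the tree."""
    if tenure_years > 10 and age > 55:
        return "stayer"
    # Assumption: the other branches of the tree are not given in the
    # article, so every remaining customer is labelled a flight risk here.
    return "flight risk"

print(classify_customer(12, 60))  # stayer
print(classify_customer(3, 60))   # flight risk
```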

These rules are then run over the test data in order to determine how good the model is on "new data".  Accuracy measures are provided for the model.  For example, a popular technique is the confusion matrix.  This matrix is a table that shows how many cases were correctly vs. incorrectly classified.  If the model looks good, it can be deployed on other data as it becomes available.  Based on the model, the company might decide, for example, to send out special offers to those customers that it thinks are flight risks.
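
To make the confusion matrix concrete, here is a small Python sketch that tallies actual vs. predicted classes for a handful of test records. The labels and the six test outcomes are invented for illustration, and the helper names are assumptions rather than any particular product's API.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of classes occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of test records whose predicted class matches the actual one."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical test-set outcomes and the model's predictions for them
actual    = ["stayer", "stayer", "flight risk", "flight risk", "stayer", "flight risk"]
predicted = ["stayer", "flight risk", "flight risk", "stayer", "stayer", "flight risk"]

matrix = confusion_matrix(actual, predicted)
labels = ["stayer", "flight risk"]
for a in labels:
    for p in labels:
        print(f"actual={a:12} predicted={p:12} count={matrix[(a, p)]}")
print(f"accuracy = {accuracy(actual, predicted):.2f}")
```

The off-diagonal counts matter as much as the overall accuracy: a stayer mistaken for a flight risk costs an unnecessary special offer, while a missed flight risk may cost the customer.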

This technique is a type of "supervised learning", where the outcome is known for the historical data and is used to train the model.

It's Not Just About The Algorithms

Clearly, a number of steps need to be undertaken in order to make the process work.  Access to clean data is required, and issues like missing data need to be dealt with.  For example, how should a blank field be handled?  Outliers?  Does the data need to be normalized – transformed to make sense with the rest of the data?  Additionally, sound analytical techniques call for data to be explored before starting to run the algorithms.   This may include visualization and simple statistical analysis.  In this way, the analyst can determine whether all attributes need to be used and also come up with some hypotheses about the data. 
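
Two of the preparation questions above, blank fields and normalization, can be sketched in a few lines of Python. This is only one possible approach under stated assumptions: blanks are represented as None and filled with the median of the observed values, and normalization is a simple rescaling to the range 0 to 1. The column name and values are hypothetical, and other choices (dropping records, different scaling) may suit a given data set better.

```python
from statistics import median

def fill_missing(values):
    """Replace blank (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: nothing to spread out, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical monthly spend column with two blank fields
monthly_spend = [20, None, 40, 60, None, 100]

filled = fill_missing(monthly_spend)
print(filled)                     # [20, 50.0, 40, 60, 50.0, 100]
print(min_max_normalize(filled))  # [0.0, 0.375, 0.25, 0.5, 0.375, 1.0]
```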

After the analyst is comfortable with the model results, the model needs to be deployed. This means that there needs to be real data available in company systems (not the training or test data) that the model can actually run against.  Additionally, some sort of process needs to be put in place both to schedule when the model runs and to deal with the results.  Vendors such as SPSS and SAS have the right philosophy about all of this.  For more on SPSS, take a look at the article entitled "Empowering Business Analysts with Predictive Analytics" in this newsletter.

A Final Word

 
Data mining is very powerful and has been around for years, and Hurwitz & Associates is excited that it is finally hitting the mainstream.  Businesses that haven't considered it yet should do so – quickly.

 
