Document Clustering, classification and Data Mining

If a decision tree is used, it will provide a set of branching conditions that successively split the customers into groups defined by the values in the independent variables. The aim is to be able to produce a set of rules or a model of some sort that can identify a high percentage of responders. A decision tree may formulate a condition such as:

customers who are male and married and have incomes over £50,000 and who are home-owners responded to our offer

The condition selects a much high percentage of responders than if you took a random selection of customers.

In contrast, a neural network identifies which class a customer belongs to, but cannot tell you why. The factors that determine its classifications are not available for analysis, but remain implicit in the network itself. Anther set of techniques used for classification are k-nearest neighbour algorithms

Understanding and prediction Sophisticated classification techniques enable us to discover new patterns in large and complex data sets. Classification is therefore a powerful aid to understanding a particular problem, whether it is response rates to a direct mailing campaign or the influence of various factors on the likelihood of a patient recovering from cancer.

In some instances, improved understanding is sufficient. It may suggest new initiatives and provide information that improves future decision making. However, in many instances the reason for developing an accurate classification model is to improve our capability for prediction.

For example, we know that historically 60 per cent of customers who are male, married and have incomes over £60,000 responded to a promotion (compared with only three per cent of all targeted customers). There is therefore a better than average chance that new customers who fit this profile will also be interested in our product. In practice, data mining algorithms can find much more complex relationships involving numerous predictor variables, thus providing a much finer segmentation of customers.

A classification model is said to be ‘trained’ on historical data, for which the outcome is known for each record. It is then applied to a new, unclassified data set in order to predict the outcome for each record.

There are important differences between classifying data in order to under-stand the behaviour of existing customers and using that classification to predict future behaviour. For a historical data set, it is often possible to produce a set of rules or a mathematical function that classifies every record accurately. For example, if you keep refining your rules you end up with a rule for each individual of the form:

100 per cent of customers called Smith who live at 28 Arcadia Street responded to our offer.

Such a rule is of little use for predicting the classification of a new customer. In this case, the model is said to be ‘over-trained’ or ‘overfitted’ to the historical data set.

Building a good predictive model requires that overfitting be avoided by testing and tuning a model to ensure that it can be ‘generalised’ to new data.

next previous