Document Clustering, Classification and Data Mining

Clustering

Clustering is an unsupervised operation. It is used where you wish to find groupings of similar records in your data without any preconditions as to what that similarity may involve. Clustering is used to identify interesting groups in a customer base that may not have been recognised before. For example, it can be used to identify similarities in customers’ telephone usage, in order to devise and market new call services.

Clustering is usually achieved using statistical methods, such as the k-means algorithm, or a special form of neural network called a Kohonen feature map (also known as a self-organising map). Whichever method is used, the basic operation is the same. Each record is compared with a set of existing clusters, each of which is defined by its ‘centre’. A record is assigned to the cluster whose centre it is nearest to, and this in turn changes the value that defines that cluster. Multiple passes are made through the data set to re-assign records and adjust the cluster centres until the assignments stop changing. The result is a stable solution, although not necessarily the best possible one, because it can depend on the starting centres.
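The assign-and-adjust loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation. It assumes numeric records, Euclidean distance, and randomly chosen starting centres, and it uses the common batch form of k-means, in which centres are recomputed once per pass rather than after every single assignment. The function name kmeans, its parameters and the sample shopper records are all hypothetical.

```python
import math
import random


def kmeans(records, k, max_passes=100, seed=0):
    """Cluster numeric records (equal-length tuples) into k groups.

    Returns the final cluster centres and a cluster index per record.
    """
    rng = random.Random(seed)
    # Assumption: use k randomly chosen records as the initial centres.
    centres = rng.sample(records, k)
    assignment = None
    for _ in range(max_passes):
        # Assign each record to the nearest centre (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(rec, centres[c]))
            for rec in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the centres are stable
        assignment = new_assignment
        # Move each centre to the mean of the records assigned to it.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centres, assignment


# Hypothetical records: (visits per month, average spend per visit).
shoppers = [(1, 200), (2, 220), (1, 210), (12, 30), (11, 25), (13, 35)]
centres, labels = kmeans(shoppers, k=2)
```

With this sample data the infrequent high spenders and the frequent low spenders end up in separate clusters. In practice the fields would normally be scaled first, so that a field with a large numeric range (such as spend) does not dominate the distance calculation.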

Looking for clusters amongst supermarket shoppers, for example, may require the analysis of many factors, including the number of visits made per month, the total spend per visit, spend per product category, time of visit and payment method.

Clustering is often undertaken as an exploratory exercise before doing further data mining using a classification technique. For this reason, good visualisation support is a helpful adjunct to clustering: it enables you to ‘play’ with your clusters in order to see if the clusters identified make sense and are useful in a business context.

Association analysis and sequential analysis

Association analysis is an unsupervised form of data mining that looks for links between records in a data set. Association analysis is sometimes referred to as ‘market basket analysis’, its most common application. The aim is to discover, for example, which items are commonly purchased at the same time to help retailers organise customer incentive schemes and store layouts more efficiently.

Consider the following beer and nappy example:

500,000 transactions

20,000 transactions contain nappies (4%)

30,000 transactions contain beer (6%)

10,000 transactions contain both nappies and beer (2%)

Support (or prevalence) measures how often items occur together, as a percentage of the total transactions. In this example, beer and nappies occur together 2% of the time (10,000/500,000).
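As a rough sketch, support can be computed by counting the transactions that contain every item in a set. The function name support and the five sample baskets below are hypothetical and purely illustrative; only the final calculation uses the figures from the example above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= set(basket))
    return hits / len(transactions)


# Hypothetical baskets, for illustration only.
baskets = [
    {"nappies", "beer"},
    {"beer"},
    {"nappies", "milk"},
    {"bread"},
    {"nappies", "beer", "bread"},
]
sample_support = support(baskets, {"nappies", "beer"})  # 2 of 5 baskets

# The figures from the example above: 10,000 joint transactions out of 500,000.
example_support = 10_000 / 500_000
print(f"{example_support:.0%}")  # prints 2%
```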

Confidence (or predictability) measures how strongly the presence of one item depends on the presence of another. Of the 20,000 transactions that contain nappies, 10,000 also contain beer, so when people buy nappies, they also buy beer 50% of the time (10,000/20,000). The confidence for the rule:
