Document Clustering, Classification and Data Mining

When people buy nappies they also buy beer 50% of the time.

is 50%.

The inverse rule, which would be stated as:

When people buy beer they also buy nappies 1/3 of the time.

has a confidence of 33.33% (computed as 10,000/30,000).

Note that these two rules have the same support (2% as computed above). Support is not dependent on the direction (or implication) of the rule; it is only dependent on the set of items in the rule.
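The arithmetic behind these figures can be sketched in a few lines of Python. Only the counts 10,000 and 30,000 appear in the text; the total of 500,000 transactions and the 20,000 nappy purchases are assumptions inferred from the quoted 2% support and 50% confidence, and the function names are illustrative.

```python
# Counts consistent with the percentages quoted in the text.
# 10,000 and 30,000 appear in the text; the other two are assumptions
# inferred from the stated 2% support and 50% confidence.
total_transactions = 500_000   # assumed: 10,000 / 2%
nappies = 20_000               # assumed: 10,000 / 50%
beer = 30_000
nappies_and_beer = 10_000


def support(both: int, total: int) -> float:
    """Fraction of all transactions that contain every item in the rule."""
    return both / total


def confidence(both: int, antecedent: int) -> float:
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return both / antecedent


print(f"support         = {support(nappies_and_beer, total_transactions):.2%}")
print(f"nappies -> beer = {confidence(nappies_and_beer, nappies):.2%}")
print(f"beer -> nappies = {confidence(nappies_and_beer, beer):.2%}")
```

Support takes the same arguments whichever way the rule points, which is why both rules share the 2% figure; only the denominator of the confidence changes with direction.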

In the absence of any knowledge about what else was bought, we can also make the following assertions from the available data:

People buy nappies 4% of the time.

People buy beer 6% of the time.

These numbers, 4% and 6%, are called the expected confidence of buying nappies or beer respectively, regardless of what else is purchased.

Lift measures the ratio between the confidence of a rule and the expected confidence that the second product will be purchased. Lift is one measure of the strength of an effect. In our example, the confidence of the nappies-beer buying rule is 50%, whilst the expected confidence that an arbitrary customer will buy beer is 6%. So, the lift provided by the nappies-beer rule is 8.33 (= 50%/6%).

The nappies-beer rule could have been expressed explicitly in terms of lift as:

People who buy nappies are 8.33 times more likely to buy beer too.

The interaction between nappies and beer is very strong. A key goal of an association or sequencing data mining exercise is to find rules that have a substantial lift like this.
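As a quick check of the lift figure, the following minimal sketch divides each rule's confidence by the expected confidence of its second item. The function name `lift` and the variable names are illustrative; the percentages are those quoted in the text.

```python
def lift(confidence: float, expected_confidence: float) -> float:
    """Ratio of a rule's confidence to the expected confidence of its second item."""
    return confidence / expected_confidence


# Figures quoted in the text.
nappies_to_beer = lift(0.50, 0.06)    # confidence 50%, beer bought 6% of the time
beer_to_nappies = lift(1 / 3, 0.04)   # confidence 33.33%, nappies bought 4% of the time

print(f"nappies -> beer lift: {nappies_to_beer:.2f}")
print(f"beer -> nappies lift: {beer_to_nappies:.2f}")
```

Both directions give the same value. Like support, lift depends only on the set of items in the rule, because it reduces to the joint frequency divided by the product of the two individual frequencies.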

Although rules with high confidence and support factors are important, those with lower levels may indicate less obvious patterns that suggest new marketing opportunities.

A key requirement for association analysis is the capacity to analyse very large databases. The numbers involved can be daunting: large retailers typically monitor over 100,000 product lines and handle millions of customer transactions each week.

Association analysis is not limited to retail applications. An insurance company, for example, might use it to look at links between health insurance claims for different conditions. It may also use sequential analysis to track the relationships between claims over time.

Sequential analysis is sometimes treated as a separate data mining operation, although we group it with association analysis. Sequential analysis looks for temporal links between purchases, rather than relationships between items in a single transaction. Sequential analysis will typically produce rules such as:

Ten per cent of customers who bought a tent bought a backpack within one month.
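A rule of this kind could be measured roughly as follows. This is a toy sketch, not a production method: the purchase log is invented for illustration, "one month" is approximated as 30 days, and the function name `sequence_confidence` is hypothetical.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer_id, item, purchase_date).
purchases = [
    ("c1", "tent", date(2024, 5, 1)),
    ("c1", "backpack", date(2024, 5, 20)),
    ("c2", "tent", date(2024, 5, 3)),
    ("c2", "backpack", date(2024, 8, 1)),   # outside the 30-day window
    ("c3", "tent", date(2024, 6, 10)),
    ("c4", "backpack", date(2024, 6, 12)),  # never bought a tent
]


def sequence_confidence(log, first, then, window):
    """Share of customers who bought `first` and bought `then` within `window` afterwards."""
    buyers = set()
    followers = set()
    for customer, item, when in log:
        if item != first:
            continue
        buyers.add(customer)
        for other_customer, other_item, other_when in log:
            # Same-day purchases count; earlier purchases of `then` do not.
            if (other_customer == customer and other_item == then
                    and timedelta(0) <= other_when - when <= window):
                followers.add(customer)
    return len(followers) / len(buyers) if buyers else 0.0


rate = sequence_confidence(purchases, "tent", "backpack", timedelta(days=30))
print(f"{rate:.0%} of tent buyers bought a backpack within 30 days")
```

The nested scan is quadratic in the size of the log, which is fine for illustration but would need grouping by customer or an index to cope with the transaction volumes described earlier.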
