Document Clustering, classification and Data Mining

 

risk level simply by multiplying the prior probability for risk level by all the likelihood figures from the above table for the independent variables (Bayes’ Theorem). The highest scoring value becomes the predicted value.

 

Name

Debt

Income

Married?

Risk
Actual

Good Risk
Score

Poor Risk
Score

Risk Predicted

Peter

High

High

Yes

Good

0.2

0.04

Good

Sue

Low

High

Yes

Good

0.2

0.09

Good

John

Low

High

No

Poor

0

0.04

Poor

Mary

High

Low

Yes

Poor

0

0.09

Poor

Fred

Low

Low

Yes

Poor

0

0.18

Poor

 

For example, in the above table the first row for Peter in the training set has High Debt, High Income and Married is Yes. In the first table the likelihoods associated with these values and Good Risk are 0.5, 1.0 and 1.0 respectively (see rows 1, 3 and 5). The product of these three numbers and the prior probability for Good (0.40) is 0.20 (0.50 x 1 x 1 x 0.40). For Poor Risk the probabilities (also from rows 1, 3 and 5) are 0.33, 0.33, and 0.67. With a prior probability for Poor Risk of 0.60, the score for Poor Risk is 0.044 (0.33 x 0.33 x 0.67 x 0.60). Because the score for Good is higher, we predict that Peter will be a Good risk.

 

The second table presents the actual risk, the scores for Good risk and Poor risk, and the predicted risk for all cases in the sample data. With all outcomes predicted correctly, the algorithm has 100 percent accuracy on the training set. Although this is a good sign, we must also validate the model by using it to predict the outcomes for separate test data, but we will skip that step.

 

The scores that we computed can also be easily converted to real posterior probabilities, by dividing each score by the sum of all scores for that case. For example, the posterior probability that Peter is a Good Risk is approximately 82 percent (0.2 divided by 0.244, which is the sum of the Good Risk value of 0.2 and the Poor Risk value of 0.044.). The posterior probability that Peter is a Poor Risk is 18 percent.

 

Note that we don’t need to know values for all the independent variables to make a prediction. In fact, if we know none of them, we can still predict using just the prior probability. If we know only the value for Income, we can use the conditional probabilities associated with Income to modify the prior probabilities, and make a prediction. This ability to make predictions from partial information is a significant advantage of Naïve-Bayes.

 

The use of Bayes’ Theorem in computing the scores and posterior probabilities in this way is valid only if we assume a statistical independence between the various independent variables such as debt and income. (Hence the term "naïve".) Despite the fact that this assumption is usually not correct, the algorithm appears to have good results in practice.

 

 

 

Decision Trees

 

Decision trees are one of the most common data mining technique and are by far the most popular in tools aimed at the business user. They are easy to set up, their results are understandable by an end-user, they can address a wide range of classification problems, they are robust in the face of different data distributions and formats, and they are effective in analysing large numbers of fields.

 

 

next    previous


©2005 Jatit