Introduction to Document Clustering

risk level simply by multiplying the prior probability for risk level by all the likelihood figures from the above table for the independent variables (Bayes’ Theorem). The highest scoring value becomes the predicted value.

Name	Debt	Income	Married?	Risk Actual	Good Risk Score	Poor Risk Score	Risk Predicted
Peter	High	High	Yes	Good	0.2	0.04	Good
Sue	Low	High	Yes	Good	0.2	0.09	Good
John	Low	High	No	Poor	0	0.04	Poor
Mary	High	Low	Yes	Poor	0	0.09	Poor
Fred	Low	Low	Yes	Poor	0	0.18	Poor

For example, in the above table the first row for Peter in the training set has High Debt, High Income and Married is Yes. In the first table the likelihoods associated with these values and Good Risk are 0.5, 1.0 and 1.0 respectively (see rows 1, 3 and 5). The product of these three numbers and the prior probability for Good (0.40) is 0.20 (0.50 x 1 x 1 x 0.40). For Poor Risk the probabilities (also from rows 1, 3 and 5) are 0.33, 0.33, and 0.67. With a prior probability for Poor Risk of 0.60, the score for Poor Risk is 0.044 (0.33 x 0.33 x 0.67 x 0.60). Because the score for Good is higher, we predict that Peter will be a Good risk.

The second table presents the actual risk, the scores for Good risk and Poor risk, and the predicted risk for all cases in the sample data. With all outcomes predicted correctly, the algorithm has 100 percent accuracy on the training set. Although this is a good sign, we must also validate the model by using it to predict the outcomes for separate test data, but we will skip that step.

The scores that we computed can also be easily converted to real posterior probabilities, by dividing each score by the sum of all scores for that case. For example, the posterior probability that Peter is a Good Risk is approximately 82 percent (0.2 divided by 0.244, which is the sum of the Good Risk value of 0.2 and the Poor Risk value of 0.044.). The posterior probability that Peter is a Poor Risk is 18 percent.

Note that we don’t need to know values for all the independent variables to make a prediction. In fact, if we know none of them, we can still predict using just the prior probability. If we know only the value for Income, we can use the conditional probabilities associated with Income to modify the prior probabilities, and make a prediction. This ability to make predictions from partial information is a significant advantage of Naïve-Bayes.

The use of Bayes’ Theorem in computing the scores and posterior probabilities in this way is valid only if we assume a statistical independence between the various independent variables such as debt and income. (Hence the term "naïve".) Despite the fact that this assumption is usually not correct, the algorithm appears to have good results in practice.

Decision Trees

Decision trees are one of the most common data mining technique and are by far the most popular in tools aimed at the business user. They are easy to set up, their results are understandable by an end-user, they can address a wide range of classification problems, they are robust in the face of different data distributions and formats, and they are effective in analysing large numbers of fields.

Document Clustering, classification and Data Mining

Decision Trees