Introduction to Document Clustering

This makes it the most efficient data mining technique. However, Naïve-Bayes does not handle continuous data, so any independent or dependent variables that contain continuous values must be binned or bracketed. For instance, if one of the independent variables is age, the values must be transformed from the specific value into ranges such as "20 years or less," "21 to 30 years," "31 to 40 years" and so on.
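As a minimal sketch of what binning looks like in practice, the Python function below maps a specific age to a range label. The function name `age_bracket`, the exact bracket boundaries, and the sample ages are assumptions made for illustration only.

```python
def age_bracket(age: int) -> str:
    """Map a specific age to a categorical range (boundaries are assumed)."""
    if age <= 20:
        return "20 years or less"
    if age <= 30:
        return "21 to 30 years"
    if age <= 40:
        return "31 to 40 years"
    return "over 40 years"


# Hypothetical ages, converted to categorical values before training.
ages = [18, 25, 37, 52]
brackets = [age_bracket(a) for a in ages]
print(brackets)
```

Once every continuous column has been replaced by labels like these, the data can be counted in the same way as any other categorical column.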

Using Naïve-Bayes for classification is a fairly simple process. During training, the probability of each outcome (dependent variable value) is computed by counting how many times it occurs in the training dataset and dividing by the total number of cases. This is called the prior probability. For example, if the Good Risk outcome occurs 2 times in a total of 5 cases, then the prior probability for Good Risk is 0.4. You can think of the prior probability in the following way: "If I know nothing else about a loan applicant, there is a 0.4 probability that the applicant is a Good Risk." In addition to the prior probabilities, Naïve-Bayes also computes how frequently each independent variable value occurs in combination with each dependent (output) variable value. These frequencies are then used to compute conditional probabilities that are combined with the prior probability to make the predictions.
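The prior probability calculation can be sketched in a few lines of Python. The list of outcomes below is a hypothetical stand-in for the dependent variable column, chosen to match the example of 2 Good Risk cases out of 5.

```python
from collections import Counter

# Hypothetical training outcomes: 2 Good Risk and 3 Poor Risk cases.
outcomes = ["Good", "Poor", "Good", "Poor", "Poor"]

# Count each outcome, then divide by the total number of cases.
outcome_counts = Counter(outcomes)
total_cases = len(outcomes)
priors = {outcome: n / total_cases for outcome, n in outcome_counts.items()}

print(priors["Good"])  # 0.4
```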

Consider the credit risk classification problem whose goal is to be able to predict whether an applicant will be a Good or a Poor Credit Risk. In our example, all the columns (both independent and dependent) are categorical, so we do not have to convert any values to categorical variables by binning (grouping) values.

From the sample data, we cross-tabulate counts of each Risk outcome (Good or Poor) and each value in the independent variable columns. For example, row 3 reports two cases where Income is High and Risk is Good and one case where Income is High and Risk is Poor.

Independent Variable	Value	Count, Good Risk	Count, Poor Risk	Likelihood given Good Risk	Likelihood given Poor Risk
Debt	High	1	1	0.50	0.33
Debt	Low	1	2	0.50	0.67
Income	High	2	1	1.00	0.33
Income	Low	0	2	0.00	0.67
Married	Yes	2	2	1.00	0.67
Married	No	0	1	0.00	0.33
Total by Risk		2	3

We can easily compute the likelihood that one of the independent variables has a particular value, given a known risk level, by using the count and dividing by the "Total by Risk" number (on the bottom row). For example, the likelihood that a Good Risk has High Income is 1.00 (see row 3). That’s because both instances of Good Risk have High Income. In the same way, the likelihood that a Poor Risk has Low Income is 0.67, because two out of three Poor Risks have Low Income (see row 4).
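The division described above can be sketched in Python as follows. The counts are copied from the table; the dictionary layout and the variable names are assumptions made for illustration.

```python
# Counts from the cross-tabulation: (variable, value) -> count per risk level.
counts = {
    ("Debt", "High"): {"Good": 1, "Poor": 1},
    ("Debt", "Low"): {"Good": 1, "Poor": 2},
    ("Income", "High"): {"Good": 2, "Poor": 1},
    ("Income", "Low"): {"Good": 0, "Poor": 2},
    ("Married", "Yes"): {"Good": 2, "Poor": 2},
    ("Married", "No"): {"Good": 0, "Poor": 1},
}

# The "Total by Risk" row.
totals = {"Good": 2, "Poor": 3}

# Likelihood = count for that risk level / total cases with that risk level.
likelihoods = {
    key: {risk: n / totals[risk] for risk, n in by_risk.items()}
    for key, by_risk in counts.items()
}

print(round(likelihoods[("Income", "High")]["Good"], 2))  # 1.0
print(round(likelihoods[("Income", "Low")]["Poor"], 2))   # 0.67
```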

The bottom row is also used to compute the prior probabilities for Good and Poor Risk. In our data the prior probability for Good is 0.40 (two of five cases) and the prior probability for Poor is 0.60 (three of five cases).
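To illustrate how the priors and likelihoods might be combined, here is a minimal sketch that assumes the usual Naïve-Bayes product rule: multiply the prior by the likelihood of each observed value. The text above says only that these quantities are combined, so treat the exact formula, the `score` function, and the sample applicant as assumptions.

```python
from math import prod

# Counts from the cross-tabulation: (variable, value) -> count per risk level.
counts = {
    ("Debt", "High"): {"Good": 1, "Poor": 1},
    ("Debt", "Low"): {"Good": 1, "Poor": 2},
    ("Income", "High"): {"Good": 2, "Poor": 1},
    ("Income", "Low"): {"Good": 0, "Poor": 2},
    ("Married", "Yes"): {"Good": 2, "Poor": 2},
    ("Married", "No"): {"Good": 0, "Poor": 1},
}
totals = {"Good": 2, "Poor": 3}


def score(case: dict, risk: str) -> float:
    """Prior times the product of likelihoods (assumed combination rule)."""
    prior = totals[risk] / sum(totals.values())
    likelihoods = [counts[(var, val)][risk] / totals[risk] for var, val in case.items()]
    return prior * prod(likelihoods)


# Hypothetical applicant, not taken from the sample data.
applicant = {"Debt": "Low", "Income": "High", "Married": "Yes"}
scores = {risk: score(applicant, risk) for risk in totals}
print(scores)
```

For this hypothetical applicant the Good Risk score is 0.4 × 0.5 × 1.0 × 1.0 = 0.2, which is higher than the Poor Risk score of roughly 0.09. Under this product rule, a likelihood of 0 (such as Low Income given Good Risk) forces the whole score to 0.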

Given a particular case, we can compute a score, related to posterior probability, for both values of
