Document Clustering, Classification and Data Mining

The second of these subsets (low income) consists of 100% poor risks. Since it is totally homogeneous (and has an entropy of 0), there is no more work to be done on that branch. But the first branch is a mix of good and poor risks. It has an entropy of 0.91830 and needs to be split by a second independent variable. Should it be debt or marital status?

Consider, first, splitting the high income set by debt. The following is a cross-tabulation of the high income cases in the training set by debt and by risk:

	Good Risk	Poor Risk	Total
High Debt	1	0	1
Low Debt	1	1	2
Total	2	1	3

The subset with high debt has one good risk and no poor risk. It is completely homogeneous and has an entropy of 0.

The subset with low debt has one good risk and one poor risk. It is split down the middle and has an entropy of 1.
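The entropy figures quoted so far can be checked with a few lines of Python. This is only a minimal sketch: the function name entropy and the choice to pass raw class counts as [good, poor] are assumptions made for illustration, not part of the original notes.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    # Classes with a zero count are skipped (0 * log 0 is treated as 0).
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Counts are [good risks, poor risks], taken from the cross-tabulation above.
print(f"{entropy([2, 1]):.5f}")  # high income set: 0.91830
print(f"{entropy([1, 0]):.5f}")  # high debt subset: 0.00000
print(f"{entropy([1, 1]):.5f}")  # low debt subset: 1.00000
```

A homogeneous subset always gives 0, and an even two-way split always gives 1, whatever the subset size.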

Since the three high income cases comprise one high debt and two low debts, the average (or expected) entropy for the two subsets resulting from splitting the high incomes by debt is:

(1/3)*0 + (2/3)*1 = 0.66667

In other words, splitting the high incomes by debt reduces the entropy of this set by:

0.91830 - 0.66667 = 0.25163

On the other hand, if we split by marital status, we obtain two completely homogeneous subsets, so the expected entropy is brought down to zero. That is a reduction of the full 0.91830, compared with only 0.25163 for debt, so marital status is the variable to split on.
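The comparison between the two candidate splits can be sketched in Python as follows. The helper names (entropy, expected_entropy, entropy_reduction) are hypothetical. The counts for the debt split come from the cross-tabulation above; the counts for the marital status split are not listed in the text and are inferred here from the statement that both of its subsets are homogeneous.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as raw counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_entropy(subsets):
    """Average entropy of the subsets, weighted by subset size."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * entropy(s) for s in subsets)

def entropy_reduction(parent, subsets):
    """Entropy of the parent set minus the expected entropy after the split."""
    return entropy(parent) - expected_entropy(subsets)

high_income = [2, 1]                # [good, poor] among the high incomes
by_debt = [[1, 0], [1, 1]]          # high debt, low debt (from the table)
by_marital = [[2, 0], [0, 1]]       # assumed: two homogeneous subsets

print(f"{expected_entropy(by_debt):.5f}")                  # 0.66667
print(f"{entropy_reduction(high_income, by_debt):.5f}")     # 0.25163
print(f"{entropy_reduction(high_income, by_marital):.5f}")  # 0.91830
```

The variable with the larger reduction is chosen for the split, which here is marital status.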
