Document Clustering, Classification and Data Mining

	Good Risk	Poor Risk	Total
High Debt	1	1	2
Low Debt	1	2	3
Total	2	3	5

The subset with high debt has one good risk and one poor risk, and so:

entropy

= -(0.5*log2(0.5) + 0.5*log2(0.5))

= -(0.5*(-1) + 0.5*(-1))

= -(-0.5 - 0.5)

= 1.00000

as we would expect for a set that is split down the middle.

The subset with low debt has one good risk and two poor risks, and so:

entropy

= -((1/3)*log2(1/3) + (2/3)*log2(2/3))

= -((1/3)*(-1.58496) + (2/3)*(-0.58496))

= -(-0.52832 - 0.38998)

= 0.91830
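The two subset entropies above can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes: the function name `entropy` and the convention of describing each subset by its list of class counts (good risks, poor risks) are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Base-2 entropy of a set described by its class counts (hypothetical helper)."""
    total = sum(counts)
    # Skip empty classes: the term 0 * log2(0) is taken to be 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

high_debt = entropy([1, 1])  # one good risk, one poor risk
low_debt = entropy([1, 2])   # one good risk, two poor risks

print(f"{high_debt:.5f}")  # 1.00000
print(f"{low_debt:.5f}")   # 0.91830
```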

Since there are two high-debt cases and three low-debt cases altogether, the average (or expected) entropy of the two subsets produced by splitting by debt is:

(2/5)*1 + (3/5)*0.91830 = 0.95098
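The weighted average can be sketched in Python as follows. The names `entropy` and `expected_entropy` are illustrative assumptions, and each subset is again represented by its class counts taken from the table (good risks, poor risks).

```python
import math

def entropy(counts):
    """Base-2 entropy of a set described by its class counts (hypothetical helper)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def expected_entropy(subsets):
    """Average entropy of the subsets, each weighted by its share of the cases."""
    total = sum(sum(s) for s in subsets)
    return sum((sum(s) / total) * entropy(s) for s in subsets)

# Rows of the table: high debt (1 good, 1 poor), low debt (1 good, 2 poor).
debt_split = [[1, 1], [1, 2]]

print(f"{expected_entropy(debt_split):.5f}")  # 0.95098
```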

In other words, splitting by debt reduces the entropy of the whole training set (two good risks and three poor risks, entropy 0.97095) by:

0.97095 - 0.95098 = 0.01997
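The reduction in entropy (often called the information gain) can be computed directly from the table. The sketch below is an illustration under assumptions: the function names are hypothetical, and the class counts of the whole set are recovered by summing the columns of the split.

```python
import math

def entropy(counts):
    """Base-2 entropy of a set described by its class counts (hypothetical helper)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(subsets):
    """Entropy of the whole set minus the expected entropy after the split."""
    # Column sums give the class counts of the unsplit set, here [2, 3].
    parent = [sum(col) for col in zip(*subsets)]
    total = sum(parent)
    after = sum((sum(s) / total) * entropy(s) for s in subsets)
    return entropy(parent) - after

debt_split = [[1, 1], [1, 2]]  # high debt, low debt: (good, poor) counts

print(f"{information_gain(debt_split):.5f}")  # 0.01997
```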

Similar calculations show that splitting the training set by income reduces the entropy by 0.41998, whilst splitting by marital status reduces the entropy by 0.17095.

So, splitting by income is the most effective way of reducing the entropy in the training data set, and thereby of producing subsets that are as homogeneous as possible.
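Choosing the attribute to split on then amounts to picking the largest reduction. In this sketch the figure for debt is recomputed from the table, while the figures for income and marital status are simply the values quoted in the text, since their tables are not reproduced here. All names are illustrative assumptions.

```python
import math

def entropy(counts):
    """Base-2 entropy of a set described by its class counts (hypothetical helper)."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Reduction for debt, recomputed from the table: whole set minus weighted subsets.
debt_gain = entropy([2, 3]) - ((2 / 5) * entropy([1, 1]) + (3 / 5) * entropy([1, 2]))

gains = {
    "debt": debt_gain,
    "income": 0.41998,          # value quoted in the text, not recomputed
    "marital status": 0.17095,  # value quoted in the text, not recomputed
}

# The attribute with the largest entropy reduction is chosen for the split.
best = max(gains, key=gains.get)
print(best)  # income
```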
