Document Clustering, Classification and Data Mining

 

 

             Good Risk   Poor Risk   Total
High Debt        1           1         2
Low Debt         1           2         3
Total            2           3         5

 

The subset with high debt has one good risk and one poor risk, and so:

 

                  entropy =   -(0.5*log2(0.5) + 0.5*log2(0.5))
                          =   -(0.5*(-1) + 0.5*(-1))
                          =   -(-0.5 - 0.5)
                          =   1.00000

 

as we would expect for a set that is split down the middle.

 

The subset with low debt has one good risk and two poor risks, and so:

 

                  entropy =   -((1/3)*log2(1/3) + (2/3)*log2(2/3))
                          =   -((1/3)*(-1.58496) + (2/3)*(-0.58496))
                          =   -(-(0.52832) - (0.38998))
                          =   0.91830
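The two subset entropies above can be checked with a short Python sketch. The helper name `entropy` and the choice to pass raw class counts are illustrative assumptions, not part of the original text; the base-2 logarithm follows from the values used in the hand calculation.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts.

    Hypothetical helper for illustration only.
    """
    total = sum(counts)
    # Classes with a zero count are skipped: they contribute nothing to the sum.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

high_debt = entropy([1, 1])  # one good risk, one poor risk
low_debt = entropy([1, 2])   # one good risk, two poor risks

print(f"{high_debt:.5f}")  # 1.00000
print(f"{low_debt:.5f}")   # 0.91830
```

An evenly split subset gives the maximum entropy of 1 bit for two classes, while the less balanced low-debt subset comes out slightly lower.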

 

Since the training set contains two high-debt cases and three low-debt cases, the average (or expected) entropy of the two subsets produced by splitting by debt is:

 

                        (2/5)*1 + (3/5)*0.91830 = 0.95098

 

In other words, splitting by debt reduces the entropy by:

 

                        0.97095 - 0.95098 = 0.01997
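As a rough check on the arithmetic, the whole calculation for the debt split can be sketched in Python. The function names `entropy` and `information_gain`, and the representation of each subset as a `[good, poor]` pair of counts, are assumptions made for this example only.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(subsets):
    """Entropy reduction from a split.

    Each subset is a list of class counts, here [good_risk, poor_risk].
    Hypothetical helper for illustration only.
    """
    # Class totals of the unsplit training set are the column sums.
    parent = [sum(column) for column in zip(*subsets)]
    n = sum(parent)
    # Weight each subset's entropy by its share of the training set.
    expected = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - expected

# Rows of the table: high debt (1 good, 1 poor), low debt (1 good, 2 poor).
debt_split = [[1, 1], [1, 2]]
print(f"{information_gain(debt_split):.5f}")  # 0.01997
```

The result agrees with the hand calculation up to rounding in the last digit. The same function could be applied to the income and marital status splits, given their count tables.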

 

Similar calculations show that splitting the training set by income reduces the entropy by 0.41998, whilst splitting by marital status reduces the entropy by 0.17095.

 

So splitting by income is the most effective way of reducing the entropy in the training data set, and therefore produces subsets that are as homogeneous as possible.

 

 

 



©2005 Jatit