| | Good Risk | Poor Risk | Total |
|---|---|---|---|
| High Debt | 1 | 1 | 2 |
| Low Debt | 1 | 2 | 3 |
| Total | 2 | 3 | 5 |

The subset with high debt has one good risk and one poor risk, and so:
entropy
= -(0.5*log2(0.5) + 0.5*log2(0.5))
= -(0.5*(-1) + 0.5*(-1))
= -(-0.5 -0.5)
= 1.00000
as we would expect for a set that is split down the middle.
The subset with low debt has one good risk and two poor risks, and so:
entropy
= -((1/3)*log2(1/3) + (2/3)*log2(2/3))
= -((1/3)*(-1.58496) + (2/3)*(-0.58496))
= -(-(0.52832) -(0.38998))
= 0.91830
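The two subset entropies above can be checked with a few lines of Python. This is only a sketch: the helper name `entropy` and the convention of passing raw class counts are assumptions made for illustration, not something taken from the article.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Empty classes are skipped, since p*log2(p) tends to 0 as p tends to 0.
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

high_debt = entropy([1, 1])  # one good risk, one poor risk
low_debt = entropy([1, 2])   # one good risk, two poor risks

print(f"{high_debt:.5f}")  # 1.00000
print(f"{low_debt:.5f}")   # 0.91830
```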
Since the training set contains two high-debt and three low-debt examples, the average (or expected) entropy of the two subsets produced by splitting on debt, with each subset weighted by its share of the examples, is:
(2/5)*1 + (3/5)*0.91830 = 0.95098
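The weighted average can be written as a small function. The sketch below assumes the subset entropies are already known and simply reuses the figures quoted in the text; the name `expected_entropy` is hypothetical.

```python
def expected_entropy(subsets):
    """Size-weighted average entropy; subsets is a list of (size, entropy) pairs."""
    total = sum(size for size, _ in subsets)
    return sum(size / total * h for size, h in subsets)

# (subset size, subset entropy) for high debt and low debt, as given in the text
after_split = expected_entropy([(2, 1.0), (3, 0.91830)])
print(f"{after_split:.5f}")  # 0.95098
```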
In other words, splitting by debt reduces the entropy by:
0.97095 - 0.95098 = 0.01997
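The whole calculation, from the contingency table to the reduction in entropy, might be sketched as follows. The function name `information_gain` and the dictionary layout (attribute value mapped to its good-risk and poor-risk counts) are assumptions chosen for this illustration.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(table):
    """table maps each attribute value to its [good risk, poor risk] counts."""
    # Column sums give the class counts of the whole training set.
    class_totals = [sum(col) for col in zip(*table.values())]
    n = sum(class_totals)
    before = entropy(class_totals)
    # Expected entropy after the split, weighting each subset by its size.
    after = sum(sum(row) / n * entropy(row) for row in table.values())
    return before - after

debt = {"high": [1, 1], "low": [1, 2]}
gain = information_gain(debt)
print(f"{gain:.5f}")
```

Because the function keeps full floating-point precision throughout, its last printed digit can differ from a hand calculation that rounds each intermediate value.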
Similar calculations show that splitting the training set by income reduces the entropy by 0.41998, whilst splitting by marital status reduces the entropy by 0.17095.
So, splitting by income is the most effective way of reducing the entropy in the training data set, and thereby of producing subsets that are as homogeneous as possible:
©2005 Jatit