Document Clustering, classification and Data Mining

While the nearest neighbour technique is simple in concept, the selection of k and the choice of distance metrics pose definite challenges. The absence of any real training process distinguishes the nearest neighbour technique from the other predictive methods. Both k and the distance metrics need to be set, and both will require some experimentation. The model builder will need to build multiple models, validating each with independent test sets to determine which values are appropriate.

Neural Networks

A key difference between neural networks and many other techniques is that neural nets only operate directly on numbers. As a result, any nonnumeric data in either the independent or dependent (output) columns must be converted to numbers before we can use the data with a neural net.

The following table shows our sample credit risk data with all two-valued categorical variables converted into values of either 0 or 1. High Debt and Income, Married=Yes, and Good Risk were all replaced by the value 1. Low Debt and Income, Married=No, and Poor Risk were replaced by 0. No conversion is necessary for the Name column because it is not used as an independent or dependent column.

Name	Debt	Income	Married?	Risk
Peter	1	1	1	1
Sue	0	1	1	1
John	0	1	0	0
Mary	1	0	1	0
Fred	0	0	1	0

Consider, first, a trained neural net that can be used to predict Good and Poor risks for our credit risk classification problem. This network contains six nodes, which we have marked A through F. The left-hand nodes (A, B and C) are input nodes and constitute the input layer. The input nodes correspond to the independent variable columns in the credit risk problem (Debt, Income, and Married).

The right-hand node (F) is the output node and makes up the output layer. In this case there is only one output node, corresponding to Risk, the dependent column. But neural network techniques in general do not restrict the number of output columns. That there can be multiple outputs representing multiple simultaneous predictions is one way that neural nets differ from most other predictive techniques.

next previous