Document Clustering, Classification and Data Mining

The two middle nodes (D and E) are the hidden nodes and constitute a single hidden layer. The number of hidden nodes and, for that matter, the number of hidden layers are set at the user’s discretion. The number of hidden nodes often increases with the number of inputs and the complexity of the problem. Too many hidden nodes can lead to overfitting (where the network has become so specialised that it fits the training set perfectly but performs poorly on new cases), and too few hidden nodes can result in models with poor accuracy. Finding an appropriate number of hidden nodes is an important part of any data mining effort with neural nets. The numbers of input, hidden, and output nodes are referred to as the neural net topology or the network architecture.

The diagram shows weights on the arrows between the nodes. Typically, there are no weights on the arrows coming into the input layer or coming out of the output layer. The values of the other weights are determined during the neural net training or learning process. Note that weights can be either positive or negative. Neural net algorithms usually restrict weights to a narrow range, such as between plus and minus 1 or between plus and minus 10. Although weights are typically real numbers, we have used integers to simplify our calculations.

The heart of the neural net algorithm is a series of mathematical operations that use the weights to compute a weighted sum of the inputs at each node. Each node also has a squashing function that converts this weighted sum to an output value. For our neural net we will use a very simple squashing function:

if the weighted sum of the inputs is greater than zero, the output is 1, otherwise the output is 0.
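
Written in symbols, the rule above might look as follows. The notation is an assumption introduced here for illustration and does not come from the original: x_i are the inputs arriving at a node, w_i are the weights on the corresponding arrows, s is the weighted sum, and y is the node's output.

```latex
s = \sum_{i=1}^{n} w_i x_i, \qquad
y = \begin{cases} 1 & \text{if } s > 0 \\ 0 & \text{otherwise} \end{cases}
```

Note that a weighted sum of exactly zero gives an output of 0 under this rule, since the comparison is strict.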

Equations for the output values at nodes D, E and F can now be written as follows:

D = If (A + 2B - C) > 0 Then 1 Else 0

E = If (-2A + 2B - 5C) > 0 Then 1 Else 0

F = If (D - 2E) > 0 Then 1 Else 0
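
The three equations can be sketched in Python as a single forward pass through the network. This is a minimal illustration rather than a definitive implementation: the function names step and forward are hypothetical, and the input values in the usage lines are made up for demonstration and are not rows from the sample data.

```python
def step(weighted_sum):
    """Squashing function from the text: 1 if the weighted sum is > 0, else 0."""
    return 1 if weighted_sum > 0 else 0


def forward(a, b, c):
    """Return the outputs (D, E, F) for inputs A, B, C.

    The weights are the integer weights that appear in the equations
    for nodes D, E and F.
    """
    d = step(a + 2 * b - c)           # hidden node D
    e = step(-2 * a + 2 * b - 5 * c)  # hidden node E
    f = step(d - 2 * e)               # output node F
    return d, e, f


# Illustrative inputs only (assumed values, not taken from the sample data).
print(forward(1, 1, 0))  # (1, 0, 1)
print(forward(0, 1, 0))  # (1, 1, 0)
```

In the first call node E receives a weighted sum of exactly zero, so it outputs 0, and F then fires because D - 2E = 1 is greater than zero.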

The following table shows the sample data, with the three independent variables (Debt, Income, and Married) converted to numbers, the actual risk, and the computed values for nodes D, E, and F.
