Introduction to Document Clustering

Now we are ready to apply the k-NN technique to see how it classifies our training data. Remember that we chose k=3, so we are interested in the three neighbours nearest to Peter. The distances in column 1 show that Peter, Sue, and Mary are Peter’s three nearest neighbours because they have the lowest distance scores.

But how can Peter be his own neighbour? Isn’t this somehow invalid, to use the information we have about Peter to predict an outcome for Peter? Well, in one sense it is, but this is absolutely no different than for any other model when it is tested with the training data. This is why we always need to use a separate test dataset to validate a model.

The Risk values for Peter’s three nearest neighbours (Peter himself, Sue, and Mary) are Good, Good, and Poor, respectively. The predicted Risk for Peter is the value that is most frequent among the k neighbours, or Good, in this case.

Who are Sue’s nearest 3 neighbours? Clearly Sue herself, but what about Peter, John and Fred, who are all the same distance (1) from Sue? We could include all three, exclude all three, or include all three with a proportionate vote (2/3 each in this case). The decision is entirely up to the implementers of the algorithm. For our example, we’ll use 2/3 vote each, resulting in votes of Good (Sue herself), 2/3 Good, 2/3 Poor, and 2/3 Poor, for a consensus of Good.

The following table enumerates the predictions from the 3-NN algorithm. The accuracy for 3-NN on the training set is 80 percent. Results from using 2-NN in this case would be exactly the same.

Name	Debt	Income	Married?	Risk	3-NN Prediction
Peter	High	High	Yes	Good	Good
Sue	Low	High	Yes	Good	Good
John	Low	High	No	Poor	-
Mary	High	Low	Yes	Poor	Poor
Fred	Low	Low	Yes	Poor	Poor

Since both the value of k and the distance metric have tremendous impact on the quality of the model it is important to choose them carefully.

What is a good value for k? The 1-NN algorithm tries to find one case in the training set that most closely matches the case we are trying to predict. When it is found, the predicted value is assumed to be the outcome of the matching case. For 1-NN accuracy on the training set will be 100 percent by definition. However, 1-NN, because it looks for only the one best match, is extremely susceptible to noise and will never reflect any kind of pattern in the data. Many commercial algorithms start with a default k value of 10. The best value for k will require some experimentation with the data at hand, comparing results for different k values against a test dataset.

k-NN does not make a learning pass through the data. Unfortunately, this does not make k-NN an especially efficient algorithm, particularly with large training datasets. That’s because all the processing is postponed until predictions are being made. Then each new case being predicted requires a complete pass through the training data to compute the distance between the target instance and each instance in the training set. The algorithm keeps track of the k nearest instances as it proceeds through the training set, predicting an outcome when the pass is complete.

Document Clustering, classification and Data Mining