Introduction to Document Clustering

To classify a new case, the algorithm computes the distance from the new case to each case (row) in the training data. The new case is predicted to have the same outcome as the predominant outcome in the k closest cases in the training data.

Consider a bank that seeks to minimise loan defaults. Loan officers must be able to identify potential credit risks during the loan approval cycle. The problem is one of simple classification: to predict whether or not an applicant will be a good or poor credit risk. The following table shows the training dataset for this problem. For simplicity the size has been limited to five records – a very small dataset.

Name	Debt	Income	Married?	Risk
Peter	High	High	Yes	Good
Sue	Low	High	Yes	Good
John	Low	High	No	Poor
Mary	High	Low	Yes	Poor
Fred	Low	Low	Yes	Poor

This dataset contains information about people to whom the institution previously loaned money. The lender determined if each applicant was a good or poor credit risk. Because Risk is what we wish to predict, it is the dependent column, also called the target variable. The other columns used to build the model are known as independent columns. In this example they include debt level (High or Low), income level (High or Low), and marital status (Yes or No). The Name column will be ignored because it is highly unlikely that a person’s name affects his credit risk. In our example, all the columns, except Name, have two possible values. The restriction to two values is only to keep the example simple.

We will use k=3, or 3-NN. For our distance measure we will use a simple metric that is computed by summing scores for each of the three independent columns, where the score for a column is 0 if the values in the two instances are the same, and 1 if they differ.

We can now compute the distance between Peter and Sue. The three column scores for Peter and Sue are 1, 0, and 0 because they have different values only in the Debt column. The distance between Peter and Sue – the sum of these scores – is equal to 1. If two cases have all values in common, their distance is zero.

	Peter	Sue	John	Mary	Fred
Peter	0	1	2	1	2
Sue	1	0	1	2	1
John	2	1	0	3	2
Mary	1	2	3	0	1
Fred	2	1	2	1	0

This table summarises all the distances between all the records in our training dataset. Note that this matrix is symmetrical; only the numbers below (or above) the diagonal of zeroes are needed. However, presenting it this way makes it easy to see all of the distances for any one case (row) at a glance.

Document Clustering, classification and Data Mining