Document Clustering, Classification and Data Mining

· provided by Dun & Bradstreet can yield a prioritized list of prospects by region.

· A large consumer packaged goods company can apply data mining to improve its sales process to retailers. Data from consumer panels, shipments, and competitor activity can be applied to understand the reasons for brand and store switching. Through this analysis, the manufacturer can select promotional strategies that best reach its target customer segments.

Operations

An application that uses data mining technology will implement one or more data mining operations (sometimes referred to as data mining ‘tasks’). Each operation reflects a different way of distinguishing patterns or trends in a complex data set.

Classification and prediction

Classification is the operation most commonly supported by commercial data mining tools. It is an operation that enables organizations to discover patterns in large or complex data sets in order to solve specific business problems.

Classification is the process of sub-dividing a data set with regard to a number of specific outcomes. For example, we might want to classify our customers into ‘high’ and ‘low’ categories with regard to credit risk. The category or ‘class’ into which each customer is placed is the ‘outcome’ of our classification.

A crude method would be to classify customers by whether their income is above or below a certain amount. A slightly more subtle approach tries to find a linear relationship between two different factors, such as income and age, to divide a data set into two groups. Real-world classification problems usually involve many more dimensions and therefore require a much more complex delimitation between different classes.
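The two simple approaches described above can be sketched in a few lines of Python. This is only an illustration: the customer records, the income cut-off, and the weights of the linear rule are all invented for the example and are not taken from any real credit model.

```python
# Hypothetical customers as (income, age) pairs; all values are invented.
customers = [
    (18000, 23),
    (52000, 41),
    (30000, 60),
    (75000, 35),
]

INCOME_THRESHOLD = 40000  # assumed cut-off, chosen only for illustration


def classify_by_income(income):
    """Crude rule: a single factor compared with a single threshold."""
    return "low" if income >= INCOME_THRESHOLD else "high"


# Assumed weights for a linear boundary over two factors (income and age).
W_INCOME, W_AGE, BIAS = 1.0, 500.0, -55000.0


def classify_linear(income, age):
    """Linear rule: the sign of a weighted sum of both factors decides the class."""
    score = W_INCOME * income + W_AGE * age + BIAS
    return "low" if score >= 0 else "high"


# Credit-risk class for each customer under each rule.
threshold_labels = [classify_by_income(income) for income, _ in customers]
linear_labels = [classify_linear(income, age) for income, age in customers]
```

With these assumed numbers the two rules disagree on the third customer: the income threshold alone places that customer in the 'high' risk class, while the linear rule, which also takes age into account, places the same customer in the 'low' risk class.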

An example of classification

A financial services organization wishes to identify those customers likely to be interested in a new investment opportunity. It has sold a similar product before and has historical data showing which of its customers responded to the previous offer. The aim is to understand which factors identify likely responders to the offer, so that the marketing and sales effort can be targeted more efficiently.

There is a field in the customer record that is set to true or false, depending on whether a customer did or did not respond to the offer. This field is called the ‘target field’ or ‘dependent variable’ for the classification. The aim is to analyse the way other attributes of the customer (such as level of income, type of job, age, sex, marital status, number of years as a customer, and other types of investments or products purchased) influence the class to which they belong (as indicated by the target field). This information will usually be stored in other fields in the customer record. The various fields included in the analysis are called ‘independent’ or ‘predictor’ fields or variables.
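As a rough sketch of how such records might be prepared for analysis, the Python below separates the target field from the predictor fields. The field names (such as responded and years_as_customer) and the record values are hypothetical and chosen only for this illustration.

```python
# Hypothetical customer records; field names and values are invented.
records = [
    {"income": 62000, "age": 45, "years_as_customer": 10, "responded": True},
    {"income": 28000, "age": 31, "years_as_customer": 2, "responded": False},
    {"income": 54000, "age": 52, "years_as_customer": 7, "responded": True},
    {"income": 33000, "age": 27, "years_as_customer": 1, "responded": False},
]

TARGET_FIELD = "responded"  # the 'target field' or 'dependent variable'


def split_record(record):
    """Return (predictor fields, target value) for one customer record."""
    predictors = {name: value for name, value in record.items() if name != TARGET_FIELD}
    return predictors, record[TARGET_FIELD]


predictor_rows = []  # one dict of 'independent' fields per customer
targets = []         # the class each customer belongs to
for record in records:
    predictors, target = split_record(record)
    predictor_rows.append(predictors)
    targets.append(target)

predictor_fields = sorted(predictor_rows[0])
```

After the split, targets holds the outcome of the earlier offer for each customer, and predictor_rows holds only the fields whose influence on that outcome is to be analysed.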

Techniques for classification

How the data mining tool analyses this data, and the type of information it provides, depends on the techniques it uses. The most common techniques for classification are decision trees and neural networks.
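To give a flavour of the decision tree idea, the sketch below finds a single split on one predictor field, which is the smallest possible tree (one test, two leaves). It is a simplification under stated assumptions: the data is invented, the function name best_split is hypothetical, and the split is scored by a plain count of misclassified records, whereas real decision tree tools typically use an impurity measure and apply the split recursively over many fields.

```python
def best_split(values, labels):
    """Find the threshold for the rule 'predict True when value >= threshold'
    that misclassifies the fewest records. Returns (threshold, errors)."""
    best = None
    for threshold in sorted(set(values)):
        # Count records whose predicted class differs from the actual class.
        errors = sum((value >= threshold) != label
                     for value, label in zip(values, labels))
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


# Hypothetical data: customer incomes and whether each responded to the offer.
incomes = [62000, 28000, 54000, 33000, 47000]
responded = [True, False, True, False, False]

threshold, errors = best_split(incomes, responded)
```

On this invented data the chosen rule is "income of at least 54000", which separates responders from non-responders without error. Real data is rarely this clean, which is why practical trees combine many such tests.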
