Document Clustering, classification and Data Mining

Forecasting

Classification identifies a specific group or class to which an item belongs. A prediction based on a classification model will, therefore, be a discrete outcome, identifying a customer as a responder or non-responder, or a patient as having a high or low risk of heart failure. Forecasting, in contrast, concerns the prediction of continuous values such as share values, the level of the stock market, or the future price of a commodity such as oil.

Forecasting is often done with regression functions - statistical methods for examining the relationship between variables in order to predict a future value. Statistical packages, such as SAS and SPSS, provide a wide variety of such functions that can handle increasingly complex problems. However, such statistical functions usually require significant knowledge of the techniques used and the pre-conditions that apply to their implementation. Data mining tools can also provide forecasting functions. In particular, neural networks have been widely used for stock market forecasting.

An important distinction can be made between two different types of forecasting problem.

The simpler problem is that of forecasting a single continuous value based on a series of unordered examples. For example, predicting a person’s in-come based on various personal details. Many data mining tools can provide this form of forecasting using, for example, neural networks or, in some cases, decision trees.

A more complex problem is to predict one or more values based on a sequential pattern, such as the stock market level for the next 30 days based on figures for the previous six months. Fewer data mining tools support this form of forecasting. The limited support for extended time-series forecasting partly reflects the increased algorithmic complexity of the problem, and partly the need to prepare and present the data to the data mining tool in the right way and to provide output in the necessary format. Where it is supported, it usually requires analysts to do much more pre-processing of the data and post-processing of the results.

Techniques

A data mining operation is achieved using one of a number of techniques or methods. Each technique can itself be implemented in different ways, using a variety of algorithms.

Clustering Algorithms

Cluster analysis is the process of identifying the relationships that exist between items on the basis of their similarity and dissimilarity. Unlike classification, clustering does not require a target variable to be identified beforehand. A clustering algorithm takes an unbiased look at the potential groupings within a data set and attempts to derive an optimum delineation of items on the basis of those groups.

next previous