
Thus, many database algorithms can be employed to assist mining processes,
especially in the data understanding and preparation phases. Examples of such
algorithms are data generalization, data normalization, missing data detection
and correction, data aggregation, data transformation, attribute-oriented
induction, and fractal and online analytical processing (OLAP).
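
To illustrate two of the preparation steps named above, the following minimal sketch fills in missing values and then normalizes an attribute. The function names, the mean-imputation strategy, the min-max scaling and the sample data are assumptions chosen for illustration. They are not prescribed by the text.

```python
from statistics import mean


def fill_missing(values):
    """Missing data correction: replace None with the mean of observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Data normalization: rescale values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute (customer age) with one missing entry.
ages = [20, None, 40, 30]
cleaned = fill_missing(ages)
scaled = min_max_normalize(cleaned)
print(cleaned, scaled)
```

Other strategies, such as median imputation or z-score normalization, would follow the same pattern.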

4.4.2. Statistical algorithms

The boundary between statistics and data mining is blurred, as almost all data
mining techniques are derived from the field of statistics. This means that
statistics can be used in almost all data mining processes, including data
selection, problem solving, result presentation and result evaluation.

Statistical techniques that can be deployed in data mining processes include
mean, median, variance, standard deviation, probability, confidence interval,
correlation coefficient, non-linear regression, chi-square, Bayes' theorem and
Fourier transforms.
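
Several of the descriptive measures listed above can be computed with a few lines of code. The sketch below uses Python's standard `statistics` module together with a hand-written Pearson correlation. The sample values and variable names are hypothetical, and population (rather than sample) variance is an assumption made for simplicity.

```python
from math import sqrt
from statistics import mean, median, pstdev, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical sample: spend and store visits for eight customers.
spend = [2, 4, 4, 4, 5, 5, 7, 9]
visits = [1, 2, 2, 3, 3, 3, 4, 5]

summary = {
    "mean": mean(spend),
    "median": median(spend),
    "variance": pvariance(spend),  # population variance
    "std_dev": pstdev(spend),      # population standard deviation
    "correlation": pearson(spend, visits),
}
print(summary)
```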

4.4.3. Artificial Intelligence

Artificial intelligence (AI) is the scientific field that seeks ways to produce
intelligent behavior in a machine. It can be said that artificial intelligence
techniques are the most widely used in the mining process. Some statisticians
even think of a data mining tool as an artificial statistical intelligence. The
capability of learning is the benefit of artificial intelligence that is most
appreciated in the data mining field.

Artificial intelligence techniques used in data mining processes include neural
networks, pattern recognition, rule discovery, machine learning, case-based
reasoning, intelligent agents, decision tree induction, fuzzy logic, genetic
algorithms, brute-force algorithms and expert systems.

4.4.4. Visualization

Visualization techniques are commonly used to present multidimensional data
sets in various formats for analysis purposes. They can be viewed as
higher-level presentation techniques that allow users to explore complex
multidimensional data in a simpler way. Generally, they require human effort to
analyze and assess the results from their interactive displays. Techniques
include audio, tabular displays, scatter-plot matrices, clustered and stacked
charts, 3-D charts, hierarchical projection, graph-based techniques and dynamic
presentation.

Separating data mining from data warehousing, online analytical processing
(OLAP) and statistics is intricate. One thing that is certain is that data
mining is none of them. The difference between a data warehouse and data mining
is quite clear. Although some textbooks on data warehousing devote a few pages
to data mining, this does not mean that they treat data mining as a part of
data warehousing. Instead, they agree that while a data warehouse is a place to
store data, data mining is a tool to distil the value of such data. Examples of
such textbooks are “Data Management” (McFadden, Hoffer & Prescott, 1999) and
“Database Systems: A Practical Approach to Design, Implementation, and
Management” (Connolly, Begg & Strachan, 1999).

One might argue that the value of data could be realized by using OLAP, as
claimed in many data warehouse textbooks. OLAP, however, can be thought of as
another presentation tool that reforms and recompiles the same set of data in
order to help users find such value more easily. It requires human intervention
both in stating the presentation requirements and in interpreting the results.
Data mining, on the other hand, uses automated techniques to do those jobs.

As mentioned above, the differentiation between data mining and statistics is
much more complicated. It is accepted that the algorithms underlying data
mining tools and techniques are, more or less, derived from statistics. In
general, however, statistical tools are not designed for dealing with enormous
amounts of data, whereas data mining tools are. Moreover, the target users of
statistical tools are statisticians, while data mining is designed for business
people. This simply means that data mining tools are enhancements of
statistical tools that blend many statistical algorithms together and offer the
capability of handling more data in an automated manner as well as a
user-friendly interface.

The choice of an appropriate technique and timing depends on the nature of the
data to be analyzed, the size of the data sets and the type of method to be
used. A range of techniques can be applied to a problem either alone or in
combination. However, when deploying a sophisticated blend of data mining
techniques, at least two requirements need to be met: the ability to
cross-validate results and the availability of measurement criteria.

4.5. Methods of Data Mining Algorithms

Although data mining software packages are nowadays claimed to be more
automated, they still require some direction from users. The expected method of
the data mining algorithm is one of those requirements. Therefore, when
employing data mining tools, users should have a basic knowledge of these
methods. Data mining methods can be categorized in different ways. In general,
however, they fall into six broad categories: data description, dependency
analysis, classification and prediction, cluster analysis, outlier analysis and
evolution analysis. Details of each method are as follows:

4.5.1. Data Description

The objective of data description is to provide an overall description of data,
either in itself or in each class or concept, typically in summarized, concise
and precise form. There are two main approaches to obtaining a data
description: data characterization and data discrimination. Data
characterization summarizes the general characteristics of data, while data
discrimination, also called data comparison, compares the characteristics of
data between contrasting groups or classes. Normally, these two approaches are
used together.
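
The two approaches can be sketched in a few lines. In the example below, characterization summarizes each class by its record count and mean, and discrimination compares those means between two contrasting classes. The record layout, class labels and the choice of the mean as the summary statistic are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (customer class, purchase amount).
records = [
    ("member", 120), ("member", 80),
    ("guest", 30), ("guest", 50),
]


def characterize(rows):
    """Data characterization: summarize each class by count and mean."""
    groups = defaultdict(list)
    for label, amount in rows:
        groups[label].append(amount)
    return {
        label: {"count": len(amounts), "mean": mean(amounts)}
        for label, amounts in groups.items()
    }


def discriminate(summary, class_a, class_b):
    """Data discrimination: difference in mean between two contrasting classes."""
    return summary[class_a]["mean"] - summary[class_b]["mean"]


summary = characterize(records)
gap = discriminate(summary, "member", "guest")
print(summary, gap)
```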

Although data description is one of the data mining methods, it is usually not
the actual target of the analysis. Often data description is the analyst’s
first requirement, as it helps to gain insight into the nature of the data and
to find potential hypotheses, or the last one, in order to present data mining
results. An example of using data description as a presentation tool is the
description of the characteristics of each cluster that a neural network
algorithm could not identify.

Appropriate data mining techniques for this method are attribute-oriented

induction, data generalization and aggregation, relevance analysis, distance analysis,

rule induction and conceptual clustering.


4.5.2. Dependency Analysis

The purpose of dependency analysis, also called association analysis, is to
search for the most significant relationships across a large number of
variables or attributes. Sometimes, association is viewed as one type of
dependency in which affinities between data items are described (e.g.,
describing data items or events that frequently occur together or in sequence).

This type of method is very common in the field of marketing research. The most
prevalent one is market-basket analysis. It analyzes which products customers
tend to buy together and presents the findings as “[Support, Confidence]”
association rules. The support measure states the percentage of events that
occur together relative to the whole population. The confidence measure states
the percentage of occurrences of the leading event that are accompanied by the
following event. For example, the association rule in Figure 4.2 means that
milk and bread were bought together in 6% of all transactions under analysis
and that 75% of customers who bought milk also bought bread.

Milk => bread [support = 6%, confidence = 75%]

Figure 4.2: Example of association rule
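
The two measures in Figure 4.2 can be computed directly from transaction counts. The following is a minimal sketch in which the function names and the transaction list are hypothetical. The counts (100 transactions, 8 containing milk, 6 containing both milk and bread) are chosen only so that the result reproduces the percentages in the figure.

```python
def count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t)


def support(transactions, items):
    """Share of all transactions in which the items occur together."""
    return count(transactions, items) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)


# Hypothetical data: 6 baskets with milk and bread, 2 with milk only,
# and 92 with neither of the two products.
transactions = [{"milk", "bread"}] * 6 + [{"milk"}] * 2 + [{"eggs"}] * 92

s = support(transactions, {"milk", "bread"})
c = confidence(transactions, {"milk"}, {"bread"})
print(f"Milk => bread [support = {s:.0%}, confidence = {c:.0%}]")
```

Note that confidence is directional: the confidence of "bread => milk" on the same data would be computed against the number of baskets containing bread, and would generally differ.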

Some techniques for dependency analysis are nonlinear regression, rule
induction, statistical sampling, data normalization, the Apriori algorithm,
Bayesian networks and data visualization.

4.5.3. Classification and Prediction

Classification is the process of finding models, also known as classifiers, or
functions that map records into one of several discrete predefined classes. It
is mostly used for predictive purposes.
