Typically, the model construction begins with two types of data sets -- training and testing. The training data sets, with prescribed class labels, are fed into the model so that the model can find the parameters or characteristics that distinguish one class from another. This step is called the learning process. Then, the testing data sets, without pre-classified labels, are fed into the model. The model will, ideally, automatically assign the correct class labels to those testing items. If the results of


testing are unsatisfactory, then more training iterations are required. On the other hand,

if the results are satisfactory, the model can be used to predict the classes of target items

whose class labels are unknown.

This method is most effective when the underlying reasons for labeling are subtle. The advantage of this method is that the pre-classified labels can be used to measure the performance of the model, which gives the model developer confidence in how well the model performs.

Appropriate techniques include neural networks, relevance analysis, discriminant analysis, rule induction, decision trees, case-based reasoning, genetic algorithms, linear and non-linear regression, and Bayesian classification.
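The training and testing workflow described above can be sketched with a toy nearest-centroid classifier. All data values, labels and thresholds here are hypothetical; this is an illustration of the process, not any particular technique from the list.

```python
# A toy nearest-centroid classifier sketching the learning and testing
# steps described above; all data values are hypothetical.

def train(samples):
    # Learning step: average the feature value of each class label.
    grouped = {}
    for value, label in samples:
        grouped.setdefault(label, []).append(value)
    return {label: sum(vs) / len(vs) for label, vs in grouped.items()}

def classify(model, value):
    # Assign the label whose centroid lies closest to the item.
    return min(model, key=lambda label: abs(model[label] - value))

# Training data: items with prescribed class labels.
training = [(1.0, "low"), (1.5, "low"), (9.0, "high"), (9.5, "high")]
model = train(training)

# Testing data: labels withheld from the model, kept to score it.
testing = [(1.1, "low"), (9.2, "high")]
accuracy = sum(classify(model, v) == lbl for v, lbl in testing) / len(testing)
print(accuracy)  # 1.0 on this toy data
```

If the accuracy on the withheld labels were unsatisfactory, more training data or iterations would be needed, exactly as the text describes.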

4.5.4. Cluster analysis

Cluster analysis addresses segmentation problems. The objective of this analysis is to group data items with similar characteristics and separate them from dissimilar ones. The

difference between clustering and classification is that while clustering does not require

pre-identified class labels, classification does. That is why classification is also called

supervised learning while clustering is called unsupervised learning.

As mentioned above, it is sometimes more convenient to analyze data in aggregated form and break it down into details as needed. For data management purposes, cluster analysis is frequently the first required task of the mining process; the most interesting cluster can then be selected for further investigation. In addition, description techniques may be integrated in order to identify the characteristics that provide the best clustering.

Examples of appropriate techniques for cluster analysis are neural

networks, data partitioning, discriminant analysis and data visualization.
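The idea of grouping items without pre-identified labels can be illustrated with a minimal one-dimensional k-means sketch. The data points and the choice of k = 2 are illustrative assumptions, not part of any particular tool mentioned above.

```python
# A minimal one-dimensional k-means sketch of unsupervised clustering;
# the data points and the choice of k = 2 are illustrative assumptions.

def kmeans_1d(points, k, iterations=10):
    # Naive initialization: take the first k points as centroids.
    centroids = points[:k]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(centroids[i] - p))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans_1d([1.0, 1.5, 0.5, 9.0, 9.5, 8.5], k=2)
print(sorted(centroids))  # [1.0, 9.0]
```

No labels are supplied anywhere; the two groups emerge from the data alone, which is what distinguishes this from the classification sketch.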

4.5.5. Outlier Analysis

Some data items that are distinctly dissimilar to the others, or outliers, can be viewed as noise or errors, which ordinarily need to be removed before feeding data sets into a data mining model. However, such noise can be useful in cases where unusual items or exceptions are the major concern. Examples are fraud detection, unusual usage patterns and remarkable response patterns.


The challenge is to distinguish the outliers from the errors. During the data understanding phase, data cleaning and scrubbing are required. This step includes finding erroneous data and trying to fix it, so the possibility of detecting interesting deviations might be diminished. On the other hand, if the incorrect data remain in the data sets, the accuracy of the model will be compromised.

Appropriate techniques for outlier analysis include data cube,

discriminant analysis, rule induction, deviation analysis and non-linear regression.
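As a minimal sketch of deviation-based outlier detection, the following flags items that lie far from the mean of the data set. The usage figures and the threshold of 2.0 standard deviations are hypothetical choices for illustration.

```python
# A sketch of deviation-based outlier detection: flag items that lie
# more than `threshold` standard deviations from the mean. The usage
# figures and the threshold of 2.0 are hypothetical.
import statistics

def outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Six ordinary usage readings and one extreme value: fraud or error?
usage = [102, 98, 100, 101, 99, 97, 500]
print(outliers(usage))  # [500]
```

Note that the rule only finds the unusual item; deciding whether 500 is an interesting exception or a data-entry error is exactly the judgement call the text describes.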

4.5.6. Evolution Analysis

This method is the newest one. The creation of evolution analysis is to

support the promising capability of data warehouses which is data or event collection

over a period of time. Now that business people came to realize the value of trend

capture that can be applied to the time-related data in the data warehouse, it attracts

increasing attention in this method.

The objective of evolution analysis is to determine the most significant changes in data sets over time. In other words, it applies the other types of methods (i.e., data description, dependency analysis, classification or clustering) together with time-related and sequence-related characteristics. Therefore, the tools and techniques available for this type of method include all the tools and techniques of the other types, as well as time-related and sequential data analysis tools.

Examples of evolution analysis are sequential pattern discovery and time-dependent analysis. Sequential pattern discovery detects patterns between events such that the presence of one set of items is followed by another (Connolly, 1999, 965). Time-dependent analysis determines the relationship between events that correlate within a definite period of time.
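A crude sketch of sequential pattern discovery is to count, across event sequences, how often one item is later followed by another. The purchase histories below are hypothetical.

```python
# A sketch of sequential pattern discovery: across event sequences,
# count how often one item is later followed by another. The purchase
# histories are hypothetical.
from collections import Counter
from itertools import combinations

def followed_by_counts(sequences):
    counts = Counter()
    for seq in sequences:
        seen = set()
        # combinations() keeps order, so (a, b) means a occurred before b.
        for a, b in combinations(seq, 2):
            if a != b and (a, b) not in seen:
                seen.add((a, b))  # count each ordered pair once per sequence
                counts[(a, b)] += 1
    return counts

histories = [
    ["camera", "memory card", "tripod"],
    ["camera", "memory card"],
    ["tripod", "camera"],
]
counts = followed_by_counts(histories)
print(counts[("camera", "memory card")])  # 2
```

Order matters here: "camera followed by memory card" occurs in two sequences, while the reverse pair occurs in none, which is the kind of time-ordered regularity evolution analysis looks for.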

Different types of methods can be applied in parallel to discover hidden or unexpected patterns, but not all patterns found are interesting. A pattern is interesting if it is easily understood, valid, potentially useful and novel (Han & Kamber, 2000, 27). Therefore, analysts are still needed to evaluate whether the mining results are interesting.


To distinguish interesting patterns, users of data mining tools have to solve at least three problems. First, the correctness of patterns has to be measured. For example, the measurement used in dependency analysis is the [confidence, support] pair. Methods that have historical or training data sets, i.e., the classification and prediction methods, can more easily compare the correctness of their patterns against the real ones. For methods where training data sets are not available, the professional judgement of the users of the data mining tools is required.

Second, an optimization model for the patterns found has to be created. For example, the relative significance of confidence versus support has to be formulated. In simpler terms, this means deciding which is better: higher confidence with lower support, or lower confidence with higher support.
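The two measures can be made concrete with a small sketch that computes the [confidence, support] pair for a rule such as "bread => butter". The transactions below are hypothetical.

```python
# A sketch of computing the [confidence, support] pair for the rule
# "bread => butter"; the transactions are hypothetical.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset):
    # Fraction of all transactions that contain every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"butter"}))  # 2 of 3 bread transactions
```

A rule with very high confidence but tiny support may describe only a handful of transactions, which is precisely the trade-off the optimization model has to weigh.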

Finally, the right point to stop finding patterns has to be specified. This is probably the most challenging problem. It leads to two further questions: how to tell whether the current optimized pattern is the most satisfactory one, and how to know whether it can be used as a generalized pattern on other data sets. In short, while trying to optimize the patterns, the over-fitting problem has to be taken into account as well.
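The generalization question can be sketched with a hold-out check: optimize a simple cut-off rule on training data, then measure it on data the rule has never seen. The data, labels and candidate thresholds below are all hypothetical.

```python
# A sketch of checking for over-fitting: optimize a simple cut-off rule
# on training data, then verify it on held-out data the rule has never
# seen. The data, labels and candidate thresholds are hypothetical.

def fit_threshold(samples):
    # Choose the cut point that best separates the labels in `samples`.
    return max(range(11), key=lambda t: sum(
        ("big" if v > t else "small") == lbl for v, lbl in samples))

def accuracy(t, samples):
    return sum(("big" if v > t else "small") == lbl
               for v, lbl in samples) / len(samples)

train_set = [(1, "small"), (3, "small"), (4, "small"), (7, "big"), (9, "big")]
holdout = [(2, "small"), (5, "small"), (6, "big"), (8, "big")]

t = fit_threshold(train_set)
print(accuracy(t, train_set), accuracy(t, holdout))  # 1.0 0.75
```

The rule is perfect on the data it was fitted to but weaker on the held-out items; that gap is exactly what an over-fitting check is meant to expose.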

4.6. Examples of Data Mining Algorithms

As mentioned above, there are plenty of algorithms used to mine data. Due to limited space, this section focuses on the most frequently used and widely recognized algorithms that can be indisputably thought of as data mining algorithms, i.e., neither purely statistical nor database algorithms. The examples include the Apriori algorithm, decision trees and neural networks. The details of each algorithm are as follows:

4.6.1. Apriori Algorithms

The Apriori algorithm is the one most frequently used in the dependency analysis method. It attempts to discover frequent item sets using candidate generation for Boolean association rules. A Boolean association rule is a rule that concerns associations between the presence or absence of items (Han & Kamber, 2000, 229).

The steps of the Apriori algorithm are as follows:

(a) The analysis data is first partitioned according to the item sets.


(b) The support count of each single-item set (the 1-itemsets), also called a candidate, is computed.

(c) The item sets that do not satisfy the required minimum support count are pruned, thus creating the frequent 1-itemsets (the list of item sets that have at least the minimum support count).

(d) The frequent 1-itemsets are joined together (2-itemsets) to create the second-level candidates.

(e) The support count of each candidate is accumulated.

(f) After pruning the unsatisfactory item sets according to the minimum support count, the frequent 2-itemsets are created.

(g) Steps (d), (e) and (f) are repeated until no more frequent k-itemsets can be found or, in other words, the next frequent k-itemsets