
The Java Data Mining API [repost]

 

by Benoy Jose

Introduction:

Data mining is an important process used by most companies today. It includes sifting through large volumes of business data for potential leads, sales analysis, auditing, data warehousing, business intelligence, and many other functions. Most companies store data in a variety of sources, such as mainframes, databases, and flat files. Grouping and analyzing data from these disparate sources becomes a significant problem. Most individual data source providers offer some API through which analysis can be performed on the data they hold. Consider the plight of a company that has data spread across different sources, such as Oracle and a mainframe, and needs to analyze all of it: it would have to analyze the data in each source separately and then consolidate the results. This is where JDM fits in. The Java Data Mining API (JDM) proposes a pure Java API for developing data mining applications. The idea is to have a common API for data mining that clients can use without being aware of, or affected by, the actual vendor implementations.

Architecture:

The JDM architecture consists of three logical components: the API, the data mining engine (DME), and the metadata repository (MR). The API is the exposed programming interface that provides access to the services of the DME; it shields the data mining user from the actual implementation of the DME and any associated sub-components the DME uses. The DME is the engine that provides the services users access through the API. When the DME is implemented as a server, it is called a data mining server. The third component, the metadata repository, is used to persist data mining objects, which the DME then reuses in mining operations. The metadata repository can be a flat file system or a relational database. The three logical components can be grouped into one physical system or exist independently as separate components. Beyond these, JDM implementers can add components and tools to enhance their implementations, but such additions are not defined in the JDM specification.

Data Mining Terms:

Listed below are the principal objects and terms used in the JDM specification.

Connection: Users obtain a connection object from a connection factory to access the DME. The connection object also provides access to the objects in the metadata repository (MR); connection objects can create, retrieve, and delete mining objects in the MR.

Task: A task object defines work to be performed by the DME. The JDM defines tasks to build models, to apply them, and to test them, and additionally provides tasks to compute statistics and to import and export data. Tasks can be grouped into batches and scheduled for execution by the application.

Execution Handle and Status: An asynchronous task produces an execution handle that can be used to track executing or completed tasks. An execution handle also provides a mechanism to block until a task completes, effectively making it synchronous. It can also be used to terminate an executing task, though the actual behavior is left to the vendor implementing the JDM specification.
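
The block-until-complete behavior described above can be illustrated in plain Java. The sketch below is not the JDM API: the `Handle` and `Status` names are invented stand-ins, and the executor plays the role of a DME. It shows how a handle can wrap an asynchronous task, block to make it synchronous, or request termination.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutionHandleDemo {
    // Hypothetical status values, loosely modeled on the idea in the spec
    public enum Status { EXECUTING, SUCCESS, TERMINATED }

    // A toy "execution handle": wraps a Future the way a DME might
    public static class Handle {
        private final Future<?> future;
        public Handle(Future<?> f) { this.future = f; }

        // Block until the task completes, turning the async task synchronous
        public Status waitForCompletion() throws Exception {
            future.get();
            return Status.SUCCESS;
        }

        // Request termination of an executing task
        public boolean terminate() { return future.cancel(true); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService dme = Executors.newSingleThreadExecutor();
        // Submit an asynchronous "build" task and obtain its handle
        Handle handle = new Handle(dme.submit(() -> { /* pretend to build a model */ }));
        System.out.println(handle.waitForCompletion()); // SUCCESS
        dme.shutdown();
    }
}
```

A real JDM vendor would return its own handle type from the connection; only the blocking/terminating pattern carries over.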

Physical Data Set: The physical data set is the actual data used as input to a data mining operation. The physical data set object can represent relational tables, star schemas, structured files, XML files, and OLAP cubes. In the first release of the specification, only tables and files are supported.

Physical Data Record: Physical data records are used for single-case scoring, in both input and output operations.

Build Settings: Build settings specify the parameters required for building a model, including which algorithm to use. Settings have default values that are used when the user omits parameters.

Algorithm: An algorithm is applied to a set of data to produce a model. JDM does not define a large number of algorithms but provides mechanisms to add new ones. An algorithm can optionally have an associated settings object used to set its parameters.

Algorithm settings: Algorithm settings are used to set parameters for a particular algorithm that is selected. This helps in fine tuning the algorithm.

Model: A model is produced when an algorithm is applied to a set of data. In the first release, models are read-only and are stored in the metadata repository. A model is specific to the algorithm used to create it and is related to the task that created it.

Model Signature: A model signature defines the input parameters required to use the model. The signature consists of attributes like name, data type, and type.

Model Detail: A model detail object represents the detailed state of a model. The details are specific to the algorithm used and change when the model or algorithm changes.

Attribute Statistics Set: An attribute statistics set contains statistics computed over a set of attributes. It is created when statistics are computed on a data set object.

Confusion matrix: A confusion matrix is produced when a model is tested. It tells the user how well the model predicts values and where it makes mistakes.
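
A confusion matrix is easy to sketch in plain Java. The class and method names below are invented for illustration; JDM would return an equivalent structure from a test task. Correct predictions land on the diagonal, mistakes off it.

```java
public class ConfusionMatrixDemo {
    // rows = actual class, cols = predicted class; labels: 0 = negative, 1 = positive
    public static int[][] confusionMatrix(int[] actual, int[] predicted) {
        int[][] m = new int[2][2];
        for (int i = 0; i < actual.length; i++) {
            m[actual[i]][predicted[i]]++;
        }
        return m;
    }

    // Accuracy = share of cases on the diagonal (correct predictions)
    public static double accuracy(int[][] m) {
        int correct = m[0][0] + m[1][1];
        int total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual    = {1, 0, 1, 1, 0, 0, 1, 0};
        int[] predicted = {1, 0, 0, 1, 0, 1, 1, 0};
        int[][] m = confusionMatrix(actual, predicted);
        // m[1][0] counts missed positives, m[0][1] counts false alarms
        System.out.println(accuracy(m)); // 0.75
    }
}
```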

Lift: Lift calculates the ratio between the results computed with and without the predictive model. Lift provides a measure of success of the predictive model.
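A minimal sketch of the lift calculation, assuming the common definition: the positive rate among the top-scored fraction of cases divided by the overall positive rate. The class and method names are invented for illustration, not part of the JDM API.

```java
import java.util.Arrays;

public class LiftDemo {
    // Lift of the top-scored `fraction` of cases vs. the overall positive rate
    public static double lift(double[] scores, int[] actual, double fraction) {
        int n = scores.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        // Sort case indices by model score, highest first
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        int top = (int) Math.round(n * fraction);
        double topPos = 0, allPos = 0;
        for (int i = 0; i < top; i++) topPos += actual[order[i]];
        for (int v : actual) allPos += v;
        return (topPos / top) / (allPos / n);
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1};
        int[] actual    = {1,   1,   0,   1,   0,   0,   1,   0};
        // The top quarter (2 cases) are both positives; the overall rate is 0.5,
        // so the model doubles the hit rate in that segment.
        System.out.println(lift(scores, actual, 0.25)); // 2.0
    }
}
```

A lift of 1.0 would mean the model does no better than selecting cases at random.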

Cost matrix: A cost matrix defines a tabular representation of the cost associated with predicted values and actual values.
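Combining a cost matrix with a confusion matrix gives the expected cost of using a model. The sketch below (names invented for illustration) weights each actual/predicted count by its cost, so different kinds of mistakes can be penalized differently.

```java
public class CostMatrixDemo {
    // counts[actual][predicted] from a confusion matrix; cost[actual][predicted]
    public static double totalCost(int[][] counts, double[][] cost) {
        double sum = 0;
        for (int a = 0; a < counts.length; a++) {
            for (int p = 0; p < counts[a].length; p++) {
                sum += counts[a][p] * cost[a][p];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: a missed positive (actual 1, predicted 0)
        // costs five times as much as a false alarm (actual 0, predicted 1).
        int[][] counts  = {{50, 10}, {5, 35}};  // rows actual, cols predicted
        double[][] cost = {{0, 1}, {5, 0}};     // correct predictions cost 0
        System.out.println(totalCost(counts, cost)); // 35.0
    }
}
```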

Data Mining Functions:

Data mining functions can be classified as supervised or unsupervised. A supervised mining function predicts a value based on a preset target: the target must be defined, and the function measures how well predictions match the target values. Unsupervised functions do not need a target and are used to identify structures and relations in the data. Another classification is based on how the mining is done: functions can be descriptive or predictive. Descriptive data mining produces a concise dataset that presents general properties of the data, while predictive data mining performs inferences on the available data and tries to predict outcomes for new data sets.

The above categories are for the classification of the functions provided by JDM. The actual data mining functions are described below.

Classification is a type of supervised function in which an algorithm builds a model from a set of predefined predictors and uses it to predict the target. It is commonly used in business modeling and credit analysis.

The second type of function is Regression, also a supervised function. Where classification predicts a discrete class, regression predicts a continuous numeric value. Regression is commonly used in financial forecasting and drug response modeling.
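
The simplest regression, a straight-line fit to one predictor, can be done in closed form. The sketch below (names invented for illustration, not JDM classes) fits y = a + b*x by ordinary least squares.

```java
public class RegressionDemo {
    // Fit y = a + b*x by ordinary least squares (closed form, one predictor)
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        // slope = covariance(x, y) / variance(x); intercept from the means
        double cov = 0, var = 0;
        for (int i = 0; i < n; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            var += (x[i] - mx) * (x[i] - mx);
        }
        double b = cov / var;
        return new double[] { my - b * mx, b }; // {intercept, slope}
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9}; // exactly y = 1 + 2x
        double[] coef = fit(x, y);
        System.out.println(coef[0] + " " + coef[1]); // 1.0 2.0
    }
}
```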

The third type is Attribute Importance, which can be either a supervised or an unsupervised function. Attribute importance identifies which attributes matter for building a model, reducing build time and improving accuracy by eliminating noise attributes. The function ranks all attributes by their relative importance to the model and lets the user choose the most important ones to build with.
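
One simple way to score attributes, sketched below with invented names (JDM leaves the actual method to the vendor's algorithm), is the absolute Pearson correlation between each attribute and the target: an attribute uncorrelated with the target scores near zero and can be dropped as noise.

```java
import java.util.Arrays;

public class AttributeImportanceDemo {
    // Score one attribute by its absolute Pearson correlation with the target
    public static double importance(double[] attr, double[] target) {
        int n = attr.length;
        double ma = Arrays.stream(attr).average().orElse(0);
        double mt = Arrays.stream(target).average().orElse(0);
        double cov = 0, va = 0, vt = 0;
        for (int i = 0; i < n; i++) {
            cov += (attr[i] - ma) * (target[i] - mt);
            va  += (attr[i] - ma) * (attr[i] - ma);
            vt  += (target[i] - mt) * (target[i] - mt);
        }
        return Math.abs(cov / Math.sqrt(va * vt));
    }

    public static void main(String[] args) {
        double[] target      = {0, 0, 1, 1};
        double[] informative = {0.1, 0.2, 0.9, 1.0}; // tracks the target
        double[] noise       = {0.5, 0.4, 0.5, 0.4}; // unrelated to it
        System.out.println(importance(informative, target) > importance(noise, target)); // true
    }
}
```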

The fourth type is Clustering. Clustering identifies clusters in the data. A cluster is a collection of objects that are similar to each other. Clustering is used primarily in customer segmentation, product groupings, and text mining.
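The core step of most clustering algorithms is assigning each case to its nearest cluster center. The toy sketch below (names and data invented for illustration) does this for one-dimensional points, as in segmenting customers by spend; real algorithms also iterate to find the centers themselves.

```java
import java.util.Arrays;

public class ClusteringDemo {
    // Assign each 1-D point to the nearest of the given cluster centers
    public static int[] assign(double[] points, double[] centers) {
        int[] labels = new int[points.length];
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(points[i] - centers[c]) < Math.abs(points[i] - centers[best])) {
                    best = c;
                }
            }
            labels[i] = best;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical customer spend values: two natural groups emerge
        double[] spend   = {1.0, 1.2, 0.9, 9.8, 10.1};
        double[] centers = {1.0, 10.0};
        System.out.println(Arrays.toString(assign(spend, centers))); // [0, 0, 0, 1, 1]
    }
}
```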

The fifth type of function is Association. Association looks for patterns of relationships in data, such as items that frequently occur together, and for recurring patterns of this kind throughout the data set. This is useful in analyzing consumer behavior.
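
Association rules are usually judged by support and confidence. The sketch below (names and data invented for illustration) computes both for a single rule X implies Y over a list of transactions: support is the share of all transactions containing both items, confidence is the share of X-transactions that also contain Y.

```java
import java.util.List;
import java.util.Set;

public class AssociationDemo {
    // Returns {support, confidence} for the rule: x implies y
    public static double[] rule(List<Set<String>> transactions, String x, String y) {
        double both = 0, withX = 0;
        for (Set<String> t : transactions) {
            if (t.contains(x)) {
                withX++;
                if (t.contains(y)) both++;
            }
        }
        // support = P(x and y); confidence = P(y | x)
        return new double[] { both / transactions.size(), both / withX };
    }

    public static void main(String[] args) {
        List<Set<String>> txns = List.of(
            Set.of("bread", "butter"),
            Set.of("bread", "butter", "milk"),
            Set.of("bread", "milk"),
            Set.of("milk"));
        double[] r = rule(txns, "bread", "butter");
        // 2 of 4 transactions have both; 2 of the 3 bread-transactions have butter
        System.out.println(r[0] + " " + r[1]); // 0.5 0.6666666666666666
    }
}
```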

posted on 2004-10-25 04:58 by Chech