SQL Server 2005 Mining Model Algorithms

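A few days ago I was reading up on data mining algorithms, mainly Mr. Xie Bangchang's lecture notes on SQL Server 2005 data mining algorithms posted on the blog of my teacher waxdoll, and I lifted an article from there (http://www.cnblogs.com/waxdoll/archive/2005/08/24/221406.html):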

 

Data mining algorithms are the foundation from which mining models are created. But why use models?  We build models to base our data mining approach on for the same reason engineers build prototypes before creating the real thing.  I still remember my differential calculus in this sense. We studied the subject as a prerequisite for engineering mathematics simply because if you were to create a solid object, say a spherical tank, you need to know which mathematical model can be used to get what you want.  It's the same thing with mining models.

To make this clearer, here are the different mining model algorithms and their uses.

Microsoft Decision Trees

This algorithm supports both classification and regression and works well for predictive modeling.  You can use this model to answer questions like "What causes customers to buy this product?", "Which of my existing customers do I need to focus on to generate more revenue?" and similar requirements.
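To give a feel for how a tree can answer "what causes customers to buy", here is a minimal sketch that scores candidate attributes by information gain and picks the best first split. Information gain is used only as a familiar illustration; the sketch does not claim to reproduce the scoring Microsoft Decision Trees uses internally, and the customer records and attribute names are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, attributes, target):
    """Attribute whose split separates the target classes best."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical customer records (illustration only).
customers = [
    {"age_group": "young", "owns_home": "no", "bought": "yes"},
    {"age_group": "young", "owns_home": "yes", "bought": "yes"},
    {"age_group": "older", "owns_home": "no", "bought": "no"},
    {"age_group": "older", "owns_home": "yes", "bought": "no"},
]

print(best_split(customers, ["age_group", "owns_home"], "bought"))
```

In this toy data the age group separates buyers from non-buyers completely, so it is chosen as the first split; a real tree would repeat the same scoring inside each branch.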

In management science, decision trees are pictorial networks of alternative courses of action showing the possible outcomes of different choices, taking into account probabilities, costs and returns. Decision trees enable a manager to set out the consequences of choices, ensuring that all possibilities have been considered, and to assess both the likelihood of each possibility and its result in terms of cost and profit. In Microsoft SQL Server 2005 Data Mining, you are actually automating all of these tasks.
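The management-science version of a decision tree can be sketched in a few lines: each alternative has outcomes with a probability and a payoff, and the manager picks the alternative with the highest expected value. The alternatives, probabilities, and payoffs below are invented for illustration.

```python
# Each alternative maps to a list of (probability, net payoff) outcomes.
# All names and numbers are hypothetical.
alternatives = {
    "launch_new_product": [(0.5, 100_000), (0.5, -40_000)],
    "keep_current_line": [(1.0, 20_000)],
}

def expected_value(outcomes):
    """Probability-weighted payoff of one alternative."""
    return sum(probability * payoff for probability, payoff in outcomes)

def best_alternative(options):
    """Name of the alternative with the highest expected value."""
    return max(options, key=lambda name: expected_value(options[name]))

for name, outcomes in alternatives.items():
    print(name, expected_value(outcomes))
print("choose:", best_alternative(alternatives))
```

The mining algorithm differs in that it learns the branches and probabilities from data rather than having a manager write them down, but the way outcomes are weighed is the same idea.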

Microsoft Clustering

This algorithm uses iterative techniques to group records from a data set into clusters with similar characteristics. This model is used to answer questions like, “How can I differentiate my customers?”  You can also think in terms of groupings: for example, affluent people in the US tend to own an older car, live in an older house, have saved or invested their money, and be married to one spouse (from the book The Millionaire Next Door by Dr. Thomas Stanley). With results like these, you can use the information to change your business approach (in this case, what can I do to be the next American millionaire? Some of the author's findings don't apply in the Philippine setting, though.)
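As a rough picture of what "iterative" grouping means, here is a small k-means sketch: assign each record to its nearest cluster center, move each center to the mean of its members, and repeat. K-means is used here only as a well-known example of iterative clustering; it is not claimed to be what the Microsoft algorithm runs by default. The customer data (age, income in thousands) is made up.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centers."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical customers as (age, income in thousands).
customers = [(25, 30), (27, 32), (26, 31), (55, 90), (58, 95), (57, 92)]
assignments, centroids = kmeans(customers, k=2)
print(assignments)
print(centroids)
```

The younger, lower-income customers end up in one cluster and the older, higher-income customers in the other, even though the two starting centers were both taken from the first group.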

Microsoft Naïve Bayes

This is another classification algorithm, similar to decision trees.  Let’s consider an example to explain it further.  Let’s say you are an isaw and barbecue vendor who happens to use a database to track your customers and their buying patterns (of course, aside from isaw there are also tenga, balun-balunan, balut, etc.; name it, Filipino specialties sell well).  You gather data from your customers such as gender, age, occupation, and probably waistline. You can train this algorithm on that data against a classification, say the item purchased.  This algorithm (Naïve Bayes is a pain to type, so I'll just say "algorithm") can then estimate, based on the data gathered, the probability that a given customer will purchase each item. So if Erap drops by your barbecue stand, knowing his age, gender, occupation, and waistline, you can predict what he will buy from you (although Erap would probably throw off this algorithm's predictions, given how much he drinks and how many people he brings along when he drinks... he he).
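The barbecue-stand idea can be sketched as a tiny categorical Naïve Bayes classifier. This is a minimal illustration with add-one (Laplace) smoothing, which is my own choice rather than anything stated about the Microsoft implementation; the purchase records and attribute names are invented.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count classes and per-class attribute values from training rows."""
    class_counts = Counter(r[target] for r in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    values_seen = defaultdict(set)       # attribute -> distinct values
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                value_counts[(attr, r[target])][value] += 1
                values_seen[attr].add(value)
    return class_counts, value_counts, values_seen

def predict(model, customer):
    """Return a normalized probability for each class given the customer."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, value in customer.items():
            count = value_counts[(attr, cls)][value]
            # Add-one smoothing so unseen values do not zero out the score.
            score *= (count + 1) / (n + len(values_seen[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical purchase history.
purchases = [
    {"gender": "M", "age_group": "adult", "item": "isaw"},
    {"gender": "M", "age_group": "adult", "item": "isaw"},
    {"gender": "M", "age_group": "teen", "item": "isaw"},
    {"gender": "F", "age_group": "adult", "item": "barbecue"},
    {"gender": "F", "age_group": "teen", "item": "barbecue"},
    {"gender": "M", "age_group": "teen", "item": "barbecue"},
]

model = train(purchases, "item")
probabilities = predict(model, {"gender": "M", "age_group": "adult"})
print(probabilities)
```

The "naïve" part is that each attribute contributes its own factor independently of the others, which is why the class scores are a simple product.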

Microsoft Time Series

This algorithm uses a linear regression decision tree approach to analyze time-related data.  Using this algorithm you can forecast how much revenue you will have next year based on past sales from the isaw and barbecue stand, or what inventory levels will be next month, and, if you open an additional branch, you can use the outcome of this model to predict the probable revenue of the other branches. A typical application is one I saw on the web where you simply key in your age, gender, occupation, and lifestyle and it predicts how many days you have left before you die.  I don’t really believe the results, because no one knows for certain when they will leave this planet, but it makes use of this concept.
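The regression part of this approach can be illustrated with a first-order autoregression: fit a line that predicts each period's value from the previous one, then roll it forward. This sketch deliberately leaves out the tree splits and uses a single regression over the whole series, so it is a simplification rather than the Microsoft algorithm. The sales figures are made up.

```python
def fit_ar1(series):
    """Least-squares fit of value[t] = intercept + slope * value[t-1]."""
    previous, current = series[:-1], series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_cur = sum(current) / n
    sxx = sum((x - mean_prev) ** 2 for x in previous)
    sxy = sum((x - mean_prev) * (y - mean_cur) for x, y in zip(previous, current))
    slope = sxy / sxx
    intercept = mean_cur - slope * mean_prev
    return intercept, slope

def forecast(series, steps):
    """Roll the fitted model forward `steps` periods past the last value."""
    intercept, slope = fit_ar1(series)
    last = series[-1]
    predictions = []
    for _ in range(steps):
        last = intercept + slope * last
        predictions.append(last)
    return predictions

# Hypothetical monthly sales from the barbecue stand.
monthly_sales = [100, 110, 120, 130, 140]
print(forecast(monthly_sales, steps=2))
```

Each forecast is fed back in as the "previous" value for the next step, so errors compound the further ahead you predict.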

Microsoft Association Rules

This algorithm builds rules describing which items are most likely to appear together in a transaction. You normally see this on Amazon.com’s website, where they cross-sell products. You will read, “Customers who bought this product also bought these…” followed by the recommendations.  From a selling point of view, this gives you a basis for suggestive selling like what the fast-food people do (“Would you like to try our new apple flavored gravy?”).  What they do is simply suggest. Implementing this algorithm makes the suggesting intelligent: because the suggestion is based on existing data, there is a higher probability that it will be taken into consideration.
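Rules like "customers who bought X also bought Y" are usually scored with support (how often the items appear together) and confidence (how often Y appears when X does). The sketch below computes both for item pairs; the thresholds, function names, and transactions are assumptions made for illustration.

```python
from itertools import permutations

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def pair_rules(transactions, min_support, min_confidence):
    """All single-item rules A -> B that pass both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in permutations(items, 2):
        s = support(transactions, {a, b})
        if s >= min_support:
            c = confidence(transactions, {a}, {b})
            if c >= min_confidence:
                rules.append((a, b, s, c))
    return rules

# Hypothetical fast-food orders.
orders = [
    {"burger", "fries", "gravy"},
    {"burger", "fries"},
    {"burger", "gravy"},
    {"fries", "gravy"},
    {"burger", "fries", "gravy"},
]

for a, b, s, c in pair_rules(orders, min_support=0.5, min_confidence=0.7):
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

A rule with high confidence but very low support rests on only a handful of transactions, which is why both thresholds are applied.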

Microsoft Sequence Clustering

This algorithm analyzes sequence-oriented data that contains discrete-valued series and is a hybrid of sequence and clustering algorithms. Usually the sequence attribute in the series holds a set of events with a specific order. By analyzing the transitions between states of the sequence, the algorithm can predict future states in related sequences.  This answers questions such as “How do I differentiate my customers?” or “How do I know which events caused the outage of my servers?”
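The "transition between states" idea can be sketched by counting how often one event follows another and turning the counts into probabilities. This covers only the sequence half of the algorithm, with no clustering step, and the event logs are invented for illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate P(next state | current state) from ordered event lists."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }

def most_likely_next(probabilities, state):
    """Most probable follower of `state`, or None if it was never followed."""
    followers = probabilities.get(state)
    if not followers:
        return None
    return max(followers, key=followers.get)

# Hypothetical server event logs, each in time order.
event_logs = [
    ["disk_warning", "high_io", "outage"],
    ["disk_warning", "high_io", "outage"],
    ["disk_warning", "high_io", "recovered"],
    ["login", "browse", "logout"],
]

probabilities = transition_probabilities(event_logs)
print(most_likely_next(probabilities, "high_io"))
print(probabilities["high_io"])
```

Grouping sequences that share similar transition patterns is the step that would turn this into sequence clustering.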

Microsoft Neural Network

A neural network is an interconnected group of artificial or biological neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. The algorithm provider processes the entire set of cases, iteratively comparing the predicted classification of the cases with the known actual classification of the cases. The errors from the initial classification in the first iteration over the entire set of cases are fed back into the network and used to modify the network's performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes.

One of the major advantages of neural networks is that, theoretically, they are capable of approximating any continuous function, so the researcher does not need to have any hypotheses about the underlying model or even, to some extent, about which variables matter. An important disadvantage, however, is that the final solution depends on the initial conditions of the network, and it is virtually impossible to "interpret" the solution in traditional, analytic terms, such as those used to build theories that explain phenomena.  This algorithm helps you answer questions like “How long will this asset be in service before it is totally depreciated?” or “Will this customer, who is a recipient of a targeted mailing campaign, buy a product?”

Microsoft Linear Regression

The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes. One application I can think of is measuring how closely stress is related to heart attacks, or how closely corruption is related to people in the government.  A business application that could take advantage of this algorithm is determining how closely our marketing strategy is related to revenue for a particular product.
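For a single input, the "whole regression formula in one node" boils down to fitting one straight line. The sketch below fits that line by ordinary least squares and also reports the correlation coefficient as a measure of how closely the two quantities are related. The marketing and revenue figures are hypothetical.

```python
from math import sqrt

def linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / sqrt(sxx * syy)
    return intercept, slope, correlation

# Hypothetical marketing spend and revenue, both in thousands.
marketing_spend = [1, 2, 3, 4, 5]
revenue = [12, 14, 16, 18, 20]

intercept, slope, correlation = linear_regression(marketing_spend, revenue)
print(f"revenue = {intercept:.1f} + {slope:.1f} * spend (r = {correlation:.2f})")
print("predicted revenue at spend 6:", intercept + slope * 6)
```

A correlation near 1 or -1 means the points lie close to the line; a value near 0 means the line explains little, whatever its slope.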

Microsoft Logistic Regression

This algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes. An application of this is identifying the high-cost users of medical care.  In countries like the US and Canada, where medical care is partly shouldered by the government, analysis is needed to determine what factors are contributing to this so the government can generate policies on medical claims.
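A network with no hidden layer is just a weighted sum of the inputs passed through a sigmoid, trained by repeatedly feeding the prediction error back into the weights. The sketch below does this for a binary outcome only; the learning rate, epoch count, feature names, and patient records are all assumptions for illustration and say nothing about the Microsoft algorithm's actual training settings.

```python
from math import exp

def sigmoid(z):
    """Squash a weighted sum into a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def train_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Per-case gradient updates on the weights of a single output unit."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            predicted = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = y - predicted  # the error fed back into the weights
            bias += learning_rate * error
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights, bias

def predict_probability(model, x):
    """Probability that the case belongs to the positive class."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical patients as (hospital visits per year, chronic conditions);
# label 1 marks a high-cost user of medical care.
patients = [(0, 0), (1, 0), (1, 1), (4, 2), (5, 3), (4, 3)]
high_cost = [0, 0, 0, 1, 1, 1]

model = train_logistic(patients, high_cost)
print(predict_probability(model, (5, 3)))
print(predict_probability(model, (0, 0)))
```

The learned weights can be read directly as "more visits and more chronic conditions raise the probability", which is the interpretability that is lost once a hidden layer is added.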

 

posted @ 2008-07-31 18:05  spoony