The Decision Tree Algorithm and Its Implementation

Chapter 3: Decision Trees

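3.1 Introduction to the algorithm

A decision tree looks much like a flowchart. The difference is that a decision tree algorithm can extract rules from an unfamiliar data set and build the tree from those rules.

Advantages: low computational cost, output that is easy to interpret, insensitivity to missing values, and the ability to handle irrelevant features.

Disadvantage: prone to overfitting.

Applicable data types: numeric and nominal.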

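3.2 General workflow

1. Collect the data

2. Prepare the data: building the tree requires nominal data, so numeric data must be discretized first

3. Analyze the data

4. Train the algorithm: build a decision tree from the training data

5. Test the algorithm

6. Use the algorithm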
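
Step 2 says numeric features must be discretized before the tree is built. A minimal sketch of one way to do that, assuming hand-picked thresholds. The function name discretize, the cut points, and the bin labels are all illustrative and are not part of the original code:

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a numeric value to a nominal label.

    cut_points must be sorted and labels needs len(cut_points) + 1 entries.
    A value equal to a cut point falls into the bin above it.
    """
    return labels[bisect_right(cut_points, value)]

# hypothetical age thresholds, chosen only for illustration
age_cuts = [30, 50]
age_labels = ['young', 'middle', 'old']
ages = [22, 30, 47, 63]
nominal_ages = [discretize(a, age_cuts, age_labels) for a in ages]
print(nominal_ages)
```

Fixed thresholds are the simplest choice. Whether they suit a given data set depends on how the values are distributed.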

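3.3 Related concepts

The key question in a decision tree algorithm is which feature to split on. This chapter uses the ID3 algorithm, which picks the feature with the largest information gain as the splitting feature.

1. What is information gain

Information gain builds on the information theory introduced by Shannon. It is the change in information (entropy) between the data set before a split and after it.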

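2. How to compute the information of a class

l(xi) = -log2 p(xi)

p(xi) = (number of samples in class xi) / (total number of samples in the data set)

3. How to compute the entropy

H = Σ p(xi) * l(xi) = -Σ p(xi) * log2 p(xi) (the sum runs over all classes i)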
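
To make the entropy formula concrete, here is a small sketch that evaluates it directly from per-class counts. The helper name entropy and the example counts are made up for illustration:

```python
from math import log

def entropy(class_counts):
    """Shannon entropy in bits for a list of per-class sample counts."""
    total = float(sum(class_counts))
    h = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with no samples contributes nothing
        p = count / total
        h -= p * log(p, 2)
    return h

print(entropy([5, 5]))   # two equally likely classes: 1 bit
print(entropy([10, 0]))  # a pure set: 0 bits
print(entropy([2, 3]))   # roughly 0.971
```

The more evenly the samples are spread over the classes, the higher the entropy. A set that contains a single class has entropy 0.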

3.4 Implementation

  1. Computing the Shannon entropy

 

from math import log
import operator
import pickle

# calculate the entropy
def calcshannonent(dataset):
    datasetnum = len(dataset)
    datalable = {}
    for eachdata in dataset:
        fetchlable = eachdata[-1]
        if fetchlable not in datalable.keys():
            datalable[fetchlable] = 0
        datalable[fetchlable] += 1
    entroy = 0.0
    for eachlable in datalable:
        prob = float(datalable[eachlable])/datasetnum
        entroy -= prob*log(prob, 2)
    return entroy

 

  2. Splitting the data set

# split the data set
def splitdataset(dataset, axis, value):
    retdataset = []
    for each in dataset:
        if each[axis] == value:
            splitdata = each[:axis]
            splitdata.extend(each[axis+1:])
            retdataset.append(splitdata)
    return retdataset
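
A short usage sketch of the splitting step. It re-implements the same idea under the hypothetical name split_rows so that the example runs on its own, and the toy rows are invented for illustration:

```python
def split_rows(rows, axis, value):
    """Keep rows whose column `axis` equals `value`, dropping that column."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

# toy rows: [feature 0, feature 1, class]
rows = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]
left = split_rows(rows, 0, 1)
right = split_rows(rows, 0, 0)
print(left)
print(right)
```

The column used for the split is removed from every returned row, so the subsets get one feature narrower at each level of the tree.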

  3. Choosing the best feature to split on

# choose the best feature
def choosebestfeature(dataset):
    baseentroy = calcshannonent(dataset)
    featurenum = len(dataset[0]) - 1
    bestinformationgain = 0.0
    choosen = -1
    for i in range(featurenum):
        featurelisit = [example[i] for example in dataset]
        listcount = set(featurelisit)
        tempentroy = 0.0
        for j in listcount:
            subdataset = splitdataset(dataset, i, j)
            subentropy = calcshannonent(subdataset)
            prob = len(subdataset)/float(len(dataset))
            tempentroy += prob*subentropy  # accumulate the weighted entropy of every subset
        informationgain = baseentroy - tempentroy
        if informationgain > bestinformationgain:
            bestinformationgain = informationgain
            choosen = i
    return choosen
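
A worked sketch of the quantity this step compares: the information gain of a feature is the base entropy minus the sum of the subset entropies, each weighted by its share of the rows. The names entropy and information_gain and the toy rows are illustrative assumptions:

```python
from collections import Counter
from math import log

def entropy(rows):
    """Shannon entropy (bits) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = float(len(rows))
    return -sum((c / total) * log(c / total, 2) for c in counts.values())

def information_gain(rows, axis):
    """Base entropy minus the weighted sum of the subset entropies."""
    weighted = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        weighted += len(subset) / float(len(rows)) * entropy(subset)
    return entropy(rows) - weighted

# toy rows: [feature 0, feature 1, class]
rows = [
    [1, 1, 'yes'],
    [1, 1, 'yes'],
    [1, 0, 'no'],
    [0, 1, 'no'],
    [0, 1, 'no'],
]
gains = [information_gain(rows, i) for i in range(2)]
print(gains)  # feature 0 has the larger gain on this toy data
```

On these rows ID3 would split on feature 0 first, because that split leaves the purer subsets.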

  4. Deciding the class by majority vote

# vote the class
def votedclass(classlist):
    classcount = {}
    for vote in classlist:
        if vote not in classcount.keys():
            classcount[vote] = 0
        classcount[vote] += 1
    classcountsorted = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return classcountsorted[0][0]
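
The same majority vote can be written more compactly with collections.Counter from the standard library. This is an alternative sketch, not the code from the book, and the name majority_class is hypothetical:

```python
from collections import Counter

def majority_class(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

winner = majority_class(['yes', 'no', 'no', 'yes', 'no'])
print(winner)
```

The vote is only needed when all features have been used up and the remaining rows still disagree on the class.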

  5. Building the decision tree

# create my tree
def createtree(dataset, lables):
    lablename = lables[:]
    classlist = [example[-1] for example in dataset]
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]
    if len(dataset[0]) == 1:
        return votedclass(classlist)
    bestfeature = choosebestfeature(dataset)
    featurelable = lablename[bestfeature]
    del(lablename[bestfeature])
    mytree = {featurelable: {}}
    featurelist = [example[bestfeature] for example in dataset]
    featurelistcount = set(featurelist)
    for value in featurelistcount:
        sublables = lablename[:]
        subdataset = splitdataset(dataset, bestfeature, value)
        mytree[featurelable][value] = createtree(subdataset, sublables)
    return mytree

  6. Classifying an input vector

# classify the vector
def classify(inputtree, lables, tecvector):
    firlb = list(inputtree.keys())[0]  # the single key of a node is its feature name
    subdict = inputtree[firlb]
    firindex = lables.index(firlb)
    classlable = 'something error'
    for key in subdict.keys():
        if tecvector[firindex] == key:
            if isinstance(subdict[key], dict):
                classlable = classify(subdict[key], lables, tecvector)
            else:
                classlable = subdict[key]
    return classlable
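
The tree is a nested dict: each inner node has a single key (the feature name) whose value maps feature values to either a subtree or a class label. The sketch below walks such a tree iteratively rather than recursively. The function name classify_row, the feature names f0 and f1, and the hand-written tree are assumptions made for illustration:

```python
def classify_row(tree, feature_names, row):
    """Walk a nested-dict tree until a leaf (a non-dict value) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))      # the single key is the feature name
        branches = node[feature]
        value = row[feature_names.index(feature)]
        if value not in branches:
            return None                 # value never seen while building the tree
        node = branches[value]
    return node

# hand-written tree, for illustration only
toy_tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}
names = ['f0', 'f1']
print(classify_row(toy_tree, names, [1, 1]))
print(classify_row(toy_tree, names, [0, 1]))
print(classify_row(toy_tree, names, [2, 1]))
```

Returning None for an unseen feature value is one possible design choice. The original function returns the string 'something error' in that situation.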

  7. Storing and loading the decision tree

# store the tree
def storetree(mytree, filename):
    # pickle writes bytes, so the file must be opened in binary mode
    with open(filename, 'wb') as fp:
        pickle.dump(mytree, fp)

# get the tree
def gettree(filename):
    with open(filename, 'rb') as fp:
        return pickle.load(fp)
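
A round-trip sketch of storing a tree with pickle and loading it back. It uses a temporary file so that nothing is left on disk, and the tree itself is hand-written for illustration:

```python
import os
import pickle
import tempfile

# hand-written tree, for illustration only
toy_tree = {'f0': {0: 'no', 1: {'f1': {0: 'no', 1: 'yes'}}}}

# create a temporary file so the example leaves nothing behind
fd, path = tempfile.mkstemp(suffix='.pkl')
os.close(fd)
try:
    with open(path, 'wb') as fp:   # pickle data is bytes, so use binary mode
        pickle.dump(toy_tree, fp)
    with open(path, 'rb') as fp:
        restored = pickle.load(fp)
finally:
    os.remove(path)

print(restored == toy_tree)
```

Saving the tree avoids rebuilding it from the training data every time a vector needs to be classified.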

3.5 Practice: building a decision tree that decides which contact lenses to wear

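import tree  # the module that holds the functions from section 3.4
with open("F:data/machinelearninginaction/Ch03/lenses.txt") as fr:
    lines = fr.readlines()
# each line is tab-separated: four feature values followed by the class
line = [example.strip().split("\t") for example in lines]
datalable = ['age', 'prescript', 'astigmatic', 'tearrate']
mytree = tree.createtree(line, datalable)
print(mytree)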

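3.6 Reflections

1. The functions above expect the data in this format: in every row, the first n-1 elements are feature values and the last element is the class name.

2. The weakness of decision trees is overfitting. The ID3 algorithm used here is simple but has a number of shortcomings. Algorithms such as CART or C4.5 can improve on it, together with appropriate pruning and merging of branches.

3. A decision tree classifier is like a flowchart with terminating blocks, where each terminating block represents a classification result.

4. This code comes from the book Machine Learning in Action. The book's tree-building code contains a mistake: it should first copy lables into a new variable and then work on that copy. Otherwise the function modifies the caller's lables.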
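
A small sketch of why the copy in point 4 matters. Python passes lists by reference, so deleting an element inside a function also changes the caller's list. The two helper names below are hypothetical, and the feature names are the ones used in section 3.5:

```python
def pick_feature_in_place(labels, index):
    """Deletes from the caller's list, which is the side effect to avoid."""
    name = labels[index]
    del labels[index]
    return name

def pick_feature_on_copy(labels, index):
    """Works on a copy, so the caller's list is untouched."""
    local = labels[:]
    name = local[index]
    del local[index]
    return name

names = ['age', 'prescript', 'astigmatic', 'tearrate']
pick_feature_on_copy(names, 0)
print(len(names))   # still 4
pick_feature_in_place(names, 0)
print(len(names))   # now 3
```

If the list is modified in place, a later call that looks up a feature name by index in the same list can fail or return the wrong name.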

 

posted @ 2017-03-24 17:48  whatyouknow123