The Decision Tree Algorithm and Its Implementation
Chapter 3: Decision Trees
3.1 Algorithm Overview
A decision tree works like a data flowchart; the difference is that a decision tree can extract rules from an unfamiliar dataset and use those rules to construct the tree itself.
Advantages: low computational cost, results that are easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
Disadvantage: prone to overfitting.
Applicable data types: numeric and nominal.
3.2 General Workflow
1. Collect the data.
2. Prepare the data: tree construction requires nominal values, so numeric features must be discretized first (see the sketch after this list).
3. Analyze the data.
4. Train the algorithm: build a decision tree from the training data.
5. Test the algorithm.
6. Use the algorithm.
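Step 2 mentions discretization. Here is a minimal sketch of one common approach, equal-width binning; the function name, bin labels, and sample values are my own illustration, not part of the book's code:

# discretize a numeric feature into nominal bins (equal-width, illustrative)
def discretize(values, bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bins) or 1.0  # guard against zero width
    # map each value to a nominal label such as 'bin0', 'bin1', ...
    return ['bin%d' % min(int((v - lo) / width), bins - 1) for v in values]

print(discretize([1.2, 3.5, 2.8, 9.0]))  # ['bin0', 'bin0', 'bin0', 'bin2']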
3.3 Related Concepts
The key to the decision tree algorithm is how to choose the feature to split on. Here we use the ID3 algorithm, which selects the feature with the largest information gain as the splitting feature.
1. What is information gain?
Information gain is a concept from Shannon's information theory: it is the change in information before and after the dataset is split.
2. How is the information of a single class computed?
l(x_i) = -log2 p(x_i)
p(x_i) = count(x_i) / N (the number of examples belonging to class x_i divided by the total number of examples in the dataset)
3. How is the entropy computed?
H = sum_i p(x_i) l(x_i) = -sum_i p(x_i) log2 p(x_i) (where i ranges over all classes)
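A quick worked example (illustrative numbers, not from the book): a dataset of 5 examples containing 2 'yes' and 3 'no' labels has entropy

H = -(2/5) log2(2/5) - (3/5) log2(3/5) ≈ 0.529 + 0.442 ≈ 0.971

so a split that separated the two classes perfectly would have an information gain of about 0.971.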
3.4 Implementation
1. Compute the Shannon entropy
from math import log
import operator
import pickle


# calculate the entropy of the dataset (the class label is the last column)
def calcshannonent(dataset):
    datasetnum = len(dataset)
    datalable = {}
    for eachdata in dataset:
        fetchlable = eachdata[-1]
        if fetchlable not in datalable.keys():
            datalable[fetchlable] = 0
        datalable[fetchlable] += 1
    entroy = 0.0
    for eachlable in datalable:
        prob = float(datalable[eachlable]) / datasetnum
        entroy -= prob * log(prob, 2)
    return entroy
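A quick sanity check against the worked example in section 3.3. The toy dataset below is my own (each row ends with the class label), and it assumes calcshannonent is already defined in the session:

toydata = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(calcshannonent(toydata))  # ~0.9710 (2 'yes' vs 3 'no')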
2. Split the dataset
# split the data set: keep the rows where feature `axis` equals `value`,
# and remove that feature from each returned row
def splitdataset(dataset, axis, value):
    retdataset = []
    for each in dataset:
        if each[axis] == value:
            splitdata = each[:axis]
            splitdata.extend(each[axis+1:])
            retdataset.append(splitdata)
    return retdataset
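Continuing the sanity check with the toydata list from above:

# keep rows whose feature 0 equals 1, stripping feature 0 out
print(splitdataset(toydata, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]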
3. Choose the best splitting feature
# choose the best feature to split on (ID3: maximize the information gain)
def choosebestfeature(dataset):
    baseentroy = calcshannonent(dataset)
    featurenum = len(dataset[0]) - 1
    bestinformationgain = 0.0
    choosen = -1
    for i in range(featurenum):
        featurelisit = [example[i] for example in dataset]
        listcount = set(featurelisit)
        tempentroy = 0.0
        for j in listcount:
            subdataset = splitdataset(dataset, i, j)
            subentropy = calcshannonent(subdataset)
            prob = len(subdataset) / float(len(dataset))
            tempentroy += prob * subentropy  # accumulate the weighted entropy
        informationgain = baseentroy - tempentroy
        if informationgain > bestinformationgain:
            bestinformationgain = informationgain
            choosen = i
    return choosen
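On the toy dataset above, splitting on feature 0 gives a gain of about 0.420 versus about 0.171 for feature 1, so the function should return 0:

print(choosebestfeature(toydata))  # 0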
4. Majority vote on the class
# vote: return the majority class among the remaining examples
def votedclass(classlist):
    classcount = {}
    for vote in classlist:
        if vote not in classcount.keys():
            classcount[vote] = 0
        classcount[vote] += 1
    classcountsorted = sorted(classcount.items(), key=operator.itemgetter(1), reverse=True)
    return classcountsorted[0][0]
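A one-line check of the voting helper:

print(votedclass(['yes', 'no', 'no']))  # 'no'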
5. Build the decision tree
# create my tree recursively
def createtree(dataset, lables):
    lablename = lables[:]  # work on a copy so the caller's label list is unchanged
    classlist = [example[-1] for example in dataset]
    if classlist.count(classlist[0]) == len(classlist):
        return classlist[0]  # all examples share one class: leaf node
    if len(dataset[0]) == 1:
        return votedclass(classlist)  # no features left: majority vote
    bestfeature = choosebestfeature(dataset)
    featurelable = lablename[bestfeature]
    del(lablename[bestfeature])
    mytree = {featurelable: {}}
    featurelist = [example[bestfeature] for example in dataset]
    featurelistcount = set(featurelist)
    for value in featurelistcount:
        sublables = lablename[:]
        subdataset = splitdataset(dataset, bestfeature, value)
        mytree[featurelable][value] = createtree(subdataset, sublables)
    return mytree
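Building a tree from the toy dataset (the feature names here are my own placeholders):

toylables = ['feature0', 'feature1']
mytree = createtree(toydata, toylables)
print(mytree)  # {'feature0': {0: 'no', 1: {'feature1': {0: 'no', 1: 'yes'}}}}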
6. Classify an input vector
# classify the vector by walking down the tree
def classify(inputtree, lables, tecvector):
    firlb = list(inputtree.keys())[0]
    subdict = inputtree[firlb]
    firindex = lables.index(firlb)
    classlable = 'something error'
    for key in subdict.keys():
        if tecvector[firindex] == key:
            if isinstance(subdict[key], dict):
                classlable = classify(subdict[key], lables, tecvector)  # internal node: recurse
            else:
                classlable = subdict[key]  # leaf node: the class label
    return classlable
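Classifying two vectors with the toy tree from step 5:

print(classify(mytree, toylables, [1, 0]))  # 'no'
print(classify(mytree, toylables, [1, 1]))  # 'yes'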
7. Store and load the decision tree
# store the tree on disk with pickle
def storetree(mytree, filename):
    fp = open(filename, 'wb')
    pickle.dump(mytree, fp)
    fp.close()


# get the tree back from disk
def gettree(filename):
    fp = open(filename, 'rb')
    return pickle.load(fp)
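A round trip through the two helpers (the filename is illustrative):

storetree(mytree, 'mytree.pkl')
print(gettree('mytree.pkl') == mytree)  # True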
3.5 In Practice: Building a Decision Tree for Choosing Contact Lens Type
import tree

fr = open("F:data/machinelearninginaction/Ch03/lenses.txt")
lines = fr.readlines()
line = [example.strip().split("\t") for example in lines]
datalable = ['age', 'prescript', 'astigmatic', 'tearrate']
mytree = tree.createtree(line, datalable)
print(mytree)
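The resulting tree can then classify a new wearer. This assumes classify from section 3.4 lives in the same tree module, and the attribute values below are my assumption about the vocabulary used in lenses.txt, so adjust them to match the actual file:

# e.g. a young, myopic, non-astigmatic patient with a reduced tear rate
print(tree.classify(mytree, datalable, ['young', 'myope', 'no', 'reduced']))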
3.6 Reflections
1. The data format assumed by the tree built above: in each row, the first n-1 elements are feature values and the last element is the class name.
2. The main weakness of decision trees is overfitting. The ID3 algorithm used here is simple but has many shortcomings; the CART and C4.5 algorithms improve on it, for example with appropriate pruning and merging of branches (see the sketch below).
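To make C4.5's improvement concrete: it replaces raw information gain with the gain ratio, which divides the gain by the "split information" of the partition and so penalizes features with many distinct values. A minimal sketch, assuming the functions from section 3.4 (and its from math import log) are in scope; gainratio is my own helper name:

# gain ratio (C4.5) = information gain / split information
def gainratio(dataset, axis):
    gain = calcshannonent(dataset)
    splitinfo = 0.0
    for value in set(example[axis] for example in dataset):
        subdataset = splitdataset(dataset, axis, value)
        prob = len(subdataset) / float(len(dataset))
        gain -= prob * calcshannonent(subdataset)  # subtract the weighted entropy
        splitinfo -= prob * log(prob, 2)
    return gain / splitinfo if splitinfo else 0.0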
3. A decision tree classifier is like a flowchart with terminating blocks; the terminating blocks represent the classification results.
4. This code comes from the book Machine Learning in Action. The book's tree-building code has a flaw: the labels list should first be copied into a new variable, and that new variable used from then on; otherwise createtree modifies the caller's lables list (the version in section 3.4 already includes this fix).