Decision Trees: From Principles to Implementation

I. Introduction

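Decision trees show up in nearly every introductory machine learning book. Their decision process closely resembles the way we reason day to day, so they are easy to understand, and because they rest on a bit of information theory they also make a good first application of it. Decision trees can do both regression and classification, but the discussion usually centers on classification, as it does here.

As I see it, a decision tree extracts features from a body of data and orders them from most to least discriminating. A familiar application is the guessing game: you answer a series of questions and the program works out what you have in mind.

Why is the first question always "male or female"? Read on and you will see.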

 

II. Code

from math import log
import operator
import pickle

def createDataSet():
    # Toy data set: two binary features followed by the class label
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']   # feature names, in column order
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:     # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts: labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1       # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):            # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)          # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     # info gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:         # compare this to the best gain so far
            bestInfoGain = infoGain         # if better than current best, set to best
            bestFeature = i
    return bestFeature                      # feature index, or -1 if no split reduces entropy

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    # items() replaces the Python 2-only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]     # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:    # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:          # no feature reduces entropy: fall back to a majority vote
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # drop the used label in a copy, so the caller's list is left untouched
    subLabels = labels[:bestFeat] + labels[bestFeat+1:]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))    # root feature name; keys() is not indexable in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree, filename):
    with open(filename, 'wb') as fw:    # pickle needs binary mode in Python 3
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)


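III. The Algorithm in Detail

❤ Information gain

Information gain is the reduction in Shannon entropy produced by a split, so the first building block is a function that takes a data set and returns its entropy.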

from math import log

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:     # count the occurrences of each class label
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts: labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)   # log base 2
    return shannonEnt

Once we can compute the entropy, we split the data set on whichever feature yields the largest information gain.

E.g., run it on the following data set:

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

labelCounts is a dict (a map) that ends up holding:

currentLabel   labelCounts[currentLabel]   prob
yes            2                           0.4
no             3                           0.6

The entropy formula then gives -(0.4*log2(0.4) + 0.6*log2(0.6)) = 0.971
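
To double-check that 0.971, here is a minimal standalone sketch that recomputes the entropy from the class column alone. The helper name `entropy` and the use of `collections.Counter` are my own choices for illustration; they are not part of the listing in this post.

```python
from collections import Counter
from math import log2

def entropy(class_labels):
    """Shannon entropy, in bits, of a list of class labels (hypothetical helper)."""
    total = len(class_labels)
    counts = Counter(class_labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# Class column of the sample data set: 2 x 'yes', 3 x 'no'
sample_labels = ['yes', 'yes', 'no', 'no', 'no']
h = entropy(sample_labels)
print(round(h, 3))  # 0.971
```

A data set with a single class has entropy 0, and a 50/50 split of two classes has entropy 1 bit, so 0.971 says the sample is close to maximally mixed.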

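❤ Splitting the data set

  ※ Splitting on a given feature

  Input: the data set, the index axis of the feature to split on (counting from 0), and a value of that feature

  Output: the subset of rows whose feature axis equals that value, with that feature column removed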

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     # chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

E.g., with myDat as

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

passing in (myDat, 0, 1) returns

    [[1, 'yes'], [1, 'yes'], [0, 'no']]
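
The same split can be written as a single list comprehension. The sketch below is an equivalent rewrite under my own naming (`split_data_set` is hypothetical, not the function from the listing), and it shows both branches of a split on feature 0.

```python
def split_data_set(data_set, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

my_dat = [[1, 1, 'yes'],
          [1, 1, 'yes'],
          [1, 0, 'no'],
          [0, 1, 'no'],
          [0, 1, 'no']]

left = split_data_set(my_dat, 0, 1)    # rows where feature 0 == 1
right = split_data_set(my_dat, 0, 0)   # rows where feature 0 == 0
print(left)    # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(right)   # [[1, 'no'], [1, 'no']]
```

Slicing builds new lists, so `my_dat` itself is left unchanged. That matters because the same data set gets split again for every candidate feature.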

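  ※ Choosing the best way to split the data set

  Input: the data set

  Output: the index of the feature whose split yields the largest information gain, i.e. the largest drop in entropy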

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1       # the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):            # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)          # get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     # info gain, i.e. reduction in entropy
        if infoGain > bestInfoGain:         # compare this to the best gain so far
            bestInfoGain = infoGain         # if better than current best, set to best
            bestFeature = i
    return bestFeature                      # feature index, or -1 if no split reduces entropy
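
E.g., with myDat as

    [[1, 1, 'yes'],
     [1, 1, 'yes'],
     [1, 0, 'no'],
     [0, 1, 'no'],
     [0, 1, 'no']]

pass in (myDat).

First pass:  split on the first feature with value 1,
             split on the first feature with value 0,
             then take the size-weighted entropy of the two subsets.
Second pass: split on the second feature with value 1,
             split on the second feature with value 0,
             then take the size-weighted entropy of the two subsets.
......
Pick the feature with the largest information gain, that is, the one whose split leaves the lowest weighted entropy (not the highest).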
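
To make the two passes concrete, the sketch below computes the information gain of each feature on the sample data. It is a compact standalone rewrite, not the listing's code: the names `entropy` and `info_gain` are hypothetical, and the subsets keep the split column because only the class label in the last position matters for the entropy.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy, in bits, of the class labels (last column) of `rows`."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def info_gain(rows, axis):
    """Entropy of `rows` minus the size-weighted entropy after splitting on `axis`."""
    after = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - after

my_dat = [[1, 1, 'yes'],
          [1, 1, 'yes'],
          [1, 0, 'no'],
          [0, 1, 'no'],
          [0, 1, 'no']]

gains = [info_gain(my_dat, i) for i in range(2)]
best = gains.index(max(gains))
print([round(g, 3) for g in gains])  # [0.42, 0.171]
print(best)                          # 0
```

Feature 0 wins: splitting on it leaves one pure subset (two 'no' rows) and one subset with entropy 0.918, for a weighted entropy of 0.551, whereas splitting on feature 1 leaves a weighted entropy of 0.8.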
  

❤ Building the decision tree recursively

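import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount: classCount[vote] = 0
        classCount[vote] += 1
    # items() replaces the Python 2-only iteritems()
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

 What it does: takes a list of class labels, counts how often each one occurs, and returns the most frequent one. createTree falls back on it when the features run out before the classes are pure.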
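
For comparison, the standard library can do the same vote in one line. This is only an alternative sketch; `majority_class` is a name I made up, and on a tie `Counter.most_common` returns whichever label it encountered first.

```python
from collections import Counter

def majority_class(class_list):
    """Most frequent label in `class_list` (hypothetical stand-in for majorityCnt)."""
    return Counter(class_list).most_common(1)[0][0]

winner = majority_class(['yes', 'no', 'no'])
print(winner)  # no
```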

 

O(∩_∩)O~ Time to build the tree

It takes two arguments: the data set and the list of feature labels.

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]     # stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1:    # stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    if bestFeat == -1:          # no feature reduces entropy: fall back to a majority vote
        return majorityCnt(classList)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    # drop the used label in a copy, so the caller's list is left untouched
    subLabels = labels[:bestFeat] + labels[bestFeat+1:]
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
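
The tree comes back as nested dicts: each internal node is a dict with one key (the feature name) mapping to a dict of feature value to subtree, and each leaf is a class label. That gets hard to read as it grows, so here is a small sketch that prints it as indented text. `format_tree` is a hypothetical helper of mine, and the tree literal is the one this post's test run produces for the sample data.

```python
def format_tree(tree, indent=0):
    """Return the nested-dict tree as a list of indented text lines."""
    pad = '  ' * indent
    if not isinstance(tree, dict):      # leaf: a class label
        return [pad + '-> ' + str(tree)]
    lines = []
    feature = next(iter(tree))          # an internal node has exactly one key
    for value, subtree in tree[feature].items():
        lines.append('%s%s = %s' % (pad, feature, value))
        lines.extend(format_tree(subtree, indent + 1))
    return lines

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print('\n'.join(format_tree(my_tree)))
# no surfacing = 0
#   -> no
# no surfacing = 1
#   flippers = 0
#     -> no
#   flippers = 1
#     -> yes
```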

O(∩_∩)O~~ Now the tree can be used to make decisions

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))    # root feature name; keys() is not indexable in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # column of that feature in testVec
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]       # follow the branch for this feature value
    if isinstance(valueOfFeat, dict):   # subtree: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat      # leaf: this is the class label
    return classLabel

 A test run (with the listing above saved as trees.py):

>>> import trees
>>> myDat,labels=trees.createDataSet()
>>> myTree=trees.createTree(myDat,labels)
>>> myTree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
>>> lab=trees.classify(myTree,['no surfacing','flippers'],[0,1])
>>> lab
'no'
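
The full listing also defines storeTree and grabTree, which the test run never exercises. The sketch below shows the same pickle round trip on its own, so a tree can be built once and reused. The file name `tree.pkl` and the temporary directory are assumptions for the example; note that in Python 3 pickle files must be opened in binary mode.

```python
import os
import pickle
import tempfile

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# Work in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tree.pkl')   # hypothetical file name
    with open(path, 'wb') as fw:               # 'wb': pickle writes bytes
        pickle.dump(my_tree, fw)
    with open(path, 'rb') as fr:               # 'rb': pickle reads bytes
        restored = pickle.load(fr)

print(restored == my_tree)  # True
```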

 

Posted on 2017-09-12 11:52 by 小嘤嘤
