Decision Tree
决策树(Decision Tree)算法是一种基本的分类与回归方法
分类决策树模型是一种描述对实例进行分类的树形结构。决策树由结点(node)和有向边(directed edge)组成。结点有两种类型:内部结点(internal node)和叶结点(leaf node)。内部结点表示一个特征或属性(features),叶结点表示一个类(labels)。
优缺点:
优点:计算复杂度不高,输出结果易于理解,数据有缺失也能跑,可以处理不相关特征。
缺点:容易过拟合。
适用数据类型:数值型和标称型。
流程伪代码:
def createBranch(): 检测数据集中的所有数据的分类标签是否相同: If so return 类标签 Else: 寻找划分数据集的最好特征(划分之后信息熵最小,也就是信息增益最大的特征) 划分数据集 创建分支节点 for 每个划分的子集 调用函数 createBranch (创建分支的函数)并增加返回结果到分支节点中 return 分支节点
核心代码:
1. 信息熵的计算:
def calcShannonEnt(dataSet): numEntries = len(dataSet) labelCounts = {} for featVec in dataSet: currentLabel = featVec[-1] if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0 labelCounts[currentLabel] += 1 shannonEnt = 0.0 for key in labelCounts: prob = float(labelCounts[key])/numEntries shannonEnt -= prob * log(prob, 2) return shannonEnt
2. 选择切分数据集的最佳特征:
def chooseBestFeatureToSplit(dataSet): numFeatures = len(dataSet[0]) - 1 baseEntropy = calcShannonEnt(dataSet) bestInfoGain, bestFeature = 0.0, -1 for i in range(numFeatures): featList = [example[i] for example in dataSet] uniqueVals = set(featList) newEntropy = 0.0 for value in uniqueVals: subDataSet = splitDataSet(dataSet, i, value) prob = len(subDataSet)/float(len(dataSet)) newEntropy += prob * calcShannonEnt(subDataSet) infoGain = baseEntropy - newEntropy print('infoGain=', infoGain, 'bestFeature=', i, baseEntropy, newEntropy) if (infoGain > bestInfoGain): bestInfoGain = infoGain bestFeature = i return bestFeature
3. 创建决策树:
def createTree(dataSet, labels): classList = [example[-1] for example in dataSet] if classList.count(classList[0]) == len(classList): return classList[0] if len(dataSet[0]) == 1: return majorityCnt(classList) bestFeat = chooseBestFeatureToSplit(dataSet) bestFeatLabel = labels[bestFeat] myTree = {bestFeatLabel: {}} del(labels[bestFeat]) featValues = [example[bestFeat] for example in dataSet] uniqueVals = set(featValues) for value in uniqueVals: subLabels = labels[:] myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels) return myTree