Data Mining Algorithms (2) -- Decision Trees

Data mining algorithm study notes series
Data Mining Algorithms (1) -- K-Nearest Neighbors (KNN)
Data Mining Algorithms (2) -- Decision Trees
Data Mining Algorithms (3) -- Logistic Regression

Algorithm Overview

First, let's look at a small dataset:

Age Education   Income  Status  Purchase?
36-55   master's    high    single  will buy
18-35   high school low single  will buy
36-55   master's    low single  won't buy
18-35   bachelor's  high    single  won't buy
gt 18   high school low single  will buy
18-35   bachelor's  high    married won't buy

Suppose we want to classify this data, predicting from age, education, income, and marital status whether a person will buy. Let's first split the rows above by marital status:

# single
Age Education   Income  Status  Purchase?
36-55   master's    high    single  will buy
18-35   high school low single  will buy
36-55   master's    low single  won't buy
18-35   bachelor's  high    single  won't buy
gt 18   high school low single  will buy
# married
Age Education   Income  Status  Purchase?
18-35   bachelor's  high    married won't buy

The single group can then be split further by income:

# high income
Age Education   Income  Status  Purchase?
18-35   bachelor's  high    single  won't buy
36-55   master's    high    single  will buy
# low income
Age Education   Income  Status  Purchase?
18-35   high school low single  will buy
36-55   master's    low single  won't buy
gt 18   high school low single  will buy

Then we keep splitting on the remaining features (age and education) until every group of rows is classified. This builds up a tree structure.
[Figure: the resulting decision tree]

This splitting process is exactly how a decision tree is built. At each split we have to choose one feature (age, education, income, or marital status) to split on. Which one is best? Entropy gives us a criterion for choosing.

Information entropy is a measure of the uncertainty of a random variable. For a discrete random variable X with probability mass function p(x), the entropy of X is defined as

H(X) = -∑_x p(x) · log₂ p(x)
Using a base-2 logarithm, entropy is measured in bits; if the logarithm is taken to base e, the unit is the nat. Intuitively, the entropy is the average number of bits needed to describe the random variable.
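
For example, the six sample rows above contain three 'will buy' and three "won't buy" labels, so the entropy of the Purchase? column is

H = -(3/6)·log₂(3/6) - (3/6)·log₂(3/6) = 1 bit

A 50/50 label split is maximally uncertain; a column containing only one label has entropy 0.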

To put entropy to use when choosing features, we turn to the ID3 algorithm.

The ID3 algorithm starts with the original dataset S as the root node. On each iteration it loops over every unused feature of S and computes the entropy H(S) (or, equivalently, the information gain IG(S)) produced by splitting on that feature. It then selects the feature with the minimum entropy (maximum information gain), and splits S on that feature's values (for example, high income vs. low income) to produce subsets of the data. The algorithm continues recursively on each subset, considering only features that were never selected before.

Algorithm steps (a worked gain calculation follows the list):
1. Using the dataset S, compute the entropy produced by splitting on each remaining attribute.
2. Split S into subsets using the attribute for which the entropy after splitting is minimal (equivalently, the information gain is maximal).
3. Create a decision tree node for that attribute.
4. Recurse on each subset using the remaining attributes.
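
As a concrete example of steps 1 and 2, consider splitting the six sample rows on Status. The single subset has five rows (three 'will buy', two "won't buy") and the married subset has one row:

H(single) = -(3/5)·log₂(3/5) - (2/5)·log₂(2/5) ≈ 0.971
H(married) = 0    (only one label)
H(S | Status) = (5/6)·0.971 + (1/6)·0 ≈ 0.809
IG(S, Status) = 1 - 0.809 ≈ 0.191

ID3 evaluates this quantity for every feature and splits on the one with the largest gain.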

ID3 does not guarantee an optimal solution; it can get stuck in a local optimum. It is greedy, selecting the locally best attribute to split the dataset on at each iteration.
ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. The algorithm usually produces small trees, but it does not always produce the smallest possible tree.
ID3 is hard to use on continuous data. If the values of a given attribute are continuous, there are many more candidate points at which to split the data on that attribute, and searching for the best split value can be time-consuming.

Code Implementation

Environment used for the code in this post:
python: 3.5.1
pandas: 0.19.2
sklearn: 0.18.1
Other environments may behave slightly differently.

Step 1: read the data.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import operator
import math

# read the tab-separated training data
dataSet = pd.read_csv("./tree.txt", sep='\t')
labels = list(dataSet)  # names of all columns
labels = labels[0:-1]   # feature names only (the last column is the label)
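
For reference, tree.txt is assumed here to be a tab-separated file whose header and rows match the sample table at the top of the post (the downloadable file may contain more rows):

Age	Education	Income	Status	Purchase?
36-55	master's	high	single	will buy
18-35	high school	low	single	will buy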

Next, entropy is computed with the Shannon entropy formula given earlier:

def calcShannonEnt(dataFrame):
    # entropy of the class label column (the last column)
    numEntries = len(dataFrame)
    labelCounts = {}
    for index, row in dataFrame.iterrows():
        currentLabel = row.iloc[-1]  # the last column holds the class label
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)  # log base 2
    return shannonEnt
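
On the six sample rows shown earlier (three rows of each label), this function returns 1.0, matching the hand calculation above.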

Splitting the DataFrame:

def splitDataSet(dataFrame, column_name, value):
    # keep only the rows where column_name equals value
    retDataSet = dataFrame[dataFrame[column_name] == value]
    retDataSet = retDataSet.drop(column_name, axis=1)  # axis=1 drops a column, axis=0 a row
    return retDataSet
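
For example, with the sample data loaded as dataSet, the following selects the married rows (here just one) and removes the used Status column:

marriedSubset = splitDataSet(dataSet, 'Status', 'married')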

Choose the best feature to split on, returning the winning column name:

def chooseBestFeatureToSplit(dataFrame):
    baseEntropy = calcShannonEnt(dataFrame)
    bestInfoGain = -100  # start below any achievable gain
    bestFeature = ""
    FeatureNames = list(dataFrame)
    for FeatureName in FeatureNames[0:-1]:  # the last column is the label
        featList = dataFrame[FeatureName]
        uniqueVals = set(featList)  # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataFrame, FeatureName, value)
            prob = subDataSet.shape[0] / dataFrame.shape[0]
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # keep the feature with the largest gain so far
            bestInfoGain = infoGain
            bestFeature = FeatureName
    return bestFeature  # return the column name
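
On the six sample rows, for instance, Education yields the largest information gain (about 0.667 bits, since two of its three values have pure labels), so this function would return 'Education' for that subset.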

def majorityCnt(classList):
    # return the most frequent class label (used when no features remain)
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
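
For instance, majorityCnt(pd.Series(["will buy", "won't buy", "will buy"])) returns "will buy".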

# build the decision tree recursively
def createTree(dataFrame):
    classList = dataFrame.iloc[:, -1]  # the label column
    if len(classList.unique()) == 1:  # all remaining rows share one class: make a leaf
        return classList.unique()[0]
    if len(list(dataFrame)) == 1:  # only the label column is left: majority vote
        return majorityCnt(classList)
    bestFeatLabel = chooseBestFeatureToSplit(dataFrame)
    myTree = {bestFeatLabel: {}}  # an internal node is a dict keyed by the chosen feature
    featValues = dataFrame[bestFeatLabel]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        DataSubset = splitDataSet(dataFrame, bestFeatLabel, value)
        myTree[bestFeatLabel][value] = createTree(DataSubset)  # one branch per feature value
    return myTree


def classify(inputTree, featLabels, testVec):
    # walk the tree, following the branch that matches the test vector at each node
    keys_list = list(inputTree.keys())
    firstStr = keys_list[0]  # the feature this node splits on
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of the feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):  # internal node: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:  # leaf node: this is the predicted class
        classLabel = valueOfFeat
    return classLabel

tree = createTree(dataSet)
print(classify(tree, labels, ["36-55", "master's", "high", "single"]))
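
With the tree printed below, this test vector descends Status = single, then Income = high, then Age = 36-55, and prints the 'will buy' leaf.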

The finished tree is stored in nested dicts. Pretty-printed for readability, it looks like this:

{
    'Status': {
        'married': {
            'Income': {
                'high': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": 'will buy',
                        "bachelor's": 'will buy'
                    }
                },
                'low': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        },
        'single': {
            'Income': {
                'high': {
                    'Age': {
                        '36-55': 'will buy',
                        'gt 55': "won't buy",
                        'lt 18': "won't buy"
                    }
                },
                'low': {
                    'Education': {
                        'high school': "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        }
    }
}

Code and test data download:
Link: http://pan.baidu.com/s/1hsOTpfm  Password: en6g

References:
1. Machine Learning in Action (《机器学习实战》)
2. http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=1
