Data Mining Algorithms (2) -- Decision Trees

Data mining algorithm study notes series
Data Mining Algorithms (1) -- K-Nearest Neighbors (KNN)
Data Mining Algorithms (2) -- Decision Trees
Data Mining Algorithms (3) -- Logistic Regression

Algorithm Overview

First, let's look at a small dataset:

Age Education   Income  Status  Purchase?
36-55   master's    high    single  will buy
18-35   high school low single  will buy
36-55   master's    low single  won't buy
18-35   bachelor's  high    single  won't buy
gt 18   high school low single  will buy
18-35   bachelor's  high    married won't buy

Suppose we want to classify this data, predicting from age, education, income, and marital status whether a person will buy. Let's first split the rows above by marital status:

# single
Age Education   Income  Status  Purchase?
36-55   master's    high    single  will buy
18-35   high school low single  will buy
36-55   master's    low single  won't buy
18-35   bachelor's  high    single  won't buy
gt 18   high school low single  will buy
# married
Age Education   Income  Status  Purchase?
18-35   bachelor's  high    married won't buy

The single group can then be split further by income:

# high income
Age Education   Income  Status  Purchase?
18-35   bachelor's  high    single  won't buy
36-55   master's    high    single  will buy
# low income
Age Education   Income  Status  Purchase?
18-35   high school low single  will buy
36-55   master's    low single  won't buy
gt 18   high school low single  will buy

Then we keep splitting on the remaining features (age and education) until every group of rows is classified. This builds up a tree structure.
[Figure: the resulting decision tree]

This splitting process is exactly how a decision tree is built. At each split we have to choose one feature (age, education, income, or marital status) to split on. Which one is best? Entropy gives us a criterion for choosing.

Information entropy is a measure of the uncertainty of a random variable. For a discrete random variable X with probability mass function p(x), the entropy of X is defined as

H(X) = -∑_x p(x) · log₂ p(x)
Using a base-2 logarithm, entropy is measured in bits; if the logarithm is taken to base e, the unit is the nat. Intuitively, the entropy is the average number of bits needed to describe the random variable.
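
For example, the six sample rows above contain three 'will buy' and three "won't buy" labels, so the entropy of the Purchase? column is

H = -(3/6)·log₂(3/6) - (3/6)·log₂(3/6) = 1 bit

A 50/50 label split is maximally uncertain; a column containing only one label has entropy 0.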

To put entropy to use when choosing features, we turn to the ID3 algorithm.

The ID3 algorithm starts with the original dataset S as the root node. On each iteration it loops over every unused feature of S and computes the entropy H(S) (or, equivalently, the information gain IG(S)) produced by splitting on that feature. It then selects the feature with the minimum entropy (maximum information gain), and splits S on that feature's values (for example, high income vs. low income) to produce subsets of the data. The algorithm continues recursively on each subset, considering only features that were never selected before.

Algorithm steps (a worked gain calculation follows the list):
1. Using the dataset S, compute the entropy produced by splitting on each remaining attribute.
2. Split S into subsets using the attribute for which the entropy after splitting is minimal (equivalently, the information gain is maximal).
3. Create a decision tree node for that attribute.
4. Recurse on each subset using the remaining attributes.
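
As a concrete example of steps 1 and 2, consider splitting the six sample rows on Status. The single subset has five rows (three 'will buy', two "won't buy") and the married subset has one row:

H(single) = -(3/5)·log₂(3/5) - (2/5)·log₂(2/5) ≈ 0.971
H(married) = 0    (only one label)
H(S | Status) = (5/6)·0.971 + (1/6)·0 ≈ 0.809
IG(S, Status) = 1 - 0.809 ≈ 0.191

ID3 evaluates this quantity for every feature and splits on the one with the largest gain.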

ID3 does not guarantee an optimal solution; it can get stuck in a local optimum. It is greedy, selecting the locally best attribute to split the dataset on at each iteration.
ID3 can overfit the training data. To avoid overfitting, smaller decision trees should be preferred over larger ones. The algorithm usually produces small trees, but it does not always produce the smallest possible tree.
ID3 is hard to use on continuous data. If the values of a given attribute are continuous, there are many more candidate points at which to split the data on that attribute, and searching for the best split value can be time-consuming.

Code Implementation

Environment used for the code in this post:
python: 3.5.1
pandas: 0.19.2
sklearn: 0.18.1
Other environments may behave slightly differently.

Step 1: read the data.

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import operator
import math

# read the tab-separated training data
dataSet = pd.read_csv("./tree.txt", sep='\t')
labels = list(dataSet)  # names of all columns
labels = labels[0:-1]   # feature names only (the last column is the label)
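
For reference, tree.txt is assumed here to be a tab-separated file whose header and rows match the sample table at the top of the post (the downloadable file may contain more rows):

Age	Education	Income	Status	Purchase?
36-55	master's	high	single	will buy
18-35	high school	low	single	will buy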

Next, entropy is computed with the Shannon entropy formula given earlier:

def calcShannonEnt(dataFrame):
    # entropy of the class label column (the last column)
    numEntries = len(dataFrame)
    labelCounts = {}
    for index, row in dataFrame.iterrows():
        currentLabel = row.iloc[-1]  # the last column holds the class label
        labelCounts[currentLabel] = labelCounts.get(currentLabel, 0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * math.log(prob, 2)  # log base 2
    return shannonEnt
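
On the six sample rows shown earlier (three rows of each label), this function returns 1.0, matching the hand calculation above.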

Splitting the DataFrame:

def splitDataSet(dataFrame, column_name, value):
    # keep only the rows where column_name equals value
    retDataSet = dataFrame[dataFrame[column_name] == value]
    retDataSet = retDataSet.drop(column_name, axis=1)  # axis=1 drops a column, axis=0 a row
    return retDataSet
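
For example, with the sample data loaded as dataSet, the following selects the married rows (here just one) and removes the used Status column:

marriedSubset = splitDataSet(dataSet, 'Status', 'married')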

Choose the best feature to split on, returning the winning column name:

def chooseBestFeatureToSplit(dataFrame):
    baseEntropy = calcShannonEnt(dataFrame)
    bestInfoGain = -100  # start below any achievable gain
    bestFeature = ""
    FeatureNames = list(dataFrame)
    for FeatureName in FeatureNames[0:-1]:  # the last column is the label
        featList = dataFrame[FeatureName]
        uniqueVals = set(featList)  # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataFrame, FeatureName, value)
            prob = subDataSet.shape[0] / dataFrame.shape[0]
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # keep the feature with the largest gain so far
            bestInfoGain = infoGain
            bestFeature = FeatureName
    return bestFeature  # return the column name
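
On the six sample rows, for instance, Education yields the largest information gain (about 0.667 bits, since two of its three values have pure labels), so this function would return 'Education' for that subset.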

def majorityCnt(classList):
    # return the most frequent class label (used when no features remain)
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
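
For instance, majorityCnt(pd.Series(["will buy", "won't buy", "will buy"])) returns "will buy".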

# build the decision tree recursively
def createTree(dataFrame):
    classList = dataFrame.iloc[:, -1]  # the label column
    if len(classList.unique()) == 1:  # all remaining rows share one class: make a leaf
        return classList.unique()[0]
    if len(list(dataFrame)) == 1:  # only the label column is left: majority vote
        return majorityCnt(classList)
    bestFeatLabel = chooseBestFeatureToSplit(dataFrame)
    myTree = {bestFeatLabel: {}}  # an internal node is a dict keyed by the chosen feature
    featValues = dataFrame[bestFeatLabel]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        DataSubset = splitDataSet(dataFrame, bestFeatLabel, value)
        myTree[bestFeatLabel][value] = createTree(DataSubset)  # one branch per feature value
    return myTree


def classify(inputTree, featLabels, testVec):
    # walk the tree, following the branch that matches the test vector at each node
    keys_list = list(inputTree.keys())
    firstStr = keys_list[0]  # the feature this node splits on
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of the feature in the test vector
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):  # internal node: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:  # leaf node: this is the predicted class
        classLabel = valueOfFeat
    return classLabel

tree = createTree(dataSet)
print(classify(tree, labels, ["36-55", "master's", "high", "single"]))
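
With the tree printed below, this test vector descends Status = single, then Income = high, then Age = 36-55, and prints the 'will buy' leaf.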

The finished tree is stored in nested dicts. Pretty-printed for readability, it looks like this:

{
    'Status': {
        'married': {
            'Income': {
                'high': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": 'will buy',
                        "bachelor's": 'will buy'
                    }
                },
                'low': {
                    'Education': {
                        'high school': 'will buy',
                        "master's": "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        },
        'single': {
            'Income': {
                'high': {
                    'Age': {
                        '36-55': 'will buy',
                        'gt 55': "won't buy",
                        'lt 18': "won't buy"
                    }
                },
                'low': {
                    'Education': {
                        'high school': "won't buy",
                        "bachelor's": 'will buy'
                    }
                }
            }
        }
    }
}

Code and test data download:
Link: http://pan.baidu.com/s/1hsOTpfm  Password: en6g

References:
1. Machine Learning in Action (《机器学习实战》)
2. http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html?page=1
