Building a Decision Tree for Weather Data

Goal

As shown in the figure below, we build a decision tree from the weather data in the figure, after one preprocessing step: temperature and humidity are each divided into four ranges, and the values 0-3 replace the original readings as features (a minimal sketch of this mapping follows the list). The correspondence is:

  • [60, 70) → 0
  • [70, 80) → 1
  • [80, 90) → 2
  • [90, 100) → 3
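
A minimal sketch of this preprocessing (the helper name discretize is an assumption, not part of the original code; the dataset() function below already stores the bucketed values directly):

def discretize(value):
    # Map a raw reading in [60, 100) to a bucket 0-3: [60,70)->0 ... [90,100)->3
    return (value - 60) // 10

print(discretize(85))  # 2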

Building the Decision Tree

Tree Construction

The tree is created by recursive iteration, which stops when either of the following conditions is met:

  • If the remaining data to be classified all belong to one class, stop recursing and return that class.
  • If every feature has already been used for splitting and the remaining data still do not belong to a single class, take the most frequent class as that leaf's class.

Each iteration picks the best feature for the current split; the subsets produced by that feature are then each passed into the next iteration, until one of the stopping conditions holds. A skeleton of this recursion is sketched below; the full implementation appears in the Code section.
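
A minimal sketch of the stopping logic (standalone Python; build_tree is a hypothetical name, and the recursive step is elided here):

from collections import Counter

def build_tree(rows, feature_names):
    # rows: list of [feature_1, ..., feature_n, class_label]
    labels = [row[-1] for row in rows]
    # stop condition 1: every remaining row belongs to the same class
    if labels.count(labels[0]) == len(labels):
        return labels[0]
    # stop condition 2: all features used up -> majority class of what remains
    if len(rows[0]) == 1:
        return Counter(labels).most_common(1)[0][0]
    ...  # otherwise: choose the best feature, split on it, recurse on each subset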

Choosing the Best Feature

The idea is this: a good feature is one whose split leaves each resulting group more ordered, i.e., tending toward a single class. So for each candidate feature we compute the weighted sum of the entropies of the groups produced by splitting on it; the smaller the sum, the better the feature, and the feature with the smallest sum is chosen for the split.
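In general, if feature \(F\) splits the dataset \(D\) into subsets \(D_v\), this score is the conditional entropy

\(H(D\mid F)=\sum_{v}\frac{|D_v|}{|D|}H(D_v)\)

where \(H\) is the Shannon entropy. Since \(H(D)\) is the same for every candidate feature, minimizing this weighted sum is equivalent to maximizing the ID3 information gain \(H(D)-H(D\mid F)\).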
For example, suppose the data [0, 1, 1, 0, 1] is split by feature n1 into two parts, a: [0, 1] and b: [1, 0, 1]. First compute the entropy of each part:

\(a=-\frac{1}{2}\log_2{\frac{1}{2}}-\frac{1}{2}\log_2{\frac{1}{2}}\)
\(b=-\frac{1}{3}\log_2{\frac{1}{3}}-\frac{2}{3}\log_2{\frac{2}{3}}\)

Then compute their weighted sum:

\(\frac{2}{5}a+\frac{3}{5}b\)

This value is the score for splitting on feature n1. The score for feature n2 is computed the same way, and so on through all features; the feature with the lowest score is selected for the split.
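
As a quick numeric check, this standalone snippet (not part of the post's code) computes the score for the example above:

import math

def entropy(labels):
    # Shannon entropy of a list of class labels
    n = len(labels)
    counts = {}
    for x in labels:
        counts[x] = counts.get(x, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

a = [0, 1]       # entropy 1.0
b = [1, 0, 1]    # entropy ~0.918
print(2 / 5 * entropy(a) + 3 / 5 * entropy(b))  # ~0.951, the score for n1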
The constructed tree is:

{'天气': {'晴': {'湿度': {1: '适合', 2: '不适合', 3: '不适合'}}, '多云': '适合', '有雨': {'风况': {'无': '适合', '有': '不适合'}}}}

A visualization of the tree is shown in the figure below:

Classifying with the Decision Tree

Using the tree constructed above, take the record ['晴', 1, 3, '无'] as an example. First check 天气 (weather), which is 晴, so we follow the 晴 branch; that node tests 湿度 (humidity), which is 3, so the record is classified as 不适合 (not suitable for sport), matching the analysis above. The run output is shown below:
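
The same walk can be reproduced by indexing the tree dict directly (assuming myTree holds the dict shown above):

myTree['天气']['晴']['湿度'][3]   # '不适合'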

Code

import math
import operator
import pickle

# Training data
def dataset():
    # temperature/humidity buckets: [60,70)->0 [70,80)->1 [80,90)->2 [90,100)->3
    labels = ['天气','温度','湿度','风况','运动']  # weather, temperature, humidity, wind, sport
    dataSet = [['晴', 2, 2,'无','不适合'],
               ['晴', 2, 3, '有', '不适合'],
               ['多云', 2, 1, '无', '适合'],
               ['有雨', 1, 3, '无', '适合'],
               ['有雨', 0, 2, '无', '适合'],
               ['有雨', 0, 1, '有', '不适合'],
               ['多云', 0, 0, '有', '适合'],
               ['晴', 1, 3, '无', '不适合'],
               ['晴', 0, 1, '无', '适合'],
               ['有雨', 1, 2, '无', '适合'],
               ['晴', 1, 1, '有', '适合'],
               ['多云', 1, 3, '有', '适合'],
               ['多云', 2, 1, '无', '适合'],
               ['有雨', 1, 2, '有', '不适合']]
    return dataSet, labels

# Compute the Shannon entropy of the class labels in dataset
def cal_entropy(dataset):
    length = len(dataset)
    entropy = 0
    count = {}
    for i in dataset:
        label = i[-1]  # the class label is the last column
        count[label] = count.get(label, 0) + 1
    for key in count:
        p = count[key] / length
        entropy = entropy - p * math.log(p, 2)
    return entropy

# Split the dataset: keep the rows whose column axis equals value, dropping that column
def splitDataSet(dataSet, axis, value):
    childDataSet = []
    for i in dataSet:
        if i[axis] == value:
            childList = i[:axis]
            childList.extend(i[axis + 1:])  # drop the used feature column
            childDataSet.append(childList)
    return childDataSet

# Choose the best feature: the one whose split gives the smallest weighted entropy
def chooseFeature(dataset):
    bestEntropy = cal_entropy(dataset)  # entropy before splitting
    character = 0
    for i in range(len(dataset[0]) - 1):  # the last column is the class label
        newEntropy = 0
        featureList = [word[i] for word in dataset]
        value = set(featureList)
        for j in value:
            childDataSet = splitDataSet(dataset, i, j)
            newEntropy += len(childDataSet) / len(dataset) * cal_entropy(childDataSet)
        if newEntropy < bestEntropy:
            character = i
            bestEntropy = newEntropy
    return character

# When all features are used up, return the most frequent class in classList
def most(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortCount[0][0]

# Recursively build the decision tree
def createDT(dataSet, labels):
    tempLabels = labels[:]  # copy so the caller's label list is not modified
    classList = [word[-1] for word in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # stop: all remaining rows share one class
    if len(dataSet[0]) == 1:
        return most(classList)  # stop: features exhausted, take the majority class
    character = chooseFeature(dataSet)
    node = tempLabels[character]
    myTree = {node: {}}
    del tempLabels[character]
    featureList = [word[character] for word in dataSet]
    value = set(featureList)
    for i in value:
        myTree[node][i] = createDT(splitDataSet(dataSet, character, i), tempLabels)
    return myTree

# Classify a record by walking down the tree
def classify(dTree, labels, testData):
    node = list(dTree.keys())[0]  # the feature tested at this node
    condition = dTree[node]
    labelIndex = labels.index(node)  # column of that feature in testData
    classLabel = None
    for key in condition:
        if testData[labelIndex] == key:
            if type(condition[key]).__name__ == 'dict':
                classLabel = classify(condition[key], labels, testData)  # internal node
            else:
                classLabel = condition[key]  # leaf: the class label
    return classLabel

# Save the built tree so it can be reused later
def storeTree(myTree, filename):
    with open(filename, 'wb') as f:
        pickle.dump(myTree, f)

# Load a saved tree
def loadTree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)

dataSet, labels = dataset()
myTree = createDT(dataSet, labels)
storeTree(myTree, '1')
myTree = loadTree('1')
print(myTree)
print(classify(myTree, labels, ['晴', 1, 3, '无']))
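
Running the script prints the tree dict followed by the classification of the test record (the branch order inside each dict may vary between runs, since branches are iterated from a set):

# expected output:
# {'天气': {'晴': {'湿度': {1: '适合', 2: '不适合', 3: '不适合'}}, '多云': '适合', '有雨': {'风况': {'无': '适合', '有': '不适合'}}}}
# 不适合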