Building a Decision Tree from Weather Data
Goal
Build a decision tree from the weather data (the full data set is listed in the dataset() function in the code section below). Temperature and humidity are preprocessed first: each is discretized into four ranges, and the values 0-3 replace the original readings as feature values. The mapping is:
- [60, 70) → 0
- [70, 80) → 1
- [80, 90) → 2
- [90, 100) → 3
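Concretely, a reading can be mapped to its bin with integer division. The helper below is a minimal sketch for illustration only (the name to_bin is hypothetical; the original code simply stores the already-binned values):

def to_bin(value):
    # Map a raw temperature/humidity reading in [60, 100) to a bin 0-3.
    return int((value - 60) // 10)

# to_bin(64) -> 0, to_bin(75) -> 1, to_bin(83) -> 2, to_bin(95) -> 3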
Building the Decision Tree
Tree construction
The tree is created by recursive partitioning, which stops when either of the following conditions is met:
- If all remaining samples belong to a single class, stop and return that class.
- If every feature has already been used for splitting and the remaining samples still do not belong to a single class, take the majority class as the label of that leaf.
Each step selects the best feature for the current split, then the subsets produced by splitting on that feature are each passed to the next step, until a stopping condition is reached.
Choosing the best feature
The idea is this: after splitting on a good feature, each resulting subset is more ordered, i.e. closer to containing a single class. So for each candidate feature we compute the weighted sum of the information entropies of the subsets it produces; the smaller the sum, the better the feature (since the parent entropy is the same for every candidate, this is equivalent to maximizing information gain). The feature with the smallest sum is chosen for the split.
For example, suppose the class labels [0, 1, 1, 0, 1] are split by feature n1 into two parts, a: [0, 1] and b: [1, 0, 1]. First compute the entropy of each part:
\(H(a)=-\frac{1}{2}\log_2{\frac{1}{2}}-\frac{1}{2}\log_2{\frac{1}{2}}=1\)
\(H(b)=-\frac{1}{3}\log_2{\frac{1}{3}}-\frac{2}{3}\log_2{\frac{2}{3}}\approx 0.918\)
Then compute their weighted sum:
\(\frac{2}{5}H(a)+\frac{3}{5}H(b)=\frac{2}{5}\cdot 1+\frac{3}{5}\cdot 0.918\approx 0.951\)
This value is the score for splitting on feature n1. Score feature n2 the same way, and so on until every feature has been scored; the feature with the lowest score is chosen for the split.
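As a quick sanity check of this arithmetic, here is a small standalone snippet (independent of the cal_entropy function in the code section):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

a = [0, 1]       # H(a) = 1.0
b = [1, 0, 1]    # H(b) ~= 0.918
print(2 / 5 * entropy(a) + 3 / 5 * entropy(b))  # ~= 0.951, the score for n1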
The resulting tree, as nested dicts (feature names: 天气 = outlook, 温度 = temperature, 湿度 = humidity, 风况 = wind; values: 晴 = sunny, 多云 = overcast, 有雨 = rain, 无 = no wind, 有 = windy; classes: 适合 = suitable, 不适合 = unsuitable):
{'天气': {'晴': {'湿度': {1: '适合', 2: '不适合', 3: '不适合'}},
         '多云': '适合',
         '有雨': {'风况': {'无': '适合', '有': '不适合'}}}}
[Figure: visualization of the decision tree]
Classifying with the Decision Tree
Use the tree built above to classify new samples. Take ['晴', 1, 3, '无'] (sunny, temperature bin 1, humidity bin 3, no wind) as an example: first check 天气 (outlook), which is 晴, so follow the sunny branch to 湿度 (humidity); humidity is 3, so the sample is classified as 不适合 (not suitable for sport), which matches the program's output.
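For reference, the classification call at the end of the code section prints:

不适合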
Code
import math
import operator
import pickle

# Training data. Features: 天气 (outlook), 温度 (temperature, binned),
# 湿度 (humidity, binned), 风况 (wind); class: 运动 (suitable for sport or not).
def dataset():
    # Temperature/humidity bins: [60,70)->0, [70,80)->1, [80,90)->2, [90,100)->3
    labels = ['天气', '温度', '湿度', '风况', '运动']
    dataSet = [['晴', 2, 2, '无', '不适合'],
               ['晴', 2, 3, '有', '不适合'],
               ['多云', 2, 1, '无', '适合'],
               ['有雨', 1, 3, '无', '适合'],
               ['有雨', 0, 2, '无', '适合'],
               ['有雨', 0, 1, '有', '不适合'],
               ['多云', 0, 0, '有', '适合'],
               ['晴', 1, 3, '无', '不适合'],
               ['晴', 0, 1, '无', '适合'],
               ['有雨', 1, 2, '无', '适合'],
               ['晴', 1, 1, '有', '适合'],
               ['多云', 1, 3, '有', '适合'],
               ['多云', 2, 1, '无', '适合'],
               ['有雨', 1, 2, '有', '不适合']]
    return dataSet, labels
# Compute the Shannon entropy of a data set; the class label is the last column.
def cal_entropy(dataset):
    length = len(dataset)
    entropy = 0
    count = {}
    for i in dataset:  # iterate over the rows actually passed in
        label = i[-1]
        count[label] = count.get(label, 0) + 1
    for key in count:
        p = count[key] / length
        entropy -= p * math.log(p, 2)
    return entropy
# Return the rows whose value in column `axis` equals `value`,
# with that column removed.
def splitDataSet(dataSet, axis, value):
    childDataSet = []
    for i in dataSet:
        if i[axis] == value:
            childList = i[:axis]
            childList.extend(i[axis + 1:])
            childDataSet.append(childList)
    return childDataSet
# Choose the feature whose split gives the lowest weighted entropy.
def chooseFeature(dataset):
    old_entropy = cal_entropy(dataset)
    character = -1
    for i in range(len(dataset[0]) - 1):
        newEntropy = 0
        featureList = [word[i] for word in dataset]
        values = set(featureList)
        for j in values:
            childDataSet = splitDataSet(dataset, i, j)
            newEntropy += len(childDataSet) / len(dataset) * cal_entropy(childDataSet)
        if newEntropy < old_entropy:
            character = i
            old_entropy = newEntropy
    return character
# When all features are used up, return the majority class of the remaining data.
def most(classList):
    classCount = {}
    for vote in classList:  # count occurrences of each class label
        classCount[vote] = classCount.get(vote, 0) + 1
    sortCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortCount[0][0]
# Recursively build the decision tree as nested dicts.
def createDT(dataSet, labels):
    tempLabels = labels[:]  # copy, so the caller's label list is untouched
    classList = [word[-1] for word in dataSet]
    # Stop: every remaining sample belongs to the same class.
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Stop: all features used up; fall back to the majority class.
    if len(dataSet[0]) == 1:
        return most(classList)
    character = chooseFeature(dataSet)
    node = tempLabels[character]
    myTree = {node: {}}
    del tempLabels[character]
    featureList = [word[character] for word in dataSet]
    for i in set(featureList):
        # createDT copies the label list on entry, so passing tempLabels is safe
        myTree[node][i] = createDT(splitDataSet(dataSet, character, i), tempLabels)
    return myTree
# Classify one sample by walking the tree from the root.
def classify(dTree, labels, testData):
    node = list(dTree.keys())[0]
    condition = dTree[node]
    labelIndex = labels.index(node)
    classLabel = None  # stays None if the sample has a feature value never seen in training
    for key in condition:
        if testData[labelIndex] == key:
            if isinstance(condition[key], dict):
                classLabel = classify(condition[key], labels, testData)
            else:
                classLabel = condition[key]
    return classLabel
# Save the built tree with pickle so it can be reused later.
def storeTree(myTree, filename):
    with open(filename, 'wb') as f:
        pickle.dump(myTree, f)

# Load a previously saved tree.
def loadTree(filename):
    with open(filename, 'rb') as f:
        return pickle.load(f)
dataSet, labels = dataset()
myTree = createDT(dataSet, labels)
storeTree(myTree, '1')
myTree = loadTree('1')
print(myTree)
print(classify(myTree, labels, ['晴', 1, 3, '无']))
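Running the script prints the tree shown earlier and then the prediction 不适合. Note that the key order of the printed dict can vary between runs, because createDT iterates feature values through a set and Python's string hashing is randomized; the structure of the tree is the same either way.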