【作业三】林轩田机器学习技法
这次关注的作业题目是Q13~Q20,主要是实现basic C&RT分类树,以及由其构成的Random Forest。
其中basic C&RT分类树的实现思路如下:
(一)先抽象出来几个功能:
1)从local file读数据并转化成numpy.array的形式(考虑空行容错)(def read_input_data(path))
2)如何根据某个维度的feature,计算这个feature产生的branch criteria(此题中为decision stump)(def learn_decisionStump(x,y))
3)比较多个feature产生的branch criteria,挑选最好的作为当前节点的b(x),并把数据分成两份送到左右子树 (def splited_by_decisionStump(x, y))
4)给定一组y,计算其gini index ( def calculate_GiniIndex(y) )
(二)根据上述几个抽象的功能,用dfs递归方法实现C&RT分类树:
1)终止条件:空(返回空)、输入数据都属于一类(这里是一个坑,不能直接返回空,需要构造一个叶子节点再返回;这里叶子节点的index设为-1,用于标示走到叶子节点)
2)利用抽象出来功能,计算branch criteria,生成新节点
3)将数据分为两组并送到left和right两个分支分别获得分支的sub branch criteria
4)返回这一层新生成的节点
(三)如何保存模型学习结果?
随着树的生长不断进行,记录每个节点的branch criteria,并以binary search tree的形式存放model。
(四)如何预测?
推来一个新的样本,只要根据每一层branch criteria中的index和val,来判断往哪走(注意,这个过程是根sign没关系的);直到走到叶子节点(index==-1判断叶子节点),看叶子节点的sign是啥,输入数据的标签就是啥。
按照(一)~(四),再细心一些,就可以把模型写做出来了。至于后面的Random Forest的过程,就是把先sampling再不断生成C&RT树。
这里没有对高效的sampling方法探究,感觉自己的sampling方式很shit。但是稍微等等,也可以出来正确的结果。
全部代码如下:
#encoding=utf8 import sys import numpy as np import math from random import * ## # tree node for storing C&RT model's decision node # i: feature index # v: decision-stump threshold value # s: decision-stump sign ( direction ) # left: left branch TreeNode # right: right branch TreeNode class TreeNode: def __init__(self, i, v): self.index = i self.val = v self.sign = 0 self.left = None self.right = None ## # read data from local file # return with numpy array def read_input_data(path): x = [] y = [] for line in open(path).readlines(): if line.strip()=='': continue items = line.strip().split(' ') tmp_x = [] for i in range(0,len(items)-1): tmp_x.append(float(items[i])) x.append(tmp_x) y.append(float(items[-1])) return np.array(x),np.array(y) ## # input All data ( binary categories in this context ) # learning decision-stump from the data # splited subdata via learned decision-stump # return two splited data, index, val, sign def splited_by_decisionStump(x, y): # storeing sorted index via all x's certain feature sorted_index = [] for i in range(0, x.shape[1]): sorted_index.append(np.argsort(x[:,i])) # learn the best feature for this node's decision stump n1 = x.shape[0]/2 n2 = x.shape[0]-n1 Branch = float("inf") index = -1 val = 0 for i in range(0, x.shape[1]): # learn decision stump via x[i] xi = x[sorted_index[i], i] yi = y[sorted_index[i]] # minimize cost function of feature i b, v = learn_decisionStump(xi, yi) # update least impuirty parameter (val,sign) if Branch>b: Branch = b index = i val = v # spliting data with best feature and it's val & sign leftX = x[np.where(x[:,index]<val)] leftY = y[np.where(x[:,index]<val)] rightX = x[np.where(x[:,index]>=val)] rightY = y[np.where(x[:,index]>=val)] return leftX, leftY, rightX, rightY, index, val # learn decision-stump threshold from one feature dimension def learn_decisionStump(x,y): # calculate median of interval thetas = np.array([ (x[i]+x[i+1])/2 for i in range(0, x.shape[0]-1) ] ) B = float("inf") target_theta = 0.0 # traversal each median value for theta in thetas: ly = y[np.where(x<theta)] ry = y[np.where(x>=theta)] b = ly.shape[0]*calculate_GiniIndex(ly) + ry.shape[0]*calculate_GiniIndex(ry) if B>b: B = b target_theta = theta return B, target_theta ## # input data ( binary catergories in this context ) # return with Gini Index def calculate_GiniIndex(y): if y.shape[0]==0: return 0 n1 = sum(y==1) n2 = sum(y==-1) if (n1+n2)==0: return 0 return 1.0 - math.pow(1.0*n1/(n1+n2),2) - math.pow(1.0*n2/(n1+n2),2) ## # C&RT tree's dfs learning algorithm # return with learned model within a binary tree def CART(x, y): if x.shape[0]==0: return None # none case if calculate_GiniIndex(y)==0: # terminal case ( only one category ) node = TreeNode(-1, -1) node.sign = 1 if y[0]==1 else -1 return node leftX, leftY, rightX, rightY, index, val = splited_by_decisionStump(x,y) node = TreeNode(index,val) node.left = CART(leftX, leftY) node.right = CART(rightX, rightY) return node ## Q13 # count internal nodes def count_internal_nodes(root): if root==None: return 0 if root.left==None and root.right==None: return 0 print root.index, root.val l = 0 r = 0 if root.left!=None: l = count_internal_nodes(root.left) if root.right!=None: r = count_internal_nodes(root.right) return 1 + l + r ## Q15 # predict def predict(root, x): if root.index==-1: return root.sign if x[root.index]<root.val: return predict(root.left, x) else: return predict(root.right, x) # calculate Eout def calculate_E(model, path): x,y = read_input_data(path) error_count = 0 for i in range(0, x.shape[0]): error_count = error_count + (1 if predict(model, x[i])!=y[i] else 0) return 1.0*error_count/x.shape[0] ## Q16 # Random Forest via Bagging and average Ein(gt) def randomForest(x, y, T): error_rate = 0 trees = [] for i in range(0,T): xi,yi = naive_sampling(x, y) model = CART(xi,yi) error_rate += calculate_E(model,"train.dat") trees.append(model) return error_rate/T, trees # holy shit naive sampling def naive_sampling(x, y): sampleX = np.zeros(x.shape) sampleY = np.zeros(y.shape) for i in range(0, x.shape[0]): index = randint(0, x.shape[0]-1) sampleX[i] = x[index] sampleY[i] = y[index] return sampleX, sampleY ## Q17 Q18 # Ein(G) def calculate_RF_E(trees, path): x,y = read_input_data(path) error_count = 0 for i in range(0, x.shape[0]): yp = rf_predict(trees, x[i]) error_count += 1 if yp!=y[i] else 0 return 1.0*error_count/x.shape[0] # random forest predict process def rf_predict(trees, x): positives = 0 negatives = 0 for tree in trees: yp = predict(tree, x) if yp==1: positives += 1 else: negatives += 1 return 1 if positives>negatives else -1 ## Q19 # prune to only one branch def one_branch_CART(x, y): if x.shape[0]==0: return None # none case if calculate_GiniIndex(y)==0: # terminal case ( only one category ) node = TreeNode(-1, -1) node.sign = 1 if y[0]==1 else -1 return node leftX, leftY, rightX, rightY, index, val = splited_by_decisionStump(x,y) node = TreeNode(index, val) node.left = TreeNode(-1, -1) node.right = TreeNode(-1, -1) ly = y[np.where(x[:,index]<val)] node.left.sign = 1 if sum(ly==1)>sum(ly==-1) else -1 node.right.sign = -node.left.sign return node def one_branch_randomForest(x, y, T): trees = [] for i in range(0,T): xi,yi = naive_sampling(x, y) model = one_branch_CART(xi, yi) trees.append(model) return trees def main(): # x,y = read_input_data("unitTestSplitedByDecisionStump.dat") x,y = read_input_data("train.dat") root = CART(x,y) print count_internal_nodes(root) print calculate_E(root, "test.dat") error_rate,trees = randomForest(x, y, 301) print error_rate print calculate_RF_E(trees, "train.dat") print calculate_RF_E(trees, "test.dat") trees = one_branch_randomForest(x, y, 301) print calculate_RF_E(trees, "train.dat") print calculate_RF_E(trees, "test.dat") if __name__ == '__main__': main()
通过这几道题目,体会了Random Forest的好处:
1)Random Forest的每棵树都不是最强的,但是整合在一起可以很强(如Ein做到0)
2)虽然单棵不剪枝tree的Ein也可以做到0,但是泛化性能(Eout)就比较弱了
3)Random Forest巧用了Bagging的方法,整合了多棵tree:不用剪枝,还增加了泛化性能
4)如果削弱Random Forest中的每棵树的功能,对整体效果是有影响的