Machine Learning Notes (Washington University) - Classification Specialization - Week Three & Week Four

1. Quality metric

The quality metric for a decision tree is the classification error:

error = (number of incorrect predictions) / (number of examples)
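A minimal Python sketch of this metric (the function and variable names are illustrative, not from the course):

def classification_error(predictions, labels):
    """Fraction of examples whose predicted class differs from the true label."""
    num_incorrect = sum(1 for p, y in zip(predictions, labels) if p != y)
    return num_incorrect / len(labels)

# Example: 2 mistakes out of 5 examples -> error = 0.4
print(classification_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))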

 

2. Greedy algorithm

Procedure

Step 1: Start with an empty tree

Step 2: Select a feature to split data

Explanation of Step 2 (a code sketch of this split selection follows the procedure):

  Split the data on each feature

  Calculate the classification error of the resulting decision stump

  Choose the feature with the lowest error

For each split of the tree:

  Step 3: If all data points in a node have the same y value,

      or if we have already used up all the features, stop.

  Step 4: Otherwise, go to Step 2 and continue on this split
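A sketch of the split selection in Step 2, assuming each data point is a Python dict; the helper name best_splitting_feature and the column names are mine, not from the course:

def best_splitting_feature(data, features, target):
    """Return the feature whose decision stump has the lowest classification error."""
    best_feature, best_error = None, float("inf")
    for feature in features:
        num_mistakes = 0
        # Split the data by this feature's values and count majority-class mistakes
        for value in set(point[feature] for point in data):
            labels = [point[target] for point in data if point[feature] == value]
            majority = max(set(labels), key=labels.count)
            num_mistakes += sum(1 for y in labels if y != majority)
        error = num_mistakes / len(data)
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature

# Tiny toy example ('grade', 'term', 'safe_loans' are made-up column names):
data = [{"grade": "A", "term": 36, "safe_loans": 1},
        {"grade": "B", "term": 36, "safe_loans": 1},
        {"grade": "B", "term": 60, "safe_loans": -1}]
print(best_splitting_feature(data, ["grade", "term"], "safe_loans"))   # 'term'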

Algorithm

predict(tree_node, input):

  if current tree_node is a leaf:

    return majority class of data points in leaf

  else:

    next_node = child node of tree_node whose feature value agrees with input

    return predict(next_node, input)
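A runnable version of the same prediction routine, assuming each node is a small dict with the keys shown (this structure is my own illustration, not the course's data structure):

def predict(tree_node, x):
    """Descend the tree until a leaf is reached, then return its majority class."""
    if tree_node["is_leaf"]:
        return tree_node["prediction"]   # majority class stored at the leaf
    # Follow the child whose branch value matches this input's feature value
    next_node = tree_node["children"][x[tree_node["splitting_feature"]]]
    return predict(next_node, x)

# A hand-built stub tree that splits once on "grade":
tree = {"is_leaf": False, "splitting_feature": "grade",
        "children": {"A": {"is_leaf": True, "prediction": 1},
                     "B": {"is_leaf": True, "prediction": -1}}}
print(predict(tree, {"grade": "B"}))   # -1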

3. Threshold split

Threshold split is used for continuous inputs.

We pick a threshold value for the continuous feature and split the data on it.

Procedure:

Step 1: Sort the values of the feature h_j(x): {v1, v2, ..., vN}

Step 2: For i = 1, ..., N-1 (each pair of adjacent sorted values):

      consider the split t_i = (v_i + v_{i+1}) / 2

      compute the classification error of this split

Step 3: Choose the t_i with the lowest classification error (sketched in the code below)
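A sketch of the threshold search for one numeric feature; the names and the brute-force loop are mine, for illustration only:

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and keep
    the threshold with the lowest classification error."""
    pairs = sorted(zip(values, labels))
    sorted_values = [v for v, _ in pairs]
    best_t, best_error = None, float("inf")
    for i in range(len(sorted_values) - 1):
        t = (sorted_values[i] + sorted_values[i + 1]) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = 0
        for side in (left, right):
            if side:
                majority = max(set(side), key=side.count)
                mistakes += sum(1 for y in side if y != majority)
        error = mistakes / len(sorted_values)
        if error < best_error:
            best_t, best_error = t, error
    return best_t, best_error

print(best_threshold([22, 25, 38, 51], [-1, -1, 1, 1]))   # (31.5, 0.0)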

 

4. Overfitting

As the depth of the tree increases, overfitting can occur.

Methods to reduce overfitting

1. Early Stopping

Stop the learning algorithm before the tree becomes too complex.

For example (see the sketch after this list):

  • Limit the depth of the tree (it is difficult to choose a good depth value)
  • Stop when a split no longer improves the classification error (can be dangerous, e.g. for XOR-like data where no single split helps but a combination of splits does)
  • Stop if the number of data points in an intermediate node is too small
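A hedged sketch of where such checks could sit in a recursive tree-building loop; max_depth, min_node_size and the function name are illustrative, not course-mandated values:

def should_stop(labels, depth, max_depth=10, min_node_size=10):
    """Early-stopping checks applied before attempting another split."""
    if len(set(labels)) == 1:        # all examples in the node share one class
        return True
    if depth >= max_depth:           # condition 1: depth limit reached
        return True
    if len(labels) <= min_node_size: # condition 3: too few points in the node
        return True
    # Condition 2 (stop when the error no longer improves) would be checked
    # after the best split is found; as noted above it is risky for XOR-like data.
    return False

print(should_stop([1, 1, 1], depth=2))      # True: node is already pure
print(should_stop([1, -1] * 20, depth=2))   # False: keep splitting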

2. Pruning

Simplify the tree after the learning algorithm terminates.

Consider a specific total cost:

Total cost = classification error + λ * (number of leaf nodes)

Start at the bottom of tree T and traverse up, applying prune_split(T, M) to each decision node M:

prune_split(T,M):

1. Compute the total cost of tree T using the formula above: C(T) = Error(T) + λ·L(T)

2. Let T_smaller be the tree obtained by pruning the subtree below M

3. Compute the total cost of T_smaller: C(T_smaller) = Error(T_smaller) + λ·L(T_smaller)

4. If C(T_smaller) < C(T), prune to T_smaller (a small numeric sketch of this comparison follows)
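A small numeric sketch of this cost comparison in Python (the numbers and the λ value are made up for illustration):

def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

def worth_pruning(error_full, leaves_full, error_pruned, leaves_pruned, lam=0.01):
    """True if replacing the subtree below M with a leaf lowers the total cost."""
    return total_cost(error_pruned, leaves_pruned, lam) < total_cost(error_full, leaves_full, lam)

# Pruning removes 3 leaves but raises the error from 0.12 to 0.13:
print(worth_pruning(0.12, 8, 0.13, 5, lam=0.01))   # True: 0.18 < 0.20, so prune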

 

5. Missing data

1. Purification: skip data points or skip features that contain missing values

Cons:

1. Removing data points or features may remove important info from data

2. Unclear when it is better to remove data points versus features

3. Does not help if data is missing at prediction time

 

2. Imputation

Fill in the missing values (a small sketch of both cases follows the cons below):

1. Categorical feature

Fill in the most popular value of x_i

2. Numerical feature

Fill in the average or median value of x_i

Cons:

May result in systematic error
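A minimal imputation sketch in pure Python (the helper name and the simple median are my own simplifications):

def impute_column(values, is_categorical):
    """Replace None entries with the mode (categorical) or median (numerical)."""
    observed = [v for v in values if v is not None]
    if is_categorical:
        fill = max(set(observed), key=observed.count)   # most common value
    else:
        ordered = sorted(observed)
        fill = ordered[len(ordered) // 2]               # simple median
    return [fill if v is None else v for v in values]

print(impute_column(["A", None, "B", "A"], is_categorical=True))    # ['A', 'A', 'B', 'A']
print(impute_column([1.0, None, 3.0, 5.0], is_categorical=False))   # [1.0, 3.0, 3.0, 5.0]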

 

3. Adding missing value choice to every decision node

We use the classification error to decide which branch the unknown values should follow (sketched in the code below).

Cons:

Requires modification of learning algorithm
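A rough sketch of how that decision could be made, by trying each branch for the unknowns and keeping the assignment with the fewest mistakes; the function and data layout are assumptions, not the course's exact algorithm:

def best_branch_for_missing(branch_labels, missing_labels):
    """branch_labels maps each branch value to the labels that reach it; return
    the branch that minimizes total mistakes if the unknowns follow it."""
    def mistakes(labels):
        majority = max(set(labels), key=labels.count)
        return sum(1 for y in labels if y != majority)

    base = sum(mistakes(labels) for labels in branch_labels.values())
    best_branch, best_total = None, float("inf")
    for branch, labels in branch_labels.items():
        # Total mistakes if the unknown-value examples are routed down this branch
        total = base - mistakes(labels) + mistakes(labels + missing_labels)
        if total < best_total:
            best_branch, best_total = branch, total
    return best_branch

branches = {"excellent": [1, 1, 1], "poor": [-1, -1, 1]}
print(best_branch_for_missing(branches, [1, 1]))   # 'excellent'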

 
