Machine Learning Notes (Washington University) - Classification Specialization - Week Three & Week Four

1. Quality metric

The quality metric for a decision tree is the classification error:

error = (number of incorrect predictions) / (number of examples)
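A minimal Python sketch of this metric (the function and variable names are illustrative, not from the course):

def classification_error(predictions, labels):
    """Fraction of examples whose predicted class differs from the true label."""
    num_incorrect = sum(1 for p, y in zip(predictions, labels) if p != y)
    return num_incorrect / len(labels)

# Example: 2 mistakes out of 5 examples -> error = 0.4
print(classification_error([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))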

 

2. Greedy algorithm

Procedure

Step 1: Start with an empty tree

Step 2: Select a feature to split data

Explanation of Step 2 (a code sketch of this split selection follows the procedure):

  Split the data on each feature

  Calculate the classification error of the resulting decision stump

  Choose the feature with the lowest error

For each split of the tree:

  Step 3: If all data points in a node have the same y value,

      or if we have already used up all the features, stop.

  Step 4: Otherwise, go to Step 2 and continue on this split
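A sketch of the split selection in Step 2, assuming each data point is a Python dict; the helper name best_splitting_feature and the column names are mine, not from the course:

def best_splitting_feature(data, features, target):
    """Return the feature whose decision stump has the lowest classification error."""
    best_feature, best_error = None, float("inf")
    for feature in features:
        num_mistakes = 0
        # Split the data by this feature's values and count majority-class mistakes
        for value in set(point[feature] for point in data):
            labels = [point[target] for point in data if point[feature] == value]
            majority = max(set(labels), key=labels.count)
            num_mistakes += sum(1 for y in labels if y != majority)
        error = num_mistakes / len(data)
        if error < best_error:
            best_feature, best_error = feature, error
    return best_feature

# Tiny toy example ('grade', 'term', 'safe_loans' are made-up column names):
data = [{"grade": "A", "term": 36, "safe_loans": 1},
        {"grade": "B", "term": 36, "safe_loans": 1},
        {"grade": "B", "term": 60, "safe_loans": -1}]
print(best_splitting_feature(data, ["grade", "term"], "safe_loans"))   # 'term'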

Algorithm

predict(tree_node, input):

  if current tree_node is a leaf:

    return majority class of data points in leaf

  else:

    next_node = child node of tree_node whose feature value agrees with input

    return predict(next_node, input)
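A runnable version of the same prediction routine, assuming each node is a small dict with the keys shown (this structure is my own illustration, not the course's data structure):

def predict(tree_node, x):
    """Descend the tree until a leaf is reached, then return its majority class."""
    if tree_node["is_leaf"]:
        return tree_node["prediction"]   # majority class stored at the leaf
    # Follow the child whose branch value matches this input's feature value
    next_node = tree_node["children"][x[tree_node["splitting_feature"]]]
    return predict(next_node, x)

# A hand-built stub tree that splits once on "grade":
tree = {"is_leaf": False, "splitting_feature": "grade",
        "children": {"A": {"is_leaf": True, "prediction": 1},
                     "B": {"is_leaf": True, "prediction": -1}}}
print(predict(tree, {"grade": "B"}))   # -1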

3. Threshold split

Threshold split is used for continuous inputs.

We pick a threshold value for the continuous feature and split the data on it.

Procedure:

Step 1: Sort the values of the feature h_j(x): {v1, v2, ..., vN}

Step 2: For i = 1, ..., N-1 (each pair of adjacent sorted values):

      consider the split t_i = (v_i + v_{i+1}) / 2

      compute the classification error of this split

Step 3: Choose the t_i with the lowest classification error (sketched in the code below)
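A sketch of the threshold search for one numeric feature; the names and the brute-force loop are mine, for illustration only:

def best_threshold(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and keep
    the threshold with the lowest classification error."""
    pairs = sorted(zip(values, labels))
    sorted_values = [v for v, _ in pairs]
    best_t, best_error = None, float("inf")
    for i in range(len(sorted_values) - 1):
        t = (sorted_values[i] + sorted_values[i + 1]) / 2
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        mistakes = 0
        for side in (left, right):
            if side:
                majority = max(set(side), key=side.count)
                mistakes += sum(1 for y in side if y != majority)
        error = mistakes / len(sorted_values)
        if error < best_error:
            best_t, best_error = t, error
    return best_t, best_error

print(best_threshold([22, 25, 38, 51], [-1, -1, 1, 1]))   # (31.5, 0.0)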

 

4. Overfitting

As the depth of the tree increases, overfitting can occur.

Methods to reduce overfitting

1. Early Stopping

Stop the learning algorithm before the tree becomes too complex.

For example (see the sketch after this list):

  • Limit the depth of the tree (it is difficult to choose a good depth value)
  • Stop when a split no longer improves the classification error (can be dangerous, e.g. for XOR-like data where no single split helps but a combination of splits does)
  • Stop if the number of data points in an intermediate node is too small
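A hedged sketch of where such checks could sit in a recursive tree-building loop; max_depth, min_node_size and the function name are illustrative, not course-mandated values:

def should_stop(labels, depth, max_depth=10, min_node_size=10):
    """Early-stopping checks applied before attempting another split."""
    if len(set(labels)) == 1:        # all examples in the node share one class
        return True
    if depth >= max_depth:           # condition 1: depth limit reached
        return True
    if len(labels) <= min_node_size: # condition 3: too few points in the node
        return True
    # Condition 2 (stop when the error no longer improves) would be checked
    # after the best split is found; as noted above it is risky for XOR-like data.
    return False

print(should_stop([1, 1, 1], depth=2))      # True: node is already pure
print(should_stop([1, -1] * 20, depth=2))   # False: keep splitting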

2. Pruning

Simplify the tree after the learning algorithm terminates.

Consider a specific total cost:

Total cost = classification error + λ * (number of leaf nodes)

Start at the bottom of tree T and traverse up, applying prune_split(T, M) to each decision node M:

prune_split(T,M):

1. Compute the total cost of tree T using the formula above: C(T) = Error(T) + λ·L(T)

2. Let T_smaller be the tree obtained by pruning the subtree below M

3. Compute the total cost of T_smaller: C(T_smaller) = Error(T_smaller) + λ·L(T_smaller)

4. If C(T_smaller) < C(T), prune to T_smaller (a small numeric sketch of this comparison follows)
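A small numeric sketch of this cost comparison in Python (the numbers and the λ value are made up for illustration):

def total_cost(error, num_leaves, lam):
    """C(T) = Error(T) + lambda * L(T)."""
    return error + lam * num_leaves

def worth_pruning(error_full, leaves_full, error_pruned, leaves_pruned, lam=0.01):
    """True if replacing the subtree below M with a leaf lowers the total cost."""
    return total_cost(error_pruned, leaves_pruned, lam) < total_cost(error_full, leaves_full, lam)

# Pruning removes 3 leaves but raises the error from 0.12 to 0.13:
print(worth_pruning(0.12, 8, 0.13, 5, lam=0.01))   # True: 0.18 < 0.20, so prune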

 

5. Missing data

1. Purification: skip data points or skip features that contain missing values

Cons:

1. Removing data points or features may remove important info from data

2. Unclear when it is better to remove data points versus features

3. Does not help if data is missing at prediction time

 

2. Imputation

Fill in the missing values (a small sketch of both cases follows the cons below):

1. Categorical feature

Fill in the most popular value of x_i

2. Numerical feature

Fill in the average or median value of x_i

Cons:

May result in systematic error
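A minimal imputation sketch in pure Python (the helper name and the simple median are my own simplifications):

def impute_column(values, is_categorical):
    """Replace None entries with the mode (categorical) or median (numerical)."""
    observed = [v for v in values if v is not None]
    if is_categorical:
        fill = max(set(observed), key=observed.count)   # most common value
    else:
        ordered = sorted(observed)
        fill = ordered[len(ordered) // 2]               # simple median
    return [fill if v is None else v for v in values]

print(impute_column(["A", None, "B", "A"], is_categorical=True))    # ['A', 'A', 'B', 'A']
print(impute_column([1.0, None, 3.0, 5.0], is_categorical=False))   # [1.0, 3.0, 3.0, 5.0]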

 

3. Adding missing value choice to every decision node

We use the classification error to decide which branch the unknown values should follow (sketched in the code below).

Cons:

Requires modification of learning algorithm
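A rough sketch of how that decision could be made, by trying each branch for the unknowns and keeping the assignment with the fewest mistakes; the function and data layout are assumptions, not the course's exact algorithm:

def best_branch_for_missing(branch_labels, missing_labels):
    """branch_labels maps each branch value to the labels that reach it; return
    the branch that minimizes total mistakes if the unknowns follow it."""
    def mistakes(labels):
        majority = max(set(labels), key=labels.count)
        return sum(1 for y in labels if y != majority)

    base = sum(mistakes(labels) for labels in branch_labels.values())
    best_branch, best_total = None, float("inf")
    for branch, labels in branch_labels.items():
        # Total mistakes if the unknown-value examples are routed down this branch
        total = base - mistakes(labels) + mistakes(labels + missing_labels)
        if total < best_total:
            best_branch, best_total = branch, total
    return best_branch

branches = {"excellent": [1, 1, 1], "poor": [-1, -1, 1]}
print(best_branch_for_missing(branches, [1, 1]))   # 'excellent'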

 
