COMP9517 Week 5: Machine Learning

https://echo360.org.au/lesson/29e8e4ac-4007-4e31-b682-aa9da17ddd36/classroom#sortDirection=desc

https://webcms3.cse.unsw.edu.au/static/uploads/course/COMP9517/20T2/d553e782841ce957695fba698756b72f159c2ebae4e65061c4d361b642e2ca85/COMP9517_20T2W5_Pattern_Recognition_Part_2.pdf

 

Summary:

  1. Pattern recognition

  

 

 

   

 

 

   

  2. Nearest Class Mean Classifier

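    1) Compute the class mean vector for each of the k classes from the training samples

    2) An input vector X from the test set is assigned to class 𝑘 if it is closer to the mean vector of class 𝑘 than to any other class mean vector

    3) The distance between X and a mean vector is measured with the Euclidean distance

    4) If a class k is made up of two sub-classes, compute a separate mean vector for each sub-class

    5) Pros: works well when the classes are far apart and clearly distinct

    6) Cons: performs poorly when the samples are complex (for example, multi-modal classes); cannot handle outliers or missing data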
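
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not code from the lecture: the helper names (class_means, euclidean, predict) and the toy data are made up for this example.

```python
import math
from collections import defaultdict

def class_means(samples, labels):
    """Return {label: mean vector} computed from the training samples."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def predict(means, x):
    """Assign x to the class whose mean vector is nearest."""
    return min(means, key=lambda y: euclidean(means[y], x))

# Toy 2-D data, invented for illustration only
train_x = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
train_y = ["a", "a", "b", "b"]

means = class_means(train_x, train_y)
print(predict(means, [1.5, 2.0]))
print(predict(means, [7.0, 7.0]))
```

To handle a class with two sub-classes as in step 4, one option is to give each sub-class its own label during training and map both labels back to the parent class after prediction.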

    

 

 

     

 

 

   3. K-Nearest Neighbours (KNN)

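    1) A sample is assigned the label that is the majority class among its k nearest neighbours

    2) To find the k neighbours nearest to a sample, the distance for continuous variables is usually the Euclidean distance

    3) For discrete variables, use the Hamming distance.

    4) Pros: 

      1. No training step

      2. Decision surfaces are non-linear

    5) Cons:

      1. Very slow on large datasets: a brute-force query compares against all N training samples, so classifying N samples costs O(N^2) distance computations

      2. Performs poorly and is hard to interpret when there are many features and the dimensionality is high (curse of dimensionality)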
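
A brute-force version of the procedure above could look like the sketch below. The function names and the toy data are assumptions made for illustration; the distance function is passed in so that the same code covers both the Euclidean and the Hamming case.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Distance for continuous variables."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def hamming(a, b):
    """Distance for discrete variables: number of positions that differ."""
    return sum(u != v for u, v in zip(a, b))

def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Label the query with the majority class among its k nearest neighbours."""
    # No training step: every query is compared against all stored samples.
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy continuous data, invented for illustration only
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
point_labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, point_labels, [2, 2], k=3))

# Toy discrete data, compared with the Hamming distance
items = [("red", "round"), ("red", "oval"), ("green", "long")]
item_labels = ["apple", "apple", "bean"]
print(knn_predict(items, item_labels, ("red", "round"), k=1, distance=hamming))
```

Sorting all training samples for every query is what makes this approach slow on large datasets; choosing an odd k avoids ties when there are two classes.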

      

 

   

  4. Bayesian Decision Theory 

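    1) A classifier's decision may be wrong, so the classification result should be expressed as a probability; an object is assigned to the class with the highest probability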

    2)
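
As a sketch of the rule in 1), the snippet below turns class priors and likelihoods into posterior probabilities with Bayes' rule and picks the most probable class. The class names and all the numbers are invented for illustration and do not come from the lecture.

```python
def posteriors(priors, likelihoods):
    """Bayes' rule: P(c | x) = p(x | c) * P(c) / sum over classes of p(x | c) * P(c)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

# Hypothetical values: P(class) and p(x | class) for one observed x
priors = {"salmon": 0.3, "sea bass": 0.7}
likelihoods = {"salmon": 0.6, "sea bass": 0.2}

post = posteriors(priors, likelihoods)
decision = max(post, key=post.get)  # class with the highest posterior probability
print(post, decision)
```

Here the rarer class still wins because its likelihood for the observed x is three times higher, which is enough to outweigh the smaller prior.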

    

    

 

     

 

   5. Decision Tree

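    1) When the samples at a node are perfectly classified, the entropy is 0; otherwise it is positive, so the goal is to minimise the entropy

    2) The better the classification, the lower the entropy and the higher the information gain

    3) Split on whichever feature gives the largest information gain, and use that feature as the node

    4) Pros:

      1. Easy to interpret

      2. Handles both numerical and categorical data

      3. Handles outliers and missing values

      4. Indicates the importance of each feature

    5) Cons: tends to overfit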
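
The entropy and information gain calculations in 1) to 3) can be sketched as follows. The feature names and the four-row dataset are made up for this example; only the formulas are standard.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; 0 when all labels are the same."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the weighted entropy after splitting on feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Toy categorical data, invented for illustration only
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {f: information_gain(rows, labels, f) for f in ("outlook", "windy")}
best = max(gains, key=gains.get)  # feature chosen for the node
print(gains, best)
```

In this toy set, splitting on outlook leaves two pure groups, so it recovers the full 1 bit of entropy, while splitting on windy leaves both groups evenly mixed and gains nothing.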

 

    

 

     

 

     

    

 

     

   

  6. Ensemble learning, Random Forest

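    1) Error rate

      1. The error rate of a random forest depends on the correlation between the trees in the forest and on the strength of each individual tree

      2. Higher correlation and lower strength both lead to a higher error rate

      3. Both the correlation and the strength increase with m, the number of features selected for each tree

      4. So m has to be chosen as a trade-off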

    2) Pros:

      • unexcelled in accuracy among current algorithms

      • works efficiently on large datasets

      • handles thousands of input features without feature selection

      • handles missing values effectively

    3) Cons:

      • less interpretable than an individual decision tree

      • more complex and more time-consuming to construct than decision trees
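
To make the role of m concrete, here is a deliberately simplified sketch of the ensemble idea: each tree is trained on a bootstrap sample using a random subset of m features, and the forest predicts by majority vote. It is not a full random forest. As assumptions for brevity, each "tree" is a one-level decision stump and the m features are drawn once per tree, whereas real implementations grow deeper trees and usually resample the features at every split. All names and data are invented for illustration.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split in this sample: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]

def stump_predict(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def train_forest(rows, labels, n_trees, m, rng):
    """Train n_trees stumps, each on a bootstrap sample and m random features."""
    n_features = len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample (with replacement)
        features = rng.sample(range(n_features), m)      # random subset of m features
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], features))
    return forest

def forest_predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Toy data: features 0 and 1 separate the classes, feature 2 is noise
rows = [[1, 1, 1], [2, 1, 2], [1, 2, 3], [2, 2, 4],
        [8, 9, 1], [9, 8, 2], [8, 8, 3], [9, 9, 4]]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]

forest = train_forest(rows, labels, n_trees=25, m=2, rng=random.Random(0))
print(forest_predict(forest, [1, 1, 4]), forest_predict(forest, [9, 9, 1]))
```

With a small m the stumps see different features and so disagree more (lower correlation), but each one is more likely to be stuck with the noise feature (lower strength); a large m has the opposite effect, which is the trade-off described in 1).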

 

  7. SVM

posted @ 2020-07-03 16:43  ChevisZhang