COMP9313 Week8 Classification and PySpark MLlib

https://drive.google.com/drive/folders/13_vsxSIEU9TDg1TCjYEwOidh0x3dU6es

https://www.cse.unsw.edu.au/~cs9313/20T2/slides/L7.pdf

 

Machine Learning :

  1. Construct a model, predicting new data

  2. 

 

Evaluation Matrix:

  Positive/Negative: Label ∈{a,b,c,d} 选择a为positive,则其他都是negative

  False Positive:  not a but classified as a

  False Negative: a but classified as b or c or d

  True Positive : a and classified as a

  

  Precision = tp / tp+fp

  Recall = tp / tp+fn

  F1 = 2 * precision*recall / ( precision + recall)

 

  Micro:   True label 是 positive  

  Macro:  mean of F1 of each class label

 

Classification:

  1. Preprocessing and Feature Engineering 

    1) bag of words 

    2) 去高频词

    

 

 

  2.  Train classifier

  3. Evaluate the classifier

    1) split a 'development set' from the training set 

    2)   k-fold cross-validation, 然后取 avg(accuracy)

      

 

 

Text Classification:

  1. Input •Document or sentence

  2.  •Output •Class label C ∈ {c1, c2, … }

  3. Classification methods:

     •Naïve bayes

    •Logistic regression

    •Support-vector machines •…

  4. Naïve Bayes

    1) bag of words -> features变成d维向量,label为c

    2) 最大后验概率

    3)假设条件独立。假设位置无关

    4)  

 

 

    

  

  

 

   

 

 

 

PySpark MLlib:

   

posted @ 2020-07-20 12:16  ChevisZhang  阅读(198)  评论(0编辑  收藏  举报