COMP9313 Week8 Classification and PySpark MLlib


Machine Learning:

  1. Construct a model from labelled training data, then use it to predict labels for new data



Evaluation Metrics:

  Positive/Negative: Label ∈ {a,b,c,d}; if a is chosen as positive, all other labels are negative

  False Positive: true label is not a, but classified as a

  False Negative: true label is a, but classified as b, c or d

  True Positive: true label is a, and classified as a


  Precision = tp / (tp + fp)

  Recall = tp / (tp + fn)

  F1 = 2 * precision * recall / (precision + recall)
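
  The three formulas above can be sketched in plain Python as follows. This is a minimal illustration, not PySpark MLlib code: the function name `precision_recall_f1` and the toy label lists are assumptions made for the example, and a zero denominator is mapped to 0.0 by convention.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """One-vs-rest metrics: `positive` is the positive class, every other label is negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (assumed convention: score 0.0).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data (hypothetical): labels drawn from {a, b, c, d}, with a as positive.
y_true = ["a", "a", "a", "b", "c", "d"]
y_pred = ["a", "a", "b", "a", "c", "d"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="a")
print(p, r, f)
```

  Here tp = 2, fp = 1 and fn = 1, so precision, recall and F1 all come out to 2/3.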


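  Micro: each instance's true label is treated as its positive class; tp, fp and fn are summed over all classes, and F1 is computed from the totals

  Macro: mean of the F1 scores of the individual class labels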
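
  A small sketch of the difference between the two averages, assuming single-label multi-class predictions. The helper names `f1_from_counts` and `micro_macro_f1` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; an empty denominator gives 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


def micro_macro_f1(y_true, y_pred):
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true label was something else
            fn[t] += 1  # the true label t was missed
    labels = sorted(set(y_true) | set(y_pred))
    # Macro: compute F1 per class, then average the scores.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute F1 once.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy data (hypothetical): class a dominates, class b is never predicted.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a", "c", "c"]
micro, macro = micro_macro_f1(y_true, y_pred)
print(micro, macro)
```

  On this data micro-F1 is 5/6 (it equals accuracy in the single-label case), while macro-F1 is 5/9, because the rare class b contributes an F1 of 0 with the same weight as the frequent class a.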



  1. Preprocessing and Feature Engineering 

    1) bag of words 

    2) remove high-frequency words
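
    The two preprocessing steps can be sketched as below. This is only an illustration under stated assumptions: "high-frequency" is interpreted here as a document-frequency ratio above a hypothetical threshold `max_df_ratio`, tokenization is a plain whitespace split, and the three documents are made-up toy data. A fixed stop-word list would be another common way to do step 2.

```python
from collections import Counter

# Toy corpus (hypothetical).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
tokenized = [d.lower().split() for d in docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Drop words that occur in more than max_df_ratio of the documents.
max_df_ratio = 0.9  # hypothetical threshold
vocab = sorted(w for w, n in doc_freq.items() if n / len(docs) <= max_df_ratio)
index = {w: i for i, w in enumerate(vocab)}


def bag_of_words(tokens):
    """Map a token list to a count vector of length len(vocab)."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in index:  # removed or unseen words are ignored
            vec[index[tok]] += 1
    return vec


vectors = [bag_of_words(doc) for doc in tokenized]
print(vocab)
print(vectors[0])
```

    With this corpus, "the" appears in every document and is removed, leaving a 7-word vocabulary; each document becomes a 7-dimensional count vector.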




  2.  Train classifier

  3. Evaluate the classifier

    1) split a 'development set' from the training set 

    2) k-fold cross-validation, then take avg(accuracy) over the k folds
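
    A minimal sketch of k-fold cross-validation with averaged accuracy. The helper names are hypothetical, and `train_majority` is only a stand-in classifier (it always predicts the most common training label) so that the example stays self-contained. The folds here are contiguous and assume k <= n; in practice the data would usually be shuffled first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_majority(train_labels):
    """Stand-in classifier (hypothetical): predicts the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda x: majority


def cross_validate(X, y, k):
    accuracies = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        # Train on the other k-1 folds, evaluate on the held-out fold.
        train_y = [y[i] for i in range(len(y)) if i not in held]
        model = train_majority(train_y)
        correct = sum(1 for i in held_out if model(X[i]) == y[i])
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k  # avg(accuracy)


# Toy data (hypothetical): 7 examples of class a followed by 3 of class b.
X = list(range(10))
y = ["a"] * 7 + ["b"] * 3
avg_accuracy = cross_validate(X, y, k=5)
print(avg_accuracy)
```

    The per-fold accuracies here are 1, 1, 1, 0.5 and 0, giving an average of 0.7.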




Text Classification:

  1. Input: a document or sentence

  2. Output: a class label C ∈ {c1, c2, … }

  3. Classification methods:

     • Naïve Bayes

     • Logistic regression

     • Support-vector machines

     • …

  4. Naïve Bayes

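    1) bag of words -> the features become a d-dimensional vector, with label c

    2) maximum a posteriori (MAP) probability: predict the class with the highest posterior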
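
    The two points above can be sketched as a small multinomial Naïve Bayes classifier. This is an illustrative sketch, not the MLlib implementation: the function names, the add-one (Laplace) smoothing parameter `alpha`, and the spam/ham toy documents are all assumptions made for the example. Log probabilities are used so that the product of many small likelihoods does not underflow.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log P(c) and log P(w | c) from tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-alpha smoothing (assumed) so unseen words do not get probability 0.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood, vocab


def predict_nb(model, tokens):
    """MAP decision: argmax over c of log P(c) + sum of log P(w | c)."""
    log_prior, log_likelihood, vocab = model
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training set (hypothetical).
train_docs = [
    "cheap pills buy now".split(),
    "buy cheap watches".split(),
    "meeting agenda for monday".split(),
    "project meeting notes".split(),
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "buy cheap pills".split())
print(prediction)
```

    Words outside the training vocabulary are simply skipped at prediction time, which is one common choice among several.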













PySpark MLlib:


posted @ 2020-07-20 12:16  ChevisZhang