Hung-yi Lee's Machine Learning course, Lecture 2: Classification
1.Classification
classification: x -> function -> class n
how to do classification?
training data for classification:
$(x^1,\hat{y}^1),\ (x^2,\hat{y}^2),\ (x^3,\hat{y}^3),\ (x^4,\hat{y}^4)$
ideal alternatives:
*function (model):
$f(x)$: if $g(x)>0$, output class 1
if $g(x)<0$, output class 2
*loss function
$L(f)=\sum_n \delta\big(f(x^n)\neq\hat{y}^n\big)$: the number of times f gets incorrect results on the training data
*find the best function
example: perceptron, SVM
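The three steps above can be sketched in a few lines. This is only an illustration: the linear form of g(x), the weights, and the toy data are assumptions, not values from the lecture.

```python
def g(x, w, b):
    """Hypothetical linear score g(x) = w . x + b (the notes do not fix a form for g)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def f(x, w, b):
    """Model: output class 1 if g(x) > 0, otherwise class 2."""
    return 1 if g(x, w, b) > 0 else 2

def zero_one_loss(data, w, b):
    """L(f): number of training examples that f classifies incorrectly."""
    return sum(1 for x, y in data if f(x, w, b) != y)

# Toy training set, made up for illustration: (x, class label)
train = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((-1.0, -0.5), 2), ((0.5, -2.0), 2)]

loss_a = zero_one_loss(train, (1.0, 1.0), 0.0)  # this boundary makes no mistakes
loss_b = zero_one_loss(train, (1.0, 0.0), 0.0)  # this one misclassifies (0.5, -2.0)
print(loss_a, loss_b)
```

This loss is a count, so it is not differentiable in w and b, which is why methods such as the perceptron or SVM are needed to find the best function.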
2.Gaussian distribution
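*Gaussian distribution function: $f_{\mu,\Sigma}(x)=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$, where D is the dimension of x
input: vector x; output: the probability density of sampling x
the shape of the function is determined by the mean vector $\mu$ and the covariance matrix $\Sigma$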
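A minimal sketch of the density for the two-dimensional case (D = 2), where the determinant and inverse of the covariance can be written out by hand. The function name and the test point are assumptions for illustration.

```python
import math

def gaussian_pdf_2d(x, mean, cov):
    """Density of a 2-D Gaussian at x; cov is a 2x2 nested list."""
    (a, b), (c, d) = cov
    det = a * d - b * c                       # |Sigma|
    inv = [[d / det, -b / det],               # Sigma^{-1} for a 2x2 matrix
           [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    quad = sum(dx[r] * inv[r][s] * dx[s] for r in range(2) for s in range(2))
    # (2*pi)^{D/2} with D = 2 is simply 2*pi
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

identity = [[1.0, 0.0], [0.0, 1.0]]
at_mean = gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), identity)
away = gaussian_pdf_2d((1.0, 1.0), (0.0, 0.0), identity)
print(at_mean, away)
```

The density is highest at the mean and falls off with distance from it; changing the covariance stretches or rotates the contours.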
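*maximum likelihood
a Gaussian with any mean $\mu$ and covariance matrix $\Sigma$ can generate these points, but with different likelihood
likelihood of a Gaussian with mean $\mu$ and covariance matrix $\Sigma$ = the probability that the Gaussian samples $x^1,x^2,x^3,\dots,x^n$
likelihood function: $L(\mu,\Sigma)=f_{\mu,\Sigma}(x^1)\,f_{\mu,\Sigma}(x^2)\,f_{\mu,\Sigma}(x^3)\cdots f_{\mu,\Sigma}(x^n)$
find the best parameters: $\mu^*,\Sigma^*=\arg\max_{\mu,\Sigma}L(\mu,\Sigma)$, which gives $\mu^*=\frac{1}{n}\sum_{i=1}^{n}x^i$ and $\Sigma^*=\frac{1}{n}\sum_{i=1}^{n}(x^i-\mu^*)(x^i-\mu^*)^T$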
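The closed-form maximum-likelihood estimates can be computed directly. A small sketch with made-up points (not data from the lecture):

```python
def mle_gaussian(points):
    """Maximum-likelihood mean and covariance of a list of equal-length vectors."""
    n = len(points)
    d = len(points[0])
    # mu* = (1/n) * sum of x^i
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    # Sigma* = (1/n) * sum of (x^i - mu*)(x^i - mu*)^T
    cov = [[sum((p[r] - mean[r]) * (p[s] - mean[s]) for p in points) / n
            for s in range(d)]
           for r in range(d)]
    return mean, cov

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]  # toy data, assumed for illustration
mean, cov = mle_gaussian(points)
print(mean)
print(cov)
```

Note that the maximum-likelihood covariance divides by n, not by n - 1 as the unbiased sample covariance does.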
*classification with Gaussian distribution
Bayes' rule (generative model): $P(c_1|x)=\frac{P(x|c_1)P(c_1)}{P(x|c_1)P(c_1)+P(x|c_2)P(c_2)}$
$P(x|c_1)=f_{\mu_{c_1},\Sigma_{c_1}}(x)$, $P(x|c_2)=f_{\mu_{c_2},\Sigma_{c_2}}(x)$
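*Modifying model
use different means $\mu_{c_1},\mu_{c_2}$ but a shared covariance $\Sigma_{c_1}=\Sigma_{c_2}=\Sigma$; this gives fewer parameters, because the number of parameters in $\Sigma$ grows with the square of the dimension of x
shared covariance: $\Sigma=\frac{m}{m+n}\Sigma_{c_1}+\frac{n}{m+n}\Sigma_{c_2}$, where m and n are the numbers of training examples in $c_1$ and $c_2$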
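The shared covariance is just a weighted average of the per-class estimates. A sketch, assuming m and n are the sample counts of class 1 and class 2; the matrices and counts below are made up.

```python
def shared_covariance(cov1, m, cov2, n):
    """Weighted average of two class covariances, weighted by class sample counts m and n."""
    total = m + n
    return [[(m * a + n * b) / total for a, b in zip(row1, row2)]
            for row1, row2 in zip(cov1, cov2)]

cov_c1 = [[2.0, 0.0], [0.0, 2.0]]   # hypothetical covariance of class 1 (m = 1 example)
cov_c2 = [[4.0, 2.0], [2.0, 4.0]]   # hypothetical covariance of class 2 (n = 3 examples)
shared = shared_covariance(cov_c1, 1, cov_c2, 3)
print(shared)  # [[3.5, 1.5], [1.5, 3.5]]
```

The class with more training examples pulls the shared matrix closer to its own estimate.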
*model flaw
if all dimensions of x are assumed to be independent, the model becomes a Naive Bayes classifier; this assumption may not hold for the data
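*posterior probability:
$P(c_1|x)=\frac{P(x|c_1)P(c_1)}{P(x|c_1)P(c_1)+P(x|c_2)P(c_2)}=\frac{1}{1+\frac{P(x|c_2)P(c_2)}{P(x|c_1)P(c_1)}}=\frac{1}{1+\exp(-z)}=\sigma(z)$, the sigmoid function
$z=\ln\frac{P(x|c_1)P(c_1)}{P(x|c_2)P(c_2)}$
when the two classes share the same covariance matrix, z is linear in x: $z=w\cdot x+b$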
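A numerical check that the Bayes posterior and the sigmoid of a linear function agree. This sketch assumes the simplest case, one-dimensional Gaussians with a shared variance, where $w=(\mu_1-\mu_2)/\sigma^2$ and $b=-(\mu_1^2-\mu_2^2)/(2\sigma^2)+\ln\frac{P(c_1)}{P(c_2)}$. All parameter values are made up.

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_bayes(x, mu1, mu2, var, p1):
    """P(c1 | x) computed directly from Bayes' rule."""
    num = gauss(x, mu1, var) * p1
    return num / (num + gauss(x, mu2, var) * (1 - p1))

def posterior_sigmoid(x, mu1, mu2, var, p1):
    """P(c1 | x) computed as sigmoid(w * x + b) with the shared-variance w and b."""
    w = (mu1 - mu2) / var
    b = -(mu1 ** 2 - mu2 ** 2) / (2 * var) + math.log(p1 / (1 - p1))
    return 1 / (1 + math.exp(-(w * x + b)))

# Hypothetical parameters: class means, shared variance, prior of class 1
mu1, mu2, var, p1 = 1.0, -1.0, 2.0, 0.3
gaps = [abs(posterior_bayes(x, mu1, mu2, var, p1) - posterior_sigmoid(x, mu1, mu2, var, p1))
        for x in (-1.0, 0.0, 0.5, 2.0)]
print(max(gaps))  # differs only by floating-point rounding
```

The quadratic terms in x cancel only because the variance is shared; with different variances z would contain an x squared term.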
3.Logistic Regression
$P_{w,b}(c_1|x)=\sigma(z)$, $z=\ln\frac{P(x|c_1)P(c_1)}{P(x|c_2)P(c_2)}=w\cdot x+b$, $\sigma(z)=\frac{1}{1+\exp(-z)}$
*step 1 function set: $f_{w,b}(x)=P_{w,b}(c_1|x)$
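*step 2 loss function of Logistic Regression
training data:
x: $x^1,\ x^2,\ x^3,\ x^4,\ \dots,\ x^n$
class: $c_1,\ c_2,\ c_1,\ c_1,\ \dots,\ c_2$
$\hat{y}$ (1 for $c_1$, 0 for $c_2$): $1,\ 0,\ 1,\ 1,\ \dots,\ 0$
Assume the data is generated based on $f_{w,b}(x)=P_{w,b}(c_1|x)$
$L(w,b)=f_{w,b}(x^1)\big(1-f_{w,b}(x^2)\big)f_{w,b}(x^3)f_{w,b}(x^4)\cdots\big(1-f_{w,b}(x^n)\big)$
$L(w,b)=\prod_i f_{w,b}(x^i)^{\hat{y}^i}\big(1-f_{w,b}(x^i)\big)^{1-\hat{y}^i}$, $w^*,b^*=\arg\max_{w,b}L(w,b)=\arg\min_{w,b}\big(-\ln L(w,b)\big)$
$-\ln L(w,b)=-\ln f_{w,b}(x^1)-\ln\big(1-f_{w,b}(x^2)\big)-\ln f_{w,b}(x^3)-\ln f_{w,b}(x^4)-\cdots-\ln\big(1-f_{w,b}(x^n)\big)$
$=\sum_i-\big[\hat{y}^i\ln f_{w,b}(x^i)+(1-\hat{y}^i)\ln\big(1-f_{w,b}(x^i)\big)\big]$: each term is the cross entropy between two Bernoulli distributions (the label and the model output)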
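A short check that the cross-entropy sum equals the negative log of the likelihood product. The labels follow the 1 0 1 1 0 pattern above; the model outputs are made-up numbers.

```python
import math

def neg_log_likelihood(y_hat, f_x):
    """-ln L: multiply f(x) for class-1 examples and 1 - f(x) for class-2 examples."""
    likelihood = 1.0
    for y, p in zip(y_hat, f_x):
        likelihood *= p if y == 1 else 1 - p
    return -math.log(likelihood)

def cross_entropy(y_hat, f_x):
    """Sum over examples of -[y ln f(x) + (1 - y) ln(1 - f(x))]."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_hat, f_x))

y_hat = [1, 0, 1, 1, 0]              # 1 = class c1, 0 = class c2
f_x = [0.9, 0.2, 0.7, 0.6, 0.1]      # hypothetical model outputs f(x^i)
nll = neg_log_likelihood(y_hat, f_x)
ce = cross_entropy(y_hat, f_x)
print(nll, ce)
```

Writing the loss as a sum of logs is also numerically safer than forming the product, which underflows for large n.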
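*step 3 find the best function
with $z^n=w\cdot x^n+b$ and $f_{w,b}(x^n)=\sigma(z^n)$:
$\frac{\partial\ln f_{w,b}(x^n)}{\partial w_i}=\big(1-\sigma(z^n)\big)x_i^n$
$\frac{\partial\ln\big(1-f_{w,b}(x^n)\big)}{\partial w_i}=-\sigma(z^n)\,x_i^n$
$\frac{\partial\big(-\ln L(w,b)\big)}{\partial w_i}=\sum_n-\big(\hat{y}^n-f_{w,b}(x^n)\big)x_i^n$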
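The gradient above plugs straight into gradient descent. A minimal sketch, assuming plain batch updates $w_i \leftarrow w_i-\eta\,\partial(-\ln L)/\partial w_i$; the learning rate, step count, and toy data are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, steps=1000):
    """Batch gradient descent on -ln L(w, b) for logistic regression."""
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for i in range(d):
                grad_w[i] += -err * x[i]   # -(y^n - f(x^n)) * x_i^n
            grad_b += -err                 # same form, with input fixed to 1
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy, linearly separable data (made up): label 1 = class c1, 0 = class c2
xs = [(0.0, 0.0), (0.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
ys = [0, 0, 1, 1]
w, b = train_logistic(xs, ys)
preds = [1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5 else 0 for x in xs]
print(preds)
```

The update is larger when the output is far from the target, since each term is scaled by the error $\hat{y}^n-f_{w,b}(x^n)$.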
4.Multi-class classification
*softmax
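c1: $w_1,b_1$, $z_1=w_1\cdot x+b_1$ -> $e^{z_1}/\sum_j e^{z_j}$
c2: $w_2,b_2$, $z_2=w_2\cdot x+b_2$ -> $e^{z_2}/\sum_j e^{z_j}$
c3: $w_3,b_3$, $z_3=w_3\cdot x+b_3$ -> $e^{z_3}/\sum_j e^{z_j}$
softmax: $z_i$ -> $y_i=e^{z_i}/\sum_j e^{z_j}$
the softmax outputs behave like probabilities: $0<y_i<1$, $\sum_i y_i=1$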
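A small softmax sketch. Subtracting the maximum before exponentiating is a common numerical-stability step that is not part of the notes; it does not change the result. The input values are chosen for illustration.

```python
import math

def softmax(z):
    """Map scores z_i to y_i = exp(z_i) / sum_j exp(z_j)."""
    m = max(z)                               # stability shift (an addition, not in the notes)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

y = softmax([3.0, 1.0, -3.0])
print([round(yi, 2) for yi in y])  # [0.88, 0.12, 0.0]
```

The exponential widens the gap between large and small scores, so the largest z takes most of the probability mass.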
x -> $(z_1,z_2,z_3)$ -> softmax -> $y=(y_1,y_2,y_3)$
loss function: cross entropy between y and the target $\hat{y}$: $-\sum_i\hat{y}_i\ln y_i$
target $\hat{y}$: class 1: $[1\ 0\ 0]^T$, class 2: $[0\ 1\ 0]^T$, class 3: $[0\ 0\ 1]^T$
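Putting the pieces together for three classes: linear scores, softmax, then cross entropy against a one-hot target. The weights, biases, and input below are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(x, W, b):
    """z_k = w_k . x + b_k for each class k, followed by softmax."""
    z = [sum(wi * xi for wi, xi in zip(w_k, x)) + b_k for w_k, b_k in zip(W, b)]
    return softmax(z)

def cross_entropy(target, y):
    """-sum_i target_i * ln(y_i); only the true class contributes for a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(target, y) if t > 0)

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # hypothetical w_1, w_2, w_3
b = [0.0, 0.0, 0.0]                          # hypothetical b_1, b_2, b_3
y = predict((2.0, 0.0), W, b)                # scores are z = [2, 0, -2]
loss_if_class1 = cross_entropy([1, 0, 0], y)
loss_if_class3 = cross_entropy([0, 0, 1], y)
print(y, loss_if_class1, loss_if_class3)
```

With a one-hot target the loss reduces to $-\ln y_k$ for the true class k, so it is small when the model puts high probability on the correct class.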
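*a logistic regression model can itself perform feature transformation, which helps when the classes are not linearly separable in the original features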
*cascading logistic regression models
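$x_1,x_2$ -> $z_1$ -> sigmoid -> $x_1'$
$x_1,x_2$ -> $z_2$ -> sigmoid -> $x_2'$
$x_1',x_2'$ -> $z_3$ -> sigmoid -> y
the first stage ($z_1,z_2$) performs feature transformation and the last stage ($z_3$) performs classification; each logistic regression unit acts as a neuron, and the cascade forms a neural network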
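A sketch of the cascade on an XOR-style problem, which a single logistic regression cannot separate. The weights here are hand-picked for illustration (an assumption, not learned and not from the lecture): the first two units transform the features, and the third classifies the transformed point.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(x, w, b):
    """One logistic regression model: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def cascade(x):
    # Stage 1: feature transformation with two hand-picked units
    x1_new = logistic_unit(x, (20.0, 20.0), -10.0)    # near 1 unless x = (0, 0)
    x2_new = logistic_unit(x, (-20.0, -20.0), 30.0)   # near 1 unless x = (1, 1)
    # Stage 2: classification in the transformed space (near 1 only if both are near 1)
    return logistic_unit((x1_new, x2_new), (20.0, 20.0), -30.0)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [1 if cascade(p) > 0.5 else 0 for p in points]
print(labels)  # [0, 1, 1, 0]
```

In practice the weights of all three units are learned jointly rather than set by hand.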