  • input may be preprocessed

    • e.g. rescaled to a fixed image size
    • also called feature extraction
    • speeds up computation
    • makes the problem easier to solve
  • supervised learning

    • classification - discrete output
    • regression - continuous output
  • unsupervised learning

    • visualization
    • clustering
    • density estimation
      ...
  • technique of reinforcement learning: find suitable actions to take in a given situation in order to maximize the reward

  • functions which are linear in the unknown parameters are called linear models

    • for example, the polynomial model \(y(x, \mathbf{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_M x^M\) is nonlinear in the input x but linear in the unknown parameters \(\mathbf{\beta}\)
    • implementation issues:
      • choosing the values of the coefficients/parameters/weights: determined by fitting the data, i.e. minimizing an error (cost) function
      • choosing the order M: model selection/comparison
        • overfitting
        • FACT: as M increases, the magnitude of the coefficients typically gets larger
        • regularization to fight overfitting
          • involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values
          • add \(\lambda / 2 \, \| \mathbf{\beta} \|^2\), where \(\lambda\) controls the importance of the regularization term
            • with a quadratic regularizer this is called ridge regression
          • known as weight decay in neural networks
          • \(\lambda\) suppresses over-fitting by shrinking the weights, but if \(\lambda\) is too large the weights are driven toward 0 and the fit becomes poor (see the sketch after this list)
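
A rough sketch of the regularized fit described above (my own illustration, not from the original notes): `fit_polynomial_ridge`, the toy \(\sin(2\pi x)\) data, and the closed-form ridge-regression normal equations are all assumptions of this example; it only shows how the penalty \(\lambda/2\,\|\mathbf{\beta}\|^2\) keeps the coefficients small.

```python
# Sketch only: ridge-regularized polynomial curve fitting (function name and
# toy data are illustrative assumptions, not from the original notes).
import numpy as np

def fit_polynomial_ridge(x, y, M, lam):
    """Minimize 1/2 * sum_n (Phi(x_n) @ beta - y_n)**2 + lam/2 * ||beta||**2."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M
    # Closed-form normal equations of ridge regression:
    # (Phi^T Phi + lam * I) beta = Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

# Toy data: noisy samples of sin(2*pi*x), as in the usual curve-fitting example.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

for lam in (0.0, 1e-3):
    beta = fit_polynomial_ridge(x, y, M=9, lam=lam)
    print(f"lambda={lam:g}  max|beta|={np.abs(beta).max():.2f}")
```

With M = 9 and only 10 data points, the \(\lambda = 0\) run typically yields coefficients orders of magnitude larger than the run with a small positive \(\lambda\), matching the FACT about growing coefficient magnitudes noted above.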

probability theory#

  • probability theory is introduced to make the treatment more rigorous and scientific
  • expectation
    • def: weighted average of a function
    • if we are given a finite number N of points drawn from the probability distribution or probability density
      • \(E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)\)
    • consider expectation of functions of several variables eg f(x,y)
      • the expectation carries a subscript indicating which variable is averaged over:
      • \(E_x[f(x,y)]\) and \(E_y[f(x,y)]\)
  • variance: \(Var[x] = E[x^2] - E[x]^2\)
    • consider functions of several variables eg f(x,y)
      • covariance: \(\mathrm{cov}[x, y] = E_{x,y}[\{x - E[x]\}\{y^{T} - E[y^{T}]\}]\) (a numerical sketch follows this list)
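
A small numerical check of the finite-sample formulas above (a sketch of my own; the distributions of x and y are made up for illustration): expectation, variance, and covariance are estimated by averaging over N drawn samples.

```python
# Sketch only: finite-sample estimates of E[f], var[x] and cov[x, y]
# (the distributions of x and y are assumptions for this illustration).
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=N)          # samples x_1, ..., x_N
y = 2.0 * x + rng.normal(scale=0.5, size=N)         # y correlated with x

f = lambda t: t ** 2                                 # some function f(x)

E_f    = np.mean(f(x))                               # E[f] ~ (1/N) * sum_n f(x_n)
var_x  = np.mean(x ** 2) - np.mean(x) ** 2           # var[x] = E[x^2] - E[x]^2
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))    # cov[x, y]

print(E_f, var_x, cov_xy)                            # roughly 1.0, 1.0, 2.0
```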

interpretation of probabilities#

  • popular: classical or frequentist way
    • the probability P(A) of an uncertain event A is defined by the frequency of that event in previous observations
  • another point of view: the Bayesian view - probability provides a quantification of uncertainty
    • for a future event we may have no historical data, so we cannot count frequencies
    • but we can measure the belief in a statement \(a\) given some 'knowledge' K, denoted P(a|K); different K can give different P(a|K), and even the same K can yield different P(a|K) -- the belief is subjective
  • Bayes rule
    • consider conditional probabilities
    • \(P(A|B) = P(B|A) P(A)/P(B)\)
      • interpretation: updating our belief about a hypothesis A in the light of new evidence B
        • in the curve-fitting setting, the likelihood expresses the belief about the outputs y given the inputs and the parameters
      • P(A|B): posterior belief
      • P(A): prior belief
      • P(B|A): likelihood, i.e. how probable the observed evidence B is if the hypothesis A (our model) is true
      • P(B) is computed by marginalisation: \(P(B) = \sum_{i} P(B|A_i)\, P(A_i)\)
    • in machine learning, Bayes' theorem is used to convert a prior probability \(P(A) = P(\mathbf{\beta})\) into a posterior probability \(P(A|B) = P(\mathbf{\beta}|\mathbf{y})\) by incorporating the evidence provided by the observed data
      • for \(\mathbf{\beta}\) in the polynomial curve-fitting model, we can apply Bayes' theorem:
      • \(P(\mathbf{\beta} | \mathbf{y}) = \frac{P(\mathbf{y}|\mathbf{\beta})\, P(\mathbf{\beta})}{P(\mathbf{y})}\)
        • given data \(\{y_1, y_2, \ldots\}\) we want to know \(\beta\), which we cannot obtain directly. \(P(\mathbf{\beta} | \mathbf{y})\) := posterior probability
        • \(P(\beta)\):= prior probability; our assumption of \(\beta\)
        • \({P(\mathbf{y})}\):= normalization constant since the given data is fixed
        • \(P(\mathbf{y}|\mathbf{\beta})\):= likelihood function;
          • can be viewed as a function of the parameters \(\beta\)
          • not a probability distribution over \(\beta\), so its integral w.r.t. \(\beta\) need not equal 1
        • state Bayes' theorem as: posterior ∝ likelihood × prior, viewing all of these as functions of the parameters \(\beta\)
        • integrating both sides over \(\beta\) gives the normalization: \(P(\mathbf{y}) = \int P(\mathbf{y}|\mathbf{\beta})\, P(\mathbf{\beta})\, d\mathbf{\beta}\) (a grid-based sketch follows this list)
        • issue: the need to marginalize (sum or integrate) over the whole of parameter space
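
A minimal sketch of posterior ∝ likelihood × prior on a discrete grid (my own example, not from the notes): the parameter \(\beta\) is taken to be a coin's probability of heads, the evidence \(y\) is a set of tosses, and \(P(\mathbf{y})\) is obtained by summing likelihood × prior over the whole parameter grid, mirroring the integral above. The flat prior and the coin model are assumptions of this illustration.

```python
# Sketch only: posterior ∝ likelihood × prior on a 1-D parameter grid
# (coin model and flat prior are assumptions made for this illustration).
import numpy as np

beta = np.linspace(0.0, 1.0, 1001)       # grid over the parameter beta = P(heads)
d = beta[1] - beta[0]                    # grid spacing for the integral

prior = np.ones_like(beta)               # flat prior p(beta)
prior /= prior.sum() * d                 # normalize so it integrates to 1

y = np.array([1, 1, 1])                  # observed data: three tosses, all heads
heads = y.sum()
likelihood = beta ** heads * (1 - beta) ** (len(y) - heads)   # p(y | beta)

p_y = (likelihood * prior).sum() * d     # p(y) = ∫ p(y|beta) p(beta) dbeta
posterior = likelihood * prior / p_y     # p(beta | y), integrates to 1

print(beta[np.argmax(posterior)])        # posterior mode; 1.0 under the flat prior
```

The last line also previews the drawback of pure maximum likelihood discussed in the next section: with only three heads observed, the likelihood alone peaks at \(\beta = 1\).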

different view of likelihood function#

  • likelihood function: \(P(\mathbf{y}|\mathbf{\beta})\)
  • from the frequentist interpretation:
    • the parameter \(\beta\) is considered a fixed quantity whose value is determined by an 'estimator'
    • a widely used frequentist estimator is maximum likelihood, in which \(\beta\) is set to the value that maximizes the likelihood function
    • i.e. choosing \(\beta\) such that the probability of the observed data is maximized
    • in practice, the negative log of the likelihood function is used as the error function; since the negative log is monotonically decreasing, maximizing the likelihood is equivalent to minimizing the error
    • one approach to determining frequentist error bars is the bootstrap (a minimal sketch appears at the end of this section):
      • s1: from the existing dataset (size N), create L new datasets (each of size N) by drawing points at random with replacement, so some points may appear several times while others are left out
      • s2: look at the variability of predictions between the different bootstrap datasets, then evaluate the accuracy of the parameter estimates
    • drawback: may lead to extreme conclusions when the dataset is small or unrepresentative, e.g. a fair-looking coin is tossed three times and lands heads each time; maximum likelihood then sets the parameter \(\beta\) so that P(heads) = 1
  • from the Bayesian viewpoint:
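
A minimal sketch of the bootstrap procedure described under the frequentist view above (my own illustration; the Gaussian toy data and the sample mean as the estimator are assumptions of this example): resample the dataset with replacement L times, re-run the estimator on each bootstrap dataset, and use the spread of the resulting estimates as an error bar.

```python
# Sketch only: bootstrap error bars (toy data and estimator are assumptions).
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=1.0, size=30)   # original dataset of size N
N, L = len(data), 1000

estimates = np.empty(L)
for i in range(L):
    idx = rng.integers(0, N, size=N)             # s1: draw N points with replacement
    estimates[i] = data[idx].mean()              # re-run the estimator (here: the mean)

# s2: the spread of the bootstrap estimates gives an error bar for the estimate.
print("estimate:", data.mean())
print("bootstrap standard error:", estimates.std())
```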