  • input may be preprocessed

    • e.g. rescaled to a fixed image size
    • also called feature extraction
    • speeds up computation
    • makes the problem easier to solve
  • supervised learning

    • classification - discrete output
    • regression - continuous output
  • unsupervised learning

    • visualization
    • clustering
    • density estimation
      ...
  • technique of reinforcement learning: find suitable actions to take in a given situation in order to maximize the reward

  • functions which are linear in the unknown parameters are called linear models

    • for example, the polynomial model \(y(x, \mathbf{\beta}) = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_M x^M\) is nonlinear in the input x but linear in the unknown parameters \(\mathbf{\beta}\)
    • implementation issues:
      • choosing the values of the coefficients/parameters/weights: determined by fitting the data, i.e. minimizing an error (cost) function
      • choosing the order M: model selection/comparison
        • overfitting
        • FACT: as M increases, the magnitude of the coefficients typically gets larger
        • regularization to fight overfitting
          • involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values
          • add \(\lambda / 2 \, \| \mathbf{\beta} \|^2\), where \(\lambda\) controls the importance of the regularization term
            • with a quadratic regularizer this is called ridge regression
          • known as weight decay in neural networks
          • \(\lambda\) suppresses over-fitting by shrinking the weights, but if \(\lambda\) is too large the weights are driven toward 0 and the fit becomes poor (see the sketch after this list)
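
A rough sketch of the regularized fit described above (my own illustration, not from the original notes): `fit_polynomial_ridge`, the toy \(\sin(2\pi x)\) data, and the closed-form ridge-regression normal equations are all assumptions of this example; it only shows how the penalty \(\lambda/2\,\|\mathbf{\beta}\|^2\) keeps the coefficients small.

```python
# Sketch only: ridge-regularized polynomial curve fitting (function name and
# toy data are illustrative assumptions, not from the original notes).
import numpy as np

def fit_polynomial_ridge(x, y, M, lam):
    """Minimize 1/2 * sum_n (Phi(x_n) @ beta - y_n)**2 + lam/2 * ||beta||**2."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, ..., x^M
    # Closed-form normal equations of ridge regression:
    # (Phi^T Phi + lam * I) beta = Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

# Toy data: noisy samples of sin(2*pi*x), as in the usual curve-fitting example.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

for lam in (0.0, 1e-3):
    beta = fit_polynomial_ridge(x, y, M=9, lam=lam)
    print(f"lambda={lam:g}  max|beta|={np.abs(beta).max():.2f}")
```

With M = 9 and only 10 data points, the \(\lambda = 0\) run typically yields coefficients orders of magnitude larger than the run with a small positive \(\lambda\), matching the FACT about growing coefficient magnitudes noted above.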

probability theory#

  • probability theory is introduced to make the treatment more rigorous and scientific
  • expectation
    • def: weighted average of a function
    • if we are given a finite number N of points drawn from the probability distribution or probability density
      • \(E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n)\)
    • consider expectation of functions of several variables eg f(x,y)
      • the expectation carries a subscript indicating which variable is averaged over:
      • \(E_x[f(x,y)]\) and \(E_y[f(x,y)]\)
  • variance: \(Var[x] = E[x^2] - E[x]^2\)
    • consider functions of several variables eg f(x,y)
      • covariance: \(\mathrm{cov}[x, y] = E_{x,y}[\{x - E[x]\}\{y^{T} - E[y^{T}]\}]\) (a numerical sketch follows this list)
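
A small numerical check of the finite-sample formulas above (a sketch of my own; the distributions of x and y are made up for illustration): expectation, variance, and covariance are estimated by averaging over N drawn samples.

```python
# Sketch only: finite-sample estimates of E[f], var[x] and cov[x, y]
# (the distributions of x and y are assumptions for this illustration).
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(loc=0.0, scale=1.0, size=N)          # samples x_1, ..., x_N
y = 2.0 * x + rng.normal(scale=0.5, size=N)         # y correlated with x

f = lambda t: t ** 2                                 # some function f(x)

E_f    = np.mean(f(x))                               # E[f] ~ (1/N) * sum_n f(x_n)
var_x  = np.mean(x ** 2) - np.mean(x) ** 2           # var[x] = E[x^2] - E[x]^2
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))    # cov[x, y]

print(E_f, var_x, cov_xy)                            # roughly 1.0, 1.0, 2.0
```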

interpretation of probabilities#

  • popular: classical or frequentist way
    • the probability P(A) of an uncertain event A is defined by the frequency of that event in previous observations
  • another point of view: the Bayesian view - probability provides a quantification of uncertainty
    • for a future event we may have no historical data, so we cannot count frequencies
    • but we can measure the belief in a statement \(a\) given some 'knowledge' K, denoted P(a|K); different K can give different P(a|K), and even the same K can yield different P(a|K) -- the belief is subjective
  • Bayes rule
    • consider conditional probabilities
    • \(P(A|B) = P(B|A) P(A)/P(B)\)
      • interpretation: updating our belief about a hypothesis A in the light of new evidence B
        • in the curve-fitting setting, the likelihood expresses the belief about the outputs y given the inputs and the parameters
      • P(A|B): posterior belief
      • P(A): prior belief
      • P(B|A): likelihood, i.e. how probable the observed evidence B is if the hypothesis A (our model) is true
      • P(B) is computed by marginalisation: \(P(B) = \sum_{i} P(B|A_i)\, P(A_i)\)
    • in machine learning, Bayes' theorem is used to convert a prior probability \(P(A) = P(\mathbf{\beta})\) into a posterior probability \(P(A|B) = P(\mathbf{\beta}|\mathbf{y})\) by incorporating the evidence provided by the observed data
      • for \(\mathbf{\beta}\) in the polynomial curve-fitting model, we can apply Bayes' theorem:
      • \(P(\mathbf{\beta} | \mathbf{y}) = \frac{P(\mathbf{y}|\mathbf{\beta})\, P(\mathbf{\beta})}{P(\mathbf{y})}\)
        • given data \(\{y_1, y_2, \ldots\}\) we want to know \(\beta\), which we cannot obtain directly. \(P(\mathbf{\beta} | \mathbf{y})\) := posterior probability
        • \(P(\beta)\):= prior probability; our assumption of \(\beta\)
        • \({P(\mathbf{y})}\):= normalization constant since the given data is fixed
        • \(P(\mathbf{y}|\mathbf{\beta})\):= likelihood function;
          • can be viewed as a function of the parameters \(\beta\)
          • not a probability distribution over \(\beta\), so its integral w.r.t. \(\beta\) need not equal 1
        • state Bayes' theorem as: posterior ∝ likelihood × prior, viewing all of these as functions of the parameters \(\beta\)
        • integrating both sides over \(\beta\) gives the normalization: \(P(\mathbf{y}) = \int P(\mathbf{y}|\mathbf{\beta})\, P(\mathbf{\beta})\, d\mathbf{\beta}\) (a grid-based sketch follows this list)
        • issue: the need to marginalize (sum or integrate) over the whole of parameter space
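
A minimal sketch of posterior ∝ likelihood × prior on a discrete grid (my own example, not from the notes): the parameter \(\beta\) is taken to be a coin's probability of heads, the evidence \(y\) is a set of tosses, and \(P(\mathbf{y})\) is obtained by summing likelihood × prior over the whole parameter grid, mirroring the integral above. The flat prior and the coin model are assumptions of this illustration.

```python
# Sketch only: posterior ∝ likelihood × prior on a 1-D parameter grid
# (coin model and flat prior are assumptions made for this illustration).
import numpy as np

beta = np.linspace(0.0, 1.0, 1001)       # grid over the parameter beta = P(heads)
d = beta[1] - beta[0]                    # grid spacing for the integral

prior = np.ones_like(beta)               # flat prior p(beta)
prior /= prior.sum() * d                 # normalize so it integrates to 1

y = np.array([1, 1, 1])                  # observed data: three tosses, all heads
heads = y.sum()
likelihood = beta ** heads * (1 - beta) ** (len(y) - heads)   # p(y | beta)

p_y = (likelihood * prior).sum() * d     # p(y) = ∫ p(y|beta) p(beta) dbeta
posterior = likelihood * prior / p_y     # p(beta | y), integrates to 1

print(beta[np.argmax(posterior)])        # posterior mode; 1.0 under the flat prior
```

The last line also previews the drawback of pure maximum likelihood discussed in the next section: with only three heads observed, the likelihood alone peaks at \(\beta = 1\).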

different view of likelihood function#

  • likelihood function: \(P(\mathbf{y}|\mathbf{\beta})\)
  • from the frequentist interpretation:
    • the parameter \(\beta\) is considered a fixed quantity whose value is determined by an 'estimator'
    • a widely used frequentist estimator is maximum likelihood, in which \(\beta\) is set to the value that maximizes the likelihood function
    • i.e. choosing \(\beta\) such that the probability of the observed data is maximized
    • in practice, the negative log of the likelihood function is used as the error function; since the negative log is monotonically decreasing, maximizing the likelihood is equivalent to minimizing the error
    • one approach to determining frequentist error bars is the bootstrap (a minimal sketch appears at the end of this section):
      • s1: from the existing dataset (size N), create L new datasets (each of size N) by drawing points at random with replacement, so some points may appear several times while others are left out
      • s2: look at the variability of predictions between the different bootstrap datasets, then evaluate the accuracy of the parameter estimates
    • drawback: may lead to extreme conclusions when the dataset is small or unrepresentative, e.g. a fair-looking coin is tossed three times and lands heads each time; maximum likelihood then sets the parameter \(\beta\) so that P(heads) = 1
  • from the Bayesian viewpoint:
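
A minimal sketch of the bootstrap procedure described under the frequentist view above (my own illustration; the Gaussian toy data and the sample mean as the estimator are assumptions of this example): resample the dataset with replacement L times, re-run the estimator on each bootstrap dataset, and use the spread of the resulting estimates as an error bar.

```python
# Sketch only: bootstrap error bars (toy data and estimator are assumptions).
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.5, scale=1.0, size=30)   # original dataset of size N
N, L = len(data), 1000

estimates = np.empty(L)
for i in range(L):
    idx = rng.integers(0, N, size=N)             # s1: draw N points with replacement
    estimates[i] = data[idx].mean()              # re-run the estimator (here: the mean)

# s2: the spread of the bootstrap estimates gives an error bar for the estimate.
print("estimate:", data.mean())
print("bootstrap standard error:", estimates.std())
```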