week 1 - machine learning - Andrew Ng - Coursera

 

1 week 1

 

1.1 intro

 

1.1.1 what is ML?

  1. definitions
    • Arthur Samuel (1959): the field of study that gives computers the ability to learn without being explicitly programmed.
    • Tom Mitchell (1998), well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

E: experience
T: task
P: performance measure

  2. classifications
    • supervised learning
    • unsupervised learning

1.1.2 supervised learning

categories:

  1. regression problem - continuous output
  2. classification problem - discrete output
    • example: email spam/not spam

In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

examples of regression

  • housing price prediction

Example 2:

(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture

(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.

1.1.3 unsupervised learning

example: Google News, which clusters related news stories together

applications:

  1. organize computing clusters
  2. social network analysis
  3. market segmentation
  4. astronomical data analysis

example 2: cocktail party

Non-clustering: the "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds).

1.1.4 test 1

I answered the following question incorrectly.

desc: Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.

1.2 Linear Regression with One Variable

 

1.2.1 model representation

m: number of training examples

univariate = one variable
x(i): the i-th input value (feature) in the training set
y(i): the i-th output value (target) in the training set

h: hypothesis, h(x) = a + bx (in the course notation hθ(x) = θ0 + θ1·x, with a = θ0 and b = θ1)

house price prediction: regression vs classification

target variable     problem type
continuous          regression
discrete values     classification

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Linear regression predicts a real-valued output based on an input value. We discuss the application of linear regression to housing price prediction, present the notion of a cost function, and introduce the gradient descent method for learning.
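The hypothesis h(x) = a + bx from the model representation above can be sketched in a few lines of Python. The parameter values and living areas here are made up purely for illustration; they are not from the course data.

```python
def hypothesis(x, a, b):
    """Univariate linear hypothesis h(x) = a + b*x."""
    return a + b * x


# Hypothetical parameters and inputs, chosen only for illustration.
a, b = 50.0, 0.25
living_areas = [1000.0, 1500.0, 2000.0]  # made-up x(i) values

# The output is a real number for every input, which is what makes
# this a regression model rather than a classifier.
predictions = [hypothesis(x, a, b) for x in living_areas]
print(predictions)
```

Learning then amounts to choosing a and b so that these predictions are close to the known y(i) values.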

1.2.2 cost function–J(θ)

octave code: J = sum((X*theta - y).^2)/(2*m)

an example is the least-squares error (least-squares estimation)

m: number of training examples
J(θ): cost function

goal: minimize the cost function.

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from the x's and the actual outputs y's.

\[ J(\theta_0, \theta_1) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \dfrac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2 \]

To break it apart, it is \(\frac{1}{2}\bar{x}\), where \(\bar{x}\) is the mean of the squares of \(h_\theta(x_i) - y_i\), that is, of the difference between the predicted value and the actual value.

This function is otherwise called the "squared error function" or "mean squared error". The mean is halved \(\left(\frac{1}{2}\right)\) as a convenience for the computation of gradient descent, as the derivative of the square function will cancel out the \(\frac{1}{2}\) term.

[figure: summary of what the cost function does]
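As a rough sketch, the squared error cost can be computed directly from its definition. The three training points are a toy data set assumed here for illustration, and the function name `cost` is likewise just a label for this example.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(xs)
    squared_errors = [(theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / (2 * m)


# Toy training set lying exactly on the line y = x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is zero
print(cost(0.0, 0.5, xs, ys))  # worse fit: (0.25 + 1 + 2.25) / 6
```

The second call shows the halved mean at work: the squared errors sum to 3.5 and are divided by 2m = 6.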

contour plot: a graph that contains many contour lines

feature:
  • a contour line of a two-variable function has a constant value at all points on the same line.
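To make the contour-line property concrete, the sketch below evaluates the squared error cost at two different parameter pairs that happen to give the same value, so both points would sit on the same contour line of J. The data set and the chosen parameter pairs are assumptions made for this illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) for univariate linear regression."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Toy data on the line y = x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Two different points in (theta0, theta1) space, on either side of the minimum.
j_low = cost(0.0, 0.5, xs, ys)
j_high = cost(0.0, 1.5, xs, ys)

# Equal cost means both parameter pairs lie on the same contour line.
print(j_low == j_high)
```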

1.2.3 gradient descent

goal: minimize the cost function J(θ0, θ1)

an algorithm for automatically finding the values of θ0 and θ1 that minimize the cost function

To visualize it, we put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.

https://www.hackerearth.com/blog/developers/gradient-descent-algorithm-linear-regression/
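These notes do not spell out the update rule, so the following is only a minimal sketch of batch gradient descent for the univariate hypothesis, using the standard simultaneous update θj := θj − α · ∂J/∂θj. The learning rate `alpha`, the iteration count, and the toy data are assumptions chosen so the example converges; they are not values from the course.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=2000):
    """Minimize the squared error cost for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients use the old parameter values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x (illustrative only).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

theta0, theta1 = gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))
```

With these settings the parameters approach θ0 = 1 and θ1 = 2. A learning rate that is too large would make the updates diverge instead.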

1.2.4 gradient descent for linear regression

1.3 quiz 1

1.4 linear algebra review

This optional module provides a refresher on linear algebra concepts. Basic understanding of linear algebra is necessary for the rest of the course, especially as we begin to cover models with multiple variables.

1.4.1 matrices and vectors

  • \(A_{ij}\) refers to the element in the ith row and jth column of matrix A.
  • A vector with 'n' rows is referred to as an 'n'-dimensional vector.
  • \(v_i\) refers to the element in the ith row of the vector.
  • In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
  • Matrices are usually denoted by uppercase names while vectors are lowercase.
  • "Scalar" means that an object is a single value, not a vector or matrix.
  • \(\mathbb{R}\) refers to the set of scalar real numbers.
  • \(\mathbb{R}^n\) refers to the set of n-dimensional vectors of real numbers.

% The ; denotes we are going back to a new row.
> A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]

A =

    1    2    3
    4    5    6
    7    8    9
   10   11   12

% Get the dimension of the matrix A where m = rows and n = columns
> [m,n] = size(A)

m = 4
n = 3

% You could also store it this way
> dim_A = size(A)

dim_A =

   4   3

% let's index into the 2nd row 3rd column of matrix A
> A_23 = A(2,3)

A_23 =  6

% Initialize a vector
> v = [1;2;3]

v =

   1
   2
   3

% Get the dimension of the vector v
> dim_v = size(v)

dim_v =

   3   1
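Since the notes point out that some languages are 0-indexed, here is a small Python sketch of the same lookups. The helper names `element` and `size` are hypothetical, introduced only to translate the 1-indexed math notation into 0-indexed list access.

```python
# The 4 x 3 matrix from the Octave session, stored as a list of rows.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9],
     [10, 11, 12]]


def size(matrix):
    """Return (rows, columns), like Octave's size(A)."""
    return len(matrix), len(matrix[0])


def element(matrix, i, j):
    """Return A_ij using the 1-indexed convention of the notes."""
    return matrix[i - 1][j - 1]


print(size(A))           # (4, 3)
print(element(A, 2, 3))  # 6, the same entry as A(2,3) in Octave
print(A[1][2])           # the same entry with native 0-indexed access
```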

1.4.2 addition and scalar multiplication

+ addition
- subtraction

note: To add or subtract two matrices, their dimensions must be the same.

[a b; c d] + [w x; y z] = [a+w b+x; c+y d+z]

[a b; c d] − [w x; y z] = [a−w b−x; c−y d−z]

% Initialize matrix A and B
A = [1, 2, 4; 5, 3, 2]
B = [1, 3, 4; 1, 1, 1]

A =
   1   2   4
   5   3   2

% Initialize constant s
s = 2

% See how element-wise addition works
addAB = A + B

% See how element-wise subtraction works
subAB = A - B

% See how scalar multiplication works
multAs = A * s

% Divide A by s
divAs = A / s

% What happens if we have a Matrix + scalar?
addAs = A + s

addAs =

   3   4   6
   7   5   4
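The element-wise rules above can be sketched in plain Python with matrices stored as lists of rows. The function names `add` and `scale` are illustrative choices, and the dimension check mirrors the note that both matrices must have the same size.

```python
def add(A, B):
    """Element-wise sum of two matrices with identical dimensions."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same dimensions")
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scale(A, s):
    """Multiply every element of A by the scalar s."""
    return [[a * s for a in row] for row in A]


# Same matrices as in the Octave exercise.
A = [[1, 2, 4], [5, 3, 2]]
B = [[1, 3, 4], [1, 1, 1]]

print(add(A, B))    # [[2, 5, 8], [6, 4, 3]]
print(scale(A, 2))  # [[2, 4, 8], [10, 6, 4]]
```

Unlike Octave, this sketch does not broadcast a scalar in `add`; adding a scalar to every element would need its own helper.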

1.4.3 matrix vector multiplication

We map the column of the vector onto each row of the matrix, multiplying each element and summing the result.

[a b; c d; e f] ∗ [x; y] = [a∗x+b∗y; c∗x+d∗y; e∗x+f∗y]

The result is a vector. The number of columns of the matrix must equal the number of rows of the vector.

An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.

exercise

% Initialize matrix A 
A = [1, 2, 3; 4, 5, 6;7, 8, 9] 

% Initialize vector v 
v = [1; 1; 1] 

% Multiply A * v
Av = A * v

example of an application: house price prediction (figure: matrix_vector_example.png)
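A minimal sketch of matrix-vector multiplication, followed by the house price application: each row of the data matrix is [1, size], so multiplying by the parameter vector [θ0; θ1] evaluates the hypothesis for every house at once. The house sizes and the parameters −40 and 0.25 are illustrative assumptions, and `mat_vec` is a hypothetical helper name.

```python
def mat_vec(A, v):
    """Multiply an m x n matrix (list of rows) by an n-dimensional vector."""
    if any(len(row) != len(v) for row in A):
        raise ValueError("columns of the matrix must equal rows of the vector")
    return [sum(a * x for a, x in zip(row, v)) for row in A]


# The exercise above: each result entry is a row of A summed.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
v = [1, 1, 1]
print(mat_vec(A, v))  # [6, 15, 24]

# House price prediction with assumed values: h(x) = -40 + 0.25 * x.
sizes = [2104, 1416, 1534, 852]
X = [[1, size] for size in sizes]  # leading 1 multiplies theta0
theta = [-40, 0.25]
print(mat_vec(X, theta))  # one prediction per house
```

A 4 x 2 data matrix times a 2 x 1 parameter vector gives a 4 x 1 vector of predictions, matching the m x n times n x 1 rule.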

1.4.4 matrix matrix multiplication

% Initialize a 3 by 2 matrix
A = [1, 2; 3, 4; 5, 6]

% Initialize a 2 by 1 matrix
B = [1; 2]

% We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1)
multAB = A*B

% Make sure you understand why we got that result

example of an application: house price prediction (figure: matrix_matrix_example.png)
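The exercise result can be checked with a short Python sketch of matrix-matrix multiplication. The helper name `mat_mul` is an assumption for this example; it treats each column of B as a vector and applies the matrix-vector rule to it.

```python
def mat_mul(A, B):
    """Multiply an m x n matrix by an n x o matrix, giving an m x o matrix."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    columns = len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(n)) for j in range(columns)]
        for i in range(len(A))
    ]


# Same matrices as in the Octave exercise: (3 by 2) * (2 by 1) = (3 by 1).
A = [[1, 2], [3, 4], [5, 6]]
B = [[1], [2]]
print(mat_mul(A, B))  # [[5], [11], [17]]
```

In the house price setting, each column of the second matrix could hold the parameters of a different hypothesis, so one product would evaluate all of them on all houses.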

1.4.5 matrix multiplication properties

  1. not commutative: in general, A∗B ≠ B∗A
  2. associative: (A∗B)∗C = A∗(B∗C)
  3. identity matrix: for a square matrix A, A∗I = I∗A = A, where I has 1s on the diagonal and 0s elsewhere

exercise

% Initialize matrices A and B
A = [1,2;4,5]
B = [1,1;0,2]

% Initialize a 2 by 2 identity matrix
I = eye(2)

% The above notation is the same as I = [1,0;0,1]

% What happens when we multiply I*A ?
IA = I*A

% How about A*I ?
AI = A*I

% Compute A*B
AB = A*B

% Is it equal to B*A?
BA = B*A

% Note that IA = AI but AB != BA
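The three properties can be checked numerically with a small Python sketch. The matrices A and B are the ones from the exercise; the third matrix C is an extra assumption added here so that associativity can be tested, and `mat_mul` is a hypothetical helper.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1, 2], [4, 5]]
B = [[1, 1], [0, 2]]
C = [[2, 0], [1, 3]]  # extra matrix, not in the notes
identity = [[1, 0], [0, 1]]

AB = mat_mul(A, B)
BA = mat_mul(B, A)

print(AB, BA)                                                 # [[1, 5], [4, 14]] [[5, 7], [8, 10]]
print(AB != BA)                                               # not commutative
print(mat_mul(AB, C) == mat_mul(A, mat_mul(B, C)))            # associative
print(mat_mul(identity, A) == A and mat_mul(A, identity) == A)  # identity
```

A single pair with A∗B ≠ B∗A is enough to show that multiplication is not commutative in general, whereas the associativity check only confirms the rule for this one example.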

1.4.6 inverse and transpose

The inverse of a matrix A is denoted \(A^{-1}\). Multiplying a matrix by its inverse results in the identity matrix. A non-square matrix does not have an inverse. Square matrices that don't have an inverse are called singular or degenerate.

inv(A) computes the inverse of a square, non-singular matrix in both Octave and Matlab. pinv(A) computes the pseudo-inverse, which is also defined for non-square and singular matrices; the course uses pinv(A) in Octave.

> A = [1 2 3; 4 5 6];
> pinv(A)        % pseudo-inverse; A is 2 by 3, so inv(A) would raise an error
> transpose(A)   % same as A'

The transposition of a matrix is like rotating the matrix 90° in clockwise direction and then reversing it; in other words, the element in row i, column j of the transpose is \(A_{ji}\). We can compute the transposition of a matrix in Octave and Matlab with the transpose(A) function or A'.
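As an illustration of both operations, the sketch below transposes the 2 by 3 matrix used above and inverts a 2 by 2 matrix with the closed-form determinant formula. The 2 by 2 matrix M and the helper names are assumptions made for this example; the formula is not the general method Octave uses, just the simplest case.

```python
def transpose(A):
    """Swap rows and columns: element (i, j) of the result is A[j][i]."""
    return [list(column) for column in zip(*A)]


def inverse_2x2(M):
    """Invert a 2 x 2 matrix, or raise if it is singular."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]


A = [[1, 2, 3], [4, 5, 6]]
print(transpose(A))  # [[1, 4], [2, 5], [3, 6]]

M = [[2, 1], [1, 1]]  # illustrative matrix with determinant 1
M_inv = inverse_2x2(M)
print(M_inv)  # [[1.0, -1.0], [-1.0, 2.0]]
```

Multiplying M by `M_inv` gives the identity matrix, and a matrix such as [[1, 2], [2, 4]] raises the error because its determinant is zero.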

1.4.7 submit program code

 

>> cd E:\tutorials\ml_andrew\code\ex4_w5
>> submit()
== Submitting solutions | Neural Networks Learning...
Login (email address): <your email address>
Token: <your submission token>

Created: 2021-12-06 Mon 22:14


posted @ 2021-12-06 22:22  kaiming_ai