Week 1 - Machine Learning - Andrew Ng - Coursera
1 week 1
1.1 intro
1.1.1 what is ML?
- definition
- the field of study that gives computers the ability to learn without being explicitly programmed. (Arthur Samuel, 1959)
Tom Mitchell (1998), well-posed learning problem: a computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Example: playing checkers.
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
E | experience |
T | task |
P | performance |
- classifications:
- Supervised learning
- Unsupervised learning.
1.1.2 supervised learning
categories:
- regression problem - continuous output
- classification problem - discrete output
- example: email spam/not spam
In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
examples of regression
- housing price prediction
Example 2:
(a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture
(b) Classification - Given a patient with a tumor, we have to predict whether the tumor is malignant or benign.
1.1.3 unsupervised learning
example: Google News, clustering
applications:
- organize computing clusters
- social network
- market segmentation
- astronomical data analysis
example 2 cocktail party
Non-clustering: The "Cocktail Party Algorithm" allows you to find structure in a chaotic environment (i.e., identifying individual voices and music from a mesh of sounds).
1.1.4 test 1
wrong answer for the following Q
desc: Some of the problems below are best addressed using a supervised learning algorithm, and the others with an unsupervised learning algorithm. Which of the following would you apply supervised learning to? (Select all that apply.) In each case, assume some appropriate dataset is available for your algorithm to learn from.
1.2 Linear Regression with One Variable
1.2.1 model representation
- m
- number of training examples
univariate = one variable
x(i) - the input variable of the i-th training example
y(i) - the output variable of the i-th training example
- h
- hypothesis, h(x)=a+bx
house predicting, regression vs classification
target variable | type |
continuous | regression |
discrete values | classification |
When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.
Linear regression predicts a real-valued output based on an input value. We discuss the application of linear regression to housing price prediction, present the notion of a cost function, and introduce the gradient descent method for learning.
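As a minimal sketch of the hypothesis h(x) = a + bx, the Octave snippet below evaluates it for a few living areas at once. The parameter values and the house sizes are made-up illustrative numbers, not values from the course.

```octave
% Hypothetical parameters for h(x) = a + b*x (illustrative values only)
a = 50;      % intercept
b = 0.1;     % slope (price units per square foot)

% Hypothetical living areas (square feet)
x = [1000; 1500; 2000];

% Evaluate the hypothesis for every input at once
h = a + b * x
```

Because the output h is real-valued, this is a regression setting rather than a classification one.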
1.2.2 cost function–J(θ)
octave code: J = sum((X*theta - y).^2) / (2*m)
an example is the least-squares error (least squares estimation)
- m
- number of training examples
J(θ) | cost function |
goal: minimize the cost function. We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's and the actual output y's.
J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 = \dfrac{1}{2m} \displaystyle \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2
To break it apart, it is \frac{1}{2}\bar{x}, where \bar{x} is the mean of the squares of h_\theta(x_i) - y_i, that is, of the differences between the predicted values and the actual values.
This function is otherwise called the "squared error function" or "mean squared error". The mean is halved (\frac{1}{2}) as a convenience for computing gradient descent, because the derivative of the square term cancels out the \frac{1}{2} factor.
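To make the cost function concrete, here is a small Octave sketch that evaluates J for two choices of parameters. The training set is made up (it follows y = 1 + 2x exactly), and the helper name computeCost is only an illustrative choice.

```octave
% Tiny made-up training set: y = 1 + 2*x exactly
x = [1; 2; 3];
y = [3; 5; 7];
m = length(y);            % number of training examples

% Design matrix with a leading column of ones for the intercept theta0
X = [ones(m, 1), x];

% Cost J(theta) = 1/(2m) * sum((X*theta - y).^2)
computeCost = @(theta) sum((X * theta - y) .^ 2) / (2 * m);

J_perfect = computeCost([1; 2])   % parameters that fit the data exactly
J_zero    = computeCost([0; 0])   % all-zero parameters
```

With parameters that reproduce the data the cost is 0; with all-zero parameters it is (9 + 25 + 49) / 6 = 83/6.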
- contour plot
- a graph that contains many contour lines
- feature:
- a contour line of a two variable function has a constant value at all points of the same line.
1.2.3 gradient descent
goal: minimize the cost function J(θ0, θ1)
an algorithm for automatically finding the values of θ0 and θ1 that minimize the cost function
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
https://www.hackerearth.com/blog/developers/gradient-descent-algorithm-linear-regression/
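The notes do not write out the update rule. In the standard formulation of gradient descent it is usually stated as below, where the learning rate \alpha is a symbol not introduced in these notes and is assumed to be a small positive constant.

```latex
\text{repeat until convergence:} \quad
\theta_j := \theta_j - \alpha \, \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
\qquad \text{for } j = 0, 1 \text{ (both updated simultaneously)}
```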
1.2.4 gradient descent for linear regression
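For the univariate hypothesis h(x) = θ0 + θ1*x and the squared error cost J(θ0, θ1), each gradient descent step moves θ against the gradient (1/m) * X' * (X*θ - y). The Octave code below is a minimal sketch of that loop. The data, the learning rate alpha, and the iteration count num_iters are assumed illustrative values, not taken from the course.

```octave
% Made-up training set generated from y = 1 + 2*x
x = [1; 2; 3; 4];
y = [3; 5; 7; 9];
m = length(y);
X = [ones(m, 1), x];      % leading column of ones for theta0

alpha = 0.1;              % learning rate (assumed value)
num_iters = 5000;         % iteration count (assumed value)
theta = zeros(2, 1);      % start from theta0 = theta1 = 0

for iter = 1:num_iters
  % Gradient of J: (1/m) * X' * (X*theta - y)
  grad = (X' * (X * theta - y)) / m;
  % Simultaneous update of theta0 and theta1
  theta = theta - alpha * grad;
end

theta   % approaches [1; 2] for this data
```

Computing the whole gradient before assigning to theta is what makes the update simultaneous. A learning rate that is too large for the data would make the iteration diverge instead.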
1.3 quiz 1
1.4 linear algebra review
This optional module provides a refresher on linear algebra concepts. Basic understanding of linear algebra is necessary for the rest of the course, especially as we begin to cover models with multiple variables.
1.4.1 matrices and vectors
- A_ij refers to the element in the ith row and jth column of matrix A.
- A vector with 'n' rows is referred to as an 'n'-dimensional vector.
- v_i refers to the element in the ith row of the vector.
- In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
- Matrices are usually denoted by uppercase names while vectors are lowercase.
- "Scalar" means that an object is a single value, not a vector or matrix.
- R refers to the set of scalar real numbers.
- R^n refers to the set of n-dimensional vectors of real numbers.
% The ; denotes we are going back to a new row.
> A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12]
A =
    1    2    3
    4    5    6
    7    8    9
   10   11   12
% Get the dimension of the matrix A where m = rows and n = columns
> [m, n] = size(A)
m = 4
n = 3
% You could also store it this way
> dim_A = size(A)
dim_A =
   4   3
% Let's index into the 2nd row, 3rd column of matrix A
> A_23 = A(2,3)
A_23 = 6
% Initialize a vector
> v = [1; 2; 3]
v =
   1
   2
   3
% Get the dimension of the vector v
> dim_v = size(v)
dim_v =
   3   1
1.4.2 addition and scalar multiplication
+ | addition |
- | subtraction |
note: To add or subtract two matrices, their dimensions must be the same.
[a b; c d] + [w x; y z] = [a+w b+x; c+y d+z]
[a b; c d] - [w x; y z] = [a-w b-x; c-y d-z]
% Initialize matrix A and B
A = [1, 2, 4; 5, 3, 2]
A =
   1   2   4
   5   3   2
B = [1, 3, 4; 1, 1, 1]
% Initialize constant s
s = 2
% See how element-wise addition works
addAB = A + B
% See how element-wise subtraction works
subAB = A - B
% See how scalar multiplication works
multAs = A * s
% Divide A by s
divAs = A / s
% What happens if we have a Matrix + scalar?
addAs = A + s
addAs =
   3   4   6
   7   5   4
1.4.3 matrix vector multiplication
We map the column of the vector onto each row of the matrix, multiplying each element and summing the result.
[a b; c d; e f] * [x; y] = [a*x + b*y; c*x + d*y; e*x + f*y]
The result is a vector. The number of columns of the matrix must equal the number of rows of the vector.
An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.
exercise
% Initialize matrix A
A = [1, 2, 3; 4, 5, 6; 7, 8, 9]
% Initialize vector v
v = [1; 1; 1]
% Multiply A * v
Av = A * v
example of an application: house prediction
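A sketch of how matrix-vector multiplication applies to house price prediction: put one house per row of a matrix, with a leading 1 for the intercept, and multiply by the parameter vector. The house sizes and the parameters below are hypothetical values chosen only for illustration.

```octave
% Hypothetical house sizes (square feet)
sizes = [2104; 1416; 1534; 852];

% Hypothetical hypothesis h(x) = -40 + 0.25*x
theta = [-40; 0.25];

% Each row of X is [1, size]; the column of ones multiplies theta0
X = [ones(length(sizes), 1), sizes];

% (4 x 2) * (2 x 1) = (4 x 1): one predicted price per house
predictions = X * theta
```

A single multiplication replaces a loop over the houses, which is both shorter and typically faster in Octave.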
1.4.4 matrix matrix multiplication
% Initialize a 3 by 2 matrix
A = [1, 2; 3, 4; 5, 6]
% Initialize a 2 by 1 matrix
B = [1; 2]
% We expect a resulting matrix of (3 by 2)*(2 by 1) = (3 by 1)
multAB = A*B
% Make sure you understand why we got that result
example of an application: house prediction
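Matrix-matrix multiplication extends the same idea to several competing hypotheses at once: each column of the parameter matrix holds one hypothesis, and each column of the result holds that hypothesis's predictions. This is a sketch with hypothetical sizes and parameters; the name Theta is an illustrative choice.

```octave
% Hypothetical house sizes (square feet), one house per row of X
sizes = [2104; 1416; 1534; 852];
X = [ones(length(sizes), 1), sizes];

% Each column of Theta holds [theta0; theta1] for one hypothetical hypothesis
Theta = [-40,  200, -150;
         0.25, 0.1,  0.4];

% (4 x 2) * (2 x 3) = (4 x 3): column k holds the predictions of hypothesis k
predictions = X * Theta
```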
1.4.5 matrix multiplication properties
- not commutative: in general, A*B ≠ B*A
- associative: (A*B)*C = A*(B*C)
- identity matrix: I*A = A*I = A, where A is square and I is the identity matrix of the same dimension
exercise
% Initialize random matrices A and B
A = [1,2;4,5]
B = [1,1;0,2]
% Initialize a 2 by 2 identity matrix
I = eye(2)
% The above notation is the same as I = [1,0;0,1]
% What happens when we multiply I*A ?
IA = I*A
% How about A*I ?
AI = A*I
% Compute A*B
AB = A*B
% Is it equal to B*A?
BA = B*A
% Note that IA = AI but AB != BA
1.4.6 inverse and transpose
The inverse of a matrix A is denoted A^-1. Multiplying a matrix by its inverse results in the identity matrix: A * A^-1 = A^-1 * A = I.
> A = [1 2; 3 4];
> inv(A)     % inverse (square, non-singular matrices only)
> pinv(A)    % pseudoinverse
> transpose(A);
A non-square matrix does not have an inverse. Both Octave and MATLAB provide inv(A) for the inverse and pinv(A) for the pseudoinverse; pinv(A) still returns a result when A is singular or non-square, while inv(A) does not. Square matrices that don't have an inverse are called singular or degenerate.
The transpose of a matrix swaps its rows and columns, so that (A^T)_ij = A_ji. It is like rotating the matrix 90° clockwise and then reversing it. We can compute the transpose in Octave or MATLAB with the transpose(A) function or with A'.
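A short Octave sketch of both operations, using small illustrative matrices that are not from the course: a square invertible matrix for the inverse, and a non-square one for the transpose.

```octave
% A square, invertible matrix (illustrative values)
A = [1, 2; 3, 4];

A_inv = inv(A);            % inverse, defined only for square non-singular matrices
I_check = A * A_inv        % approximately the 2 x 2 identity matrix

% A non-square matrix has no inverse, but it does have a transpose
B = [1, 2, 3; 4, 5, 6];    % 2 x 3
B_t = B'                   % 3 x 2, with B_t(i,j) == B(j,i)
```

Because of floating-point rounding, I_check may differ from eye(2) by tiny amounts, so compare with a tolerance rather than with ==.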
1.4.7 submit program code
>> cd E:\tutorials\ml_andrew\code\ex4_w5
>> submit()
== Submitting solutions | Neural Networks Learning...
Login (email address): <your email address>
Token: <your submission token>