Machine Learning 【note_01】
Declaration (2023/06/02):
This is the first note in a series of machine learning notes.
At present, the main learning resource is Andrew Ng's 2022 Machine Learning Specialization from DeepLearning.AI, from which most of the content and some of the pictures in these notes come.
For convenience, the videos are taken from 【啥都会一点的研究生】 on Bilibili, and the course materials are pulled from GitHub.
I would still recommend watching the original videos.
- Course address: https://www.coursera.org/specializations/machine-learning-introduction
- Video address: https://www.bilibili.com/video/BV1Pa411X76s/?p=1
- GitHub resource: https://github.com/kaieye/2022-Machine-Learning-Specialization
Keywords: Linear Regression, Cost Function, Gradient Descent
1 Basic Knowledge
1.1 Machine Learning
- Supervised learning
  - used most in real-world applications
  - most rapid advancement and innovation
- Unsupervised learning
- Reinforcement learning
- Practical advice for applying learning algorithms
1.2 Supervised Learning
- Given pairs of input x and output y, learn the mapping from x to y, then predict y for a new x.
- Learns from being given "right answers".
- Two kinds:
  - Regression:
    - predict a number
    - infinitely many possible outputs
  - Classification:
    - predict categories
    - small number of possible outputs
1.3 Unsupervised Learning
- Find something interesting in unlabeled data.
- The input x comes without labels y; the algorithm finds structure in the data on its own.
- Three kinds:
  - Clustering: group similar data points together. Examples: Google News, DNA microarrays, grouping customers.
  - Anomaly detection: find unusual data points.
  - Dimensionality reduction: compress data using fewer numbers while losing as little information as possible.
2 Linear Regression
2.1 Definition
A supervised learning model: it learns from data that has "right answers".
A regression model: it predicts numbers.
2.2 An example

2.3 Terminology
- Training set: data used to train the model.
- $x$: "input" variable / feature
- $y$: "output" / "target" variable
- $m$: number of training examples
- $(x, y)$: a single training example
- $(x^{(i)}, y^{(i)})$: the $i$-th training example
- Model
  - $f$: the model/function; it takes a new input $x$ and outputs a prediction for $y$
  - $\hat{y}$: the estimate/prediction for $y$
  - $w$ and $b$: the parameters of the model (coefficient/weight and intercept)

The model $f_{w,b}(x) = wx + b$ is linear regression with one variable, also called univariate linear regression.
3 Cost Function
3.1 Definition
Cost function: the squared error cost function
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$$
By convention, the cost is divided by $2m$ rather than $m$; the extra factor of $\frac{1}{2}$ cancels when taking derivatives, which makes the math a little cleaner.
The purpose of defining the cost function: minimize $J(w,b)$ over $w$ and $b$.
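As a minimal sketch (the helper name compute_cost is mine, not from the course text), the squared error cost can be computed in NumPy like this:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) for univariate linear regression."""
    m = x.shape[0]                      # number of training examples
    f_wb = w * x + b                    # predictions for all m examples
    return np.sum((f_wb - y) ** 2) / (2 * m)

# tiny check: points lying exactly on y = 2x give zero cost at w=2, b=0
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([2.0, 4.0, 6.0])
print(compute_cost(x_train, y_train, w=2.0, b=0.0))  # 0.0
```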
3.2 Visualization
To build intuition for the cost function, the following visualizations help:
3.2.1 J(w)
Set $b$ to zero, and see how $J(w)$ is plotted as a function of $w$ alone.

3.2.2 J(w,b)
See how $J(w,b)$ is visualized, e.g. as a 3-D surface plot and as a contour plot.



4 Gradient Descent
4.1 Concept
Goal: minimize $J(w,b)$.
Outline:
- start with some $w$ and $b$
- keep changing $w$, $b$ to reduce $J(w,b)$
- until we settle at or near a minimum
There may be more than one possible minimum.
Gradient descent:

Local minima:

4.2 Implementation
Gradient descent algorithm:
repeat until convergence {
    $w = w - \alpha \dfrac{\partial}{\partial w} J(w,b)$
    $b = b - \alpha \dfrac{\partial}{\partial b} J(w,b)$
}
Here, $\alpha$ is called the learning rate, and $\dfrac{\partial}{\partial w} J(w,b)$ is the partial derivative term.
Update $w$ and $b$ simultaneously: compute both update terms before assigning either, as in the sketch below.
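A minimal sketch of the simultaneous update (the function and variable names are mine): both derivative terms are computed from the old $w$ and $b$ before either parameter is overwritten.

```python
import numpy as np

def gradient_step(x, y, w, b, alpha):
    """One gradient descent step with a simultaneous update of w and b."""
    m = x.shape[0]
    err = (w * x + b) - y          # prediction errors on all m examples
    dj_dw = np.dot(err, x) / m     # partial derivative of J w.r.t. w
    dj_db = np.sum(err) / m        # partial derivative of J w.r.t. b
    tmp_w = w - alpha * dj_dw      # compute both new values first...
    tmp_b = b - alpha * dj_db
    return tmp_w, tmp_b            # ...so b's update never sees the new w
```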

4.3 Understanding
As the gradient descent algorithm runs, $J(w,b)$ decreases like below:

If $\alpha$ is too small, gradient descent may be slow.
If $\alpha$ is too large, gradient descent may:
- overshoot and never reach the minimum
- fail to converge, or even diverge

Gradient descent can reach a local minimum with a fixed learning rate, because the derivative term shrinks as we approach the minimum, so the steps automatically get smaller:

Once it has already reached a minimum, the derivative is zero, so gradient descent leaves the parameters unchanged:

4.4 For linear regression
Linear regression model: $f_{w,b}(x) = wx + b$
Cost function: $J(w,b) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)^2$
Gradient descent algorithm:
repeat until convergence {
    $w = w - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}$
    $b = b - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$
}
Here, the partial derivative terms are
$$\frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right) x^{(i)}, \qquad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f_{w,b}(x^{(i)}) - y^{(i)} \right)$$
The squared error cost function of linear regression is convex (bowl-shaped), so it always has exactly one global minimum:
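A sketch of the full loop for univariate linear regression (the function name and toy data are mine, chosen for illustration):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Fit univariate linear regression by batch gradient descent."""
    m = x.shape[0]
    w, b = 0.0, 0.0                        # start from zero
    for _ in range(num_iters):
        err = (w * x + b) - y              # errors on all m examples
        w_new = w - alpha * np.dot(err, x) / m
        b_new = b - alpha * np.sum(err) / m
        w, b = w_new, b_new                # simultaneous update
    return w, b

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([3.0, 5.0, 7.0, 9.0])   # points on y = 2x + 1
w, b = gradient_descent(x_train, y_train, alpha=0.1, num_iters=2000)
print(w, b)  # close to 2.0 and 1.0
```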

4.5 Running
See how gradient descent runs in an example:

"Batch" gradient descent
- "Batch": each step of gradient descent uses all the training examples (instead of just a subset of the training data).
5 Multiple Linear Regression
5.1 Concept

Multiple features (variables):
- $x_j$: the $j$-th feature
- $n$: the number of features
- $\vec{x}^{(i)}$: the features of the $i$-th training example
- $x_j^{(i)}$: the value of feature $j$ in the $i$-th training example

Model: $f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$
With the dot product, it can be written more compactly as $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$.
5.2 Vectorization
In Python, we use NumPy to compute with vectorized parameters and features:
- runs faster
- shorter code

Comparing runs with and without vectorization:
- vectorized code runs in parallel on modern hardware, which is much more efficient for large-scale datasets.

The same applies to gradient descent:

Try it in Python code (the imports and the loop-based my_dot helper from the course lab are added here so the snippet runs on its own):

```python
import time
import numpy as np

def my_dot(a, b):
    """Dot product computed with an explicit Python loop (no vectorization)."""
    result = 0.0
    for i in range(a.shape[0]):
        result += a[i] * b[i]
    return result

np.random.seed(1)
a = np.random.rand(10000000)  # very large arrays
b = np.random.rand(10000000)

tic = time.time()  # capture start time
c = np.dot(a, b)   # vectorized dot product
toc = time.time()  # capture end time
print(f"np.dot(a, b) = {c:.4f}")
print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms")

tic = time.time()  # capture start time
c = my_dot(a, b)   # loop version
toc = time.time()  # capture end time
print(f"my_dot(a, b) = {c:.4f}")
print(f"Loop version duration: {1000*(toc-tic):.4f} ms")

del a; del b  # remove these big arrays from memory
```

Output:

```
np.dot(a, b) = 2501072.5817
Vectorized version duration: 10.9415 ms
my_dot(a, b) = 2501072.5817
Loop version duration: 2575.1283 ms
```
5.3 Gradient Descent
Model: $f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$
Cost function: $J(\vec{w},b) = \dfrac{1}{2m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)^2$

The partial derivative terms give one update per parameter:
repeat until convergence {
    $w_j = w_j - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$   (for $j = 1, \ldots, n$)
    $b = b - \alpha \dfrac{1}{m} \sum_{i=1}^{m} \left( f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)} \right)$
}
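A vectorized sketch of one such step (the function name and shapes are mine): storing the examples as the rows of a matrix X lets NumPy compute every $\partial J / \partial w_j$ at once.

```python
import numpy as np

def gradient_step_multi(X, y, w, b, alpha):
    """One batch gradient descent step for multiple linear regression.
    X has shape (m, n): m examples, n features; w has shape (n,)."""
    m = X.shape[0]
    err = X @ w + b - y           # shape (m,): f(x^(i)) - y^(i)
    dj_dw = X.T @ err / m         # shape (n,): dJ/dw_j for all j at once
    dj_db = np.sum(err) / m
    return w - alpha * dj_dw, b - alpha * dj_db
```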

There is an alternative to gradient descent, called the Normal Equation:
- Works only for linear regression.
- Solves for $w$, $b$ in closed form, without iterations.
- Disadvantages:
  - Does not generalize to other learning algorithms (such as logistic regression or other subsequent machine learning methods).
  - Slow when the number of features is large (> 10,000).
- It may be used inside ML libraries that implement linear regression; however, gradient descent is the recommended method for finding the parameters $w$, $b$.
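For reference, a sketch of the normal equation $\theta = (X^\top X)^{-1} X^\top y$ in NumPy (the toy numbers are made up for illustration; a column of ones absorbs the intercept $b$ into the weight vector):

```python
import numpy as np

# toy data: three houses, one feature (size), plus a column of ones for b
X = np.array([[1.0, 2104.0],
              [1.0, 1416.0],
              [1.0,  852.0]])
y = np.array([400.0, 232.0, 178.0])

# solve (X^T X) theta = X^T y directly, without an explicit matrix inverse
theta = np.linalg.solve(X.T @ X, X.T @ y)
b, w = theta[0], theta[1:]
print(b, w)
```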
5.4 Important things
5.4.1 Feature Scaling
5.4.1.1 Why
An example:

We can see that the scales of the different features vary greatly, which makes the contours of the cost function long and narrow and gradient descent slow to converge:

So we need to rescale the features:

5.4.1.2 How
Simply divide by the maximum value: $x_j \leftarrow \dfrac{x_j}{\max(x_j)}$

Mean normalization: $x_j \leftarrow \dfrac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$

Z-score normalization: $x_j \leftarrow \dfrac{x_j - \mu_j}{\sigma_j}$, where $\mu_j$ is the mean and $\sigma_j$ the standard deviation of feature $j$

Rescale a feature whenever its values are too large or too small:
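A sketch of all three scaling methods in NumPy (the variable names and toy values are mine):

```python
import numpy as np

x = np.array([2104.0, 1416.0, 852.0, 1534.0])  # e.g. house sizes

x_by_max = x / np.max(x)                               # divide by the maximum
x_mean   = (x - np.mean(x)) / (np.max(x) - np.min(x))  # mean normalization
x_zscore = (x - np.mean(x)) / np.std(x)                # z-score normalization
```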

5.4.2 Convergence check
Check the learning curve (cost $J$ versus the number of iterations):
- The cost should decrease after every iteration.
- If the cost curve rises, the chosen learning rate is too big.
- In the example, the cost has likely converged by 400 iterations.
- It is difficult to know in advance how many iterations are required.

Alternatively, we can use an automatic convergence test:
- Pick a small threshold $\varepsilon$, e.g. $10^{-3}$.
- If the cost decreases by less than $\varepsilon$ in one iteration, declare convergence.
- It is hard to choose a proper $\varepsilon$, so inspecting the learning curve is usually the better strategy.
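A sketch of the automatic test (the helper name is mine; the default threshold follows the $10^{-3}$ example above):

```python
def has_converged(cost_history, epsilon=1e-3):
    """Automatic convergence test: True once J drops by less than epsilon."""
    if len(cost_history) < 2:
        return False
    return (cost_history[-2] - cost_history[-1]) < epsilon
```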
5.4.3 Choosing the Learning Rate
Make sure that the cost decreases after every iteration.

A practical strategy (sketched in code below):
- First choose a very small $\alpha$ and check that the cost decreases on every iteration.
- Then gradually increase $\alpha$ (e.g. roughly 3x each time) and watch how the cost curve changes.
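A self-contained sketch of that sweep (the toy data and step counts are mine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])       # toy data on the line y = 2x + 1
m = x.shape[0]

# try learning rates roughly 3x apart and compare the cost after a few steps
for alpha in [0.001, 0.003, 0.01, 0.03, 0.1]:
    w, b = 0.0, 0.0
    for _ in range(10):
        err = (w * x + b) - y
        # simultaneous update: both right-hand sides use the old w and b
        w, b = w - alpha * np.dot(err, x) / m, b - alpha * np.sum(err) / m
    cost = np.sum(((w * x + b) - y) ** 2) / (2 * m)
    print(f"alpha = {alpha}: cost after 10 iterations = {cost:.4f}")
```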

5.4.4 Feature Engineering
Feature engineering: using intuition to design new features by transforming or combining the original features, as in the sketch below.
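Following the course's housing illustration, lot frontage and lot depth can be combined into an area feature (the variable names and values here are mine):

```python
import numpy as np

# original features: lot frontage and lot depth of each house
frontage = np.array([50.0, 40.0, 30.0])
depth    = np.array([100.0, 80.0, 120.0])

# engineered feature: land area, often more predictive of price
area = frontage * depth
X = np.column_stack([frontage, depth, area])   # shape (3, 3)
```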

6 Polynomial Regression

Here, feature scaling is quite important, because polynomial features such as $x^2$ and $x^3$ take on very different ranges of values.
How should we choose among different features and different models?
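A sketch of building polynomial features and why scaling matters (the names are mine): with raw values $1 \ldots 20$, $x^3$ reaches 8000 while $x$ stays at most 20, so z-score normalization brings the columns back to comparable ranges.

```python
import numpy as np

x = np.arange(1.0, 21.0)                    # one raw feature, values 1..20

# polynomial features x, x^2, x^3 span wildly different ranges
X = np.column_stack([x, x**2, x**3])

# z-score scaling brings every column to a comparable range
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
```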
