Study Notes for 人工智能:原理与技术 (Artificial Intelligence: Principles and Techniques)
Lecture 2
- Supervised learning: regression, classification, ...
- Unsupervised learning: clustering, dimensionality reduction, ...
- The canonical machine learning problem: given a set of training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ and a loss function $\ell$, find the parameters $\theta$ that minimize the sum of losses $\sum_{i=1}^m \ell(h_\theta(x^{(i)}), y^{(i)})$.
- Linear least squares problem: linear hypothesis function $h_\theta(x) = \theta^T x$, and squared loss $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$.
Solutions to the linear least squares problem:
- Gradient descent: repeat $\theta \leftarrow \theta - \alpha \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})\, x^{(i)}$.
- Analytical method: let $X$ be the matrix whose rows are the $x^{(i)T}$ and $y$ the vector of targets; then the objective is $\|X\theta - y\|_2^2$ with gradient $2X^T(X\theta - y)$. Solving the equation $X^T X \theta = X^T y$, we get $\theta = (X^T X)^{-1} X^T y$.
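A minimal NumPy sketch of both solutions, assuming the rows of `X` are the feature vectors $x^{(i)}$ and `y` holds the targets (the synthetic data and step size below are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Analytical method: solve the normal equations X^T X theta = X^T y.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared loss.
theta, alpha = np.zeros(3), 0.1
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= alpha * grad

print(theta_exact)   # close to theta_true
print(theta)         # close to theta_exact
```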
- Linear regression: the hypothesis function used to fit the training data is linear in the parameters, given a particular choice of input features.
- Linear classification (for two classes): the hypothesis function is linear in the parameters given a particular choice of input features, and the output class is predicted by the sign of the hypothesis function ($y \in \{-1, +1\}$, $\hat{y} = \mathrm{sign}(h_\theta(x))$).
- Loss functions for classification:
- 0/1 loss $\ell_{0/1}(h_\theta(x), y) = \mathbf{1}\{y \cdot h_\theta(x) \le 0\}$ (NP-hard to minimize directly).
- Surrogate losses such as the hinge loss $\max\{1 - y \cdot h_\theta(x), 0\}$ and the logistic loss $\log(1 + e^{-y \cdot h_\theta(x)})$.
Typically there is no closed-form solution, but these surrogates are solvable by gradient descent.
- Support vector machine: solves the canonical machine learning optimization problem using the hinge loss and a linear hypothesis, i.e. $\min_\theta \sum_{i=1}^m \max\{1 - y^{(i)} \cdot \theta^T x^{(i)}, 0\}$.
- Logistic regression: solves the canonical machine learning optimization problem using the logistic loss and a linear hypothesis, i.e. $\min_\theta \sum_{i=1}^m \log\left(1 + e^{-y^{(i)} \cdot \theta^T x^{(i)}}\right)$.
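A small gradient-descent sketch for the logistic loss above, with labels in $\{-1, +1\}$; the data, step size, and iteration count are made up for illustration:

```python
import numpy as np

def logistic_loss_grad(theta, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * theta^T x_i)), labels y_i in {-1, +1}."""
    z = y * (X @ theta)
    s = 0.5 * (1.0 - np.tanh(z / 2.0))   # equals sigmoid(-z), numerically stable
    return -(X.T @ (y * s))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1.0, -1.0)   # made-up linearly separable labels

theta, alpha = np.zeros(2), 0.01
for _ in range(2000):
    theta -= alpha * logistic_loss_grad(theta, X, y)

print("training accuracy:", np.mean(np.sign(X @ theta) == y))
```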
- Multiclass classification: build $k$ different classifiers $h_{\theta_1}, \dots, h_{\theta_k}$ and output the prediction $\hat{y} = \arg\max_i h_{\theta_i}(x)$. The loss function is defined as $\ell(h_\theta(x), y) = \log\left(\sum_{i=1}^k e^{h_{\theta_i}(x)}\right) - h_{\theta_y}(x)$ (called the softmax loss or cross-entropy loss).
- Overfitting: as the model becomes more complex, the training loss always decreases, while the generalization loss decreases up to a point and then starts to increase.
- Cross-validation: divide the data set into a training set and a holdout, use the training set to determine the parameters, and use the holdout/validation set to determine the hyperparameters (degree of the polynomial, $\lambda$ in regularization, ...).
- Regularization: add a term $\lambda \|\theta\|_2^2$ to the loss function, i.e. minimize $\sum_{i=1}^m \ell(h_\theta(x^{(i)}), y^{(i)}) + \lambda \|\theta\|_2^2$, where $\lambda$ is a hyperparameter (when the model becomes complex, the parameters tend to become large in order to overfit the training data).
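For the least squares case, the regularized objective still has a closed-form solution (ridge regression), and $\lambda$ can be chosen on the holdout set as described above. A sketch, with an illustrative split and candidate $\lambda$ values:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=60)

X_tr, y_tr = X[:40], y[:40]              # training set: fit the parameters
X_ho, y_ho = X[40:], y[40:]              # holdout set: pick the hyperparameter

def ridge_fit(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2: solve (X^T X + lam I) theta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

scores = {lam: np.mean((X_ho @ ridge_fit(X_tr, y_tr, lam) - y_ho) ** 2)
          for lam in [0.0, 0.01, 0.1, 1.0, 10.0]}
print(min(scores, key=scores.get), scores)   # lambda with the lowest holdout MSE
```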
- Even though we used a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set. Use a separate test set to evaluate the performance. The best solutions are: evaluate your system “in the wild” as often as possible, recollect data if you suspect overfitting to the present data, ...
Lecture 3
- Neural network: a composition of non-linear functions.
- Elements of a neural network: weights $W_i$, biases $b_i$, activation function $f$ (each layer computes $z_{i+1} = f(W_i z_i + b_i)$).
- Common activation functions:
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
- Rectified linear unit (ReLU): $\mathrm{ReLU}(x) = \max\{x, 0\}$.
- Hyperbolic tangent: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- Stochastic Gradient Descent (SGD): adjust the parameters based upon just one random sample (or a small random collection of samples, called a batch), i.e. $\theta \leftarrow \theta - \alpha \nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$ for some random $i$.
- Backpropagation algorithm: use the chain rule to recursively calculate the partial derivative of the loss function with respect to every parameter, from the last layer to the first.
- Momentum: $v \leftarrow \beta v + \nabla_\theta \ell(\theta)$ and $\theta \leftarrow \theta - \alpha v$. Usually $\beta \approx 0.9$.
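A sketch of mini-batch SGD with momentum on the least squares objective from Lecture 2, assuming the standard form $v \leftarrow \beta v + g$, $\theta \leftarrow \theta - \alpha v$ (the data and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=500)

theta, v = np.zeros(5), np.zeros(5)          # parameters and momentum buffer
alpha, beta, batch = 0.05, 0.9, 32

for step in range(2000):
    idx = rng.integers(0, len(y), size=batch)            # random mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch  # mini-batch gradient
    v = beta * v + grad                                  # v <- beta * v + g
    theta -= alpha * v                                   # theta <- theta - alpha * v

print(np.max(np.abs(theta - theta_true)))    # small, up to mini-batch noise
```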
Lecture 4
- Problems with fully connected networks: they need a very large number of parameters and are very likely to overfit the data; a generic deep network also does not capture the “natural” invariances we expect in images (translation, scale).
- Convolutional Neural Networks have 4 types of layers:
- Convolution: require that activations between layers only occur in a “local” manner; require that all activations share the same weights.
- Non-linearity: Rectified Linear Unit (ReLU). Advantages: 1. fast to compute; 2. no cancellation problem; 3. a sparser activation volume; 4. helps with the vanishing gradient problem.
- Pooling (or downsampling), sketched in code after this list:
- Pick a window size.
- Pick a stride.
- Walk the window across the image.
- For each window, take the maximum value.
- Fully Connected Layer
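A sketch of the max-pooling procedure from the list above for a single 2D activation map, assuming a square window and no padding (pure NumPy, just to make the walk-the-window loop concrete):

```python
import numpy as np

def max_pool2d(x, window=2, stride=2):
    """Slide a window over a 2D array and take the max in each window."""
    h, w = x.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + window, c:c + window].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))        # [[ 5.  7.] [13. 15.]]
```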
- Recurrent Neural Network: a type of neural network with memory; the outputs of the hidden layer are stored in the memory, and the memory can be considered as another input. It is used to deal with sequential data.
- Problems with Vanilla RNNs:
- Exploding gradients → gradient clipping.
- Vanishing gradients → change the RNN architecture (e.g. LSTM).
Lecture 5
- A search problem consists of:
- State space
- Start state
- Possible actions $A(s)$
- Successor function $\mathrm{Succ}(s, a)$
- Action cost $c(s, a)$
- Goal test $\mathrm{IsGoal}(s)$.
A solution is a sequence of actions (a plan) which transforms the start state to a goal state.
- Search heuristic: a function $h(s)$ that estimates how close a state is to a goal.
- Uninformed search (e.g. uniform cost search, which expands nodes in order of the cost so far $g(s)$): cannot tell whether one non-goal state is more promising than another. Informed search: can know whether one non-goal state is more promising than another (via a heuristic).
- Admissible heuristic: $0 \le h(s) \le h^*(s)$, where $h^*(s)$ is the true cost from $s$ to the nearest goal. If $h$ is admissible, A* is optimal (it finds a min-cost solution) and also optimally efficient (it expands the minimum number of nodes, which means that if you expand fewer nodes than A* expands, you can't ensure your solution is optimal).
- A* tree search: expand the available nodes in increasing order of $f(s) = g(s) + h(s)$, where $g(s)$ is the least cost from the starting state to $s$. The algorithm is optimal if the heuristic is admissible. But the time complexity of A* tree search can be exponential.
- Consistent heuristic: $h(s) \le c(s, a) + h(\mathrm{Succ}(s, a))$ for every state $s$ and action $a$. If $h$ is consistent, then the first time we expand some state, we have already obtained the shortest path to that state. So every state is expanded at most once.
- A* graph search: similar to A* tree search, but guarantees that each node is expanded at most once. The algorithm is optimal if the heuristic is consistent.
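A sketch of A* graph search with an explicit closed set, assuming the graph is given as an adjacency dict of (neighbor, cost) pairs and `h` is a consistent heuristic; the toy graph and heuristic values are made up:

```python
import heapq

def a_star(graph, start, goal, h):
    """A* graph search: expand each state at most once, in order of f = g + h."""
    frontier = [(h(start), 0, start, [start])]      # (f, g, state, path)
    closed = set()
    while frontier:
        f, g, s, path = heapq.heappop(frontier)
        if s in closed:
            continue
        closed.add(s)
        if s == goal:
            return g, path
        for nxt, cost in graph.get(s, []):
            if nxt not in closed:
                g2 = g + cost
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Toy graph; h is admissible and consistent for these costs.
graph = {
    'A': [('B', 1), ('C', 4)],
    'B': [('C', 2), ('D', 5)],
    'C': [('D', 1)],
}
h = {'A': 3, 'B': 2, 'C': 1, 'D': 0}.get
print(a_star(graph, 'A', 'D', h))    # (4, ['A', 'B', 'C', 'D'])
```

Because the heuristic here is consistent, each state is popped from the frontier at most once, matching the guarantee stated above.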
Lecture 6
- A Markov Decision Problem (MDP) consists of: a state space $S$, a start state, actions $A$, a transition function $P(s' \mid s, a)$, a reward function $R(s)$ (sometimes $R(s, a)$ or $R(s, a, s')$), and (maybe) a terminal state.
- A solution to an MDP is a policy:
- Non-stationary policy: a function from states and times to actions.
- Stationary policy: a mapping from states to actions.
- Both non-stationary and stationary policies satisfy the following properties:
- Full observability of the state
- History-independence
- Deterministic action choice
- Infinite horizon discounted value: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s, \pi\right]$. It follows the Bellman equation: $V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')$.
- Value iteration: repeatedly apply the Bellman backup operator $B$: $[BV](s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s')$. The operator $B$ is a contraction: $\|BV - BV'\|_\infty \le \gamma \|V - V'\|_\infty$ for any $V, V'$. So there is a unique fixed point $V^*$, with the property $BV^* = V^*$.
So $V_k = B^k V_0 \to V^*$ if we start from any $V_0$. The greedy policy $\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^*(s')$ is optimal: by induction we can prove that $V^{\pi^*} = V^* \ge V^\pi$ for every policy $\pi$.
- The convergence rate of value iteration: assume rewards are in $[0, R_{\max}]$; then $V^*(s) \le \frac{R_{\max}}{1 - \gamma}$, so we can prove that $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty \le \gamma^k \frac{R_{\max}}{1 - \gamma}$ by induction. In other words, we have linear convergence to the optimal value function.
- Stopping condition: since $\|V_{k+1} - V^*\|_\infty \le \frac{\gamma}{1 - \gamma} \|V_{k+1} - V_k\|_\infty$, we can continue the iteration until $\|V_{k+1} - V_k\|_\infty \le \epsilon$ for some small constant $\epsilon$.
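A sketch of value iteration with this stopping condition on a tiny made-up MDP, using the $R(s)$ reward convention from above (the MDP, $\gamma$, and $\epsilon$ are illustrative):

```python
# A tiny 3-state MDP (made up): action 0 is "safe", action 1 is "risky".
states, actions, gamma = [0, 1, 2], [0, 1], 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {
    0: {0: [(1.0, 1)], 1: [(0.5, 2), (0.5, 0)]},
    1: {0: [(1.0, 2)], 1: [(1.0, 0)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},       # state 2 is absorbing
}

V, eps = {s: 0.0 for s in states}, 1e-6
while True:
    # Bellman backup: [BV](s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
    V_new = {
        s: R[s] + gamma * max(sum(p * V[s2] for p, s2 in P[s][a]) for a in actions)
        for s in states
    }
    done = max(abs(V_new[s] - V[s]) for s in states) <= eps
    V = V_new
    if done:
        break

policy = {s: max(actions, key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
          for s in states}
print(V, policy)
```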
- Policy iteration: alternate policy evaluation of $V^{\pi_k}$ and policy improvement $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^{\pi_k}(s')$; $V^{\pi_k}$ can be calculated by solving a linear system.
- Modified policy iteration: use the Bellman update for the fixed policy, repeated a bounded number of times, for policy evaluation (instead of solving a linear system).
Lecture 7
- Reinforcement Learning: we don't know the transition function and reward function of the MDP.
- Model-based approach to RL: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model.
- Passive learning: the agent has a fixed policy $\pi$ and tries to learn the utilities of states by observing the world go by.
- Approach 1 (model-free): Monte-Carlo direct estimation: directly estimate $V^\pi(s)$ as the average total reward of episodes containing $s$ (reward accumulated from $s$ onward).
- Approach 2 (model-based): Adaptive Dynamic Programming (ADP): estimate $P(s' \mid s, a)$ and $R(s)$ by sampling, then use the estimated model to compute the utility of the policy.
- Approach 3 (model-free): Temporal Difference Learning (TD). After we take an action from $s$ to $s'$, update $V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(R(s) + \gamma V^\pi(s') - V^\pi(s)\right)$. Intuition: move the estimate toward the sampled target $R(s) + \gamma V^\pi(s')$. If we use a decreasing learning rate (e.g. $\alpha_k = 1/k$) the estimate will converge to the true value.
- Active learning: the agent does not have a fixed policy.
- ADP-based RL: start with an initial model, solve for the optimal policy given the current model (using value or policy iteration), take an action according to an exploration/exploitation policy, and update the estimated model based on an observed transition.
- TD-based RL: start with an initial value function, take an action from an exploration/exploitation policy giving a new state $s'$ (the policy should converge to the greedy policy), update the estimated model, and perform the TD update $V(s) \leftarrow V(s) + \alpha\left(R(s) + \gamma V(s') - V(s)\right)$.
- Q-learning ($Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then following the optimal policy thereafter): start with an initial Q-function, take an action from an exploration/exploitation policy giving a new state $s'$, and perform the TD update $Q(s, a) \leftarrow Q(s, a) + \alpha\left(R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)$. A sketch follows this list.
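A tabular Q-learning sketch with an $\epsilon$-greedy exploration policy on a small made-up chain environment (the environment and hyperparameters are illustrative, not from the lecture):

```python
import random

# Made-up 5-state chain: actions 0 = left, 1 = right; reward 1 at the right end.
N_STATES, ACTIONS, GAMMA, ALPHA, EPS = 5, [0, 1], 0.9, 0.1, 0.1

def step(s, a):
    """Return (reward, next_state); reaching the right end terminates the episode."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    if s2 == N_STATES - 1:
        return 1.0, None
    return 0.0, s2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(2000):
    s = 0
    while s is not None:
        # epsilon-greedy exploration/exploitation policy
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        r, s2 = step(s, a)
        target = r if s2 is None else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES)})
```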
- Exploration policy: we would like an exploration policy that is greedy in the limit of infinite exploration (GLIE).
- GLIE Policy 1 ($\epsilon$-greedy): act randomly with probability $\epsilon_k$ and greedily with probability $1 - \epsilon_k$, with $\epsilon_k \to 0$ as $k \to \infty$ (e.g. $\epsilon_k = 1/k$).
- GLIE Policy 2: Boltzmann Exploration. $\Pr(a \mid s) = \frac{e^{Q(s, a)/T}}{\sum_{a'} e^{Q(s, a')/T}}$, where $T$ is the temperature. Large $T$ means that each action has about the same probability; small $T$ leads to more greedy behavior. Typically start with a large $T$ and decrease it with time.
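A small sketch of Boltzmann action selection for one state's Q-values (the Q-values and temperatures are made up); it also shows how the temperature controls the spread of the distribution:

```python
import math
import random

def boltzmann_action(q_values, T):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    m = max(q_values)                                  # subtract max for stability
    weights = [math.exp((q - m) / T) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

q = [1.0, 1.2, 0.5]
for T in [10.0, 1.0, 0.01]:
    picks = [boltzmann_action(q, T) for _ in range(10000)]
    print(T, [picks.count(a) / len(picks) for a in range(len(q))])
# Large T: roughly uniform; small T: almost always the greedy action (index 1).
```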
Lecture 8
- Linear function approximation: define a set of state features $f_1(s), \dots, f_n(s)$, and let $\hat{V}_\theta(s) = \theta^T f(s)$. We can learn the parameters $\theta$ by gradient descent. Since the loss function is written in the form $\frac{1}{2}\left(\hat{V}_\theta(s) - V^\pi(s)\right)^2$, we can use the TD target $R(s) + \gamma \hat{V}_\theta(s')$ in place of $V^\pi(s)$ and perform $\theta \leftarrow \theta + \alpha\left(R(s) + \gamma \hat{V}_\theta(s') - \hat{V}_\theta(s)\right) f(s)$ to update $\theta$.
- Q-function approximation: let $\hat{Q}_\theta(s, a) = \theta^T f(s, a)$. Similar to Q-learning, perform $\theta \leftarrow \theta + \alpha\left(R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a)\right) f(s, a)$ after taking an action chosen by a GLIE policy.
- Deep Q Learning (DQN): use a neural network to represent the Q-function and learn the parameters of the neural network. But we still need some methods to improve stability (experience replay, a fixed target network, reward range clipping).
- Double Q learning: train two action-value functions, $Q_1$ and $Q_2$, and do Q-learning on both, but
- never on the same time steps;
- pick $Q_1$ or $Q_2$ at random to be updated on each step;
- when we update $Q_1$, use $Q_2$ for the value of the next state, i.e., $Q_1(s, a) \leftarrow Q_1(s, a) + \alpha\left(R(s) + \gamma Q_2\!\left(s', \arg\max_{a'} Q_1(s', a')\right) - Q_1(s, a)\right)$ (and symmetrically when updating $Q_2$).
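A sketch of a single double Q-learning update step as described above: pick which table to update at random and use the other table to evaluate the argmax action (the tables and the transition tuple are placeholders; terminal transitions, where the target is just the reward, are omitted):

```python
import random

ACTIONS, GAMMA, ALPHA = [0, 1], 0.9, 0.1
Q1 = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
Q2 = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def double_q_update(s, a, r, s2):
    """One double Q-learning update for the observed transition (s, a, r, s2)."""
    if random.random() < 0.5:
        A, B = Q1, Q2            # update Q1, evaluate with Q2
    else:
        A, B = Q2, Q1            # update Q2, evaluate with Q1
    a_star = max(ACTIONS, key=lambda b: A[(s2, b)])   # argmax under the updated table
    target = r + GAMMA * B[(s2, a_star)]              # value taken from the other table
    A[(s, a)] += ALPHA * (target - A[(s, a)])

double_q_update(s=0, a=1, r=0.5, s2=1)   # example transition (made up)
print(Q1, Q2)
```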
- Maximization bias of Q-learning: after Q-learning, we only have an estimate $\hat{Q}$ of the real $Q$. If we simply see $\hat{Q}(s, a)$ as a random variable satisfying $\mathbb{E}[\hat{Q}(s, a)] = Q(s, a)$, then we will have $\mathbb{E}\left[\max_a \hat{Q}(s, a)\right] \ge \max_a \mathbb{E}[\hat{Q}(s, a)] = \max_a Q(s, a)$, that is, we have an overestimation of $\max_a Q(s, a)$.
- Double Q learning solves the maximization bias: if $\hat{Q}_1, \hat{Q}_2$ are two independent estimates of $Q$ such that $\mathbb{E}[\hat{Q}_i(s, a)] = Q(s, a)$, and we estimate the maximum by $\hat{Q}_2\!\left(s, \arg\max_a \hat{Q}_1(s, a)\right)$, then $\mathbb{E}\left[\hat{Q}_2\!\left(s, \arg\max_a \hat{Q}_1(s, a)\right)\right] \le \max_a Q(s, a)$.
Lecture 9
- Policy learning: directly learn the action distribution $\pi_\theta(a \mid s)$ for each state $s$. View the expected return $J(\theta)$ as a function of $\theta$ and do gradient ascent on it (equivalently, gradient descent on $-J(\theta)$).
- Actually we can calculate the policy gradient analytically:
$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \left[\nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi_\theta}(s, a)\right].$
By continuing to expand $\nabla_\theta Q^{\pi_\theta}$ for infinitely many layers, we may obtain
$\nabla_\theta J(\theta) \propto \sum_s d^{\pi_\theta}(s) \sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right],$
where $\Pr(s_0 \to s, k, \pi_\theta)$ is the probability that the agent reaches $s$ from $s_0$ in $k$ steps, and $d^{\pi_\theta}(s) \propto \sum_{k=0}^{\infty} \gamma^k \Pr(s_0 \to s, k, \pi_\theta)$ is the stationary (discounted state-visitation) distribution.
According to this, we may sample some random trajectories under $\pi_\theta$ and estimate the gradient by averaging $Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ over the sampled steps (see the sketch below).
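A sketch of this Monte-Carlo policy-gradient idea (REINFORCE-style) for a softmax policy over two actions in a one-step bandit, using the sampled return in place of $Q^{\pi_\theta}$; the reward setup and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
true_reward = np.array([1.0, 2.0])             # made-up expected reward per action

theta, alpha = np.zeros(2), 0.05               # softmax policy parameters, step size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(3000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                    # sample an action from the policy
    G = true_reward[a] + rng.normal(0, 0.5)    # sampled (noisy) return
    grad_log_pi = -pi                          # grad of log pi(a): one-hot(a) - pi
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi           # gradient ascent on expected return

print(softmax(theta))   # should put most probability on the better action (index 1)
```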
- Q Actor-Critic: in order to decrease the variance of sampling, we use a critic (a new parameter vector $w$) to estimate the Q-function $Q_w(s, a)$; in every step, we update $\theta$ by a policy-gradient step using $Q_w$, and update $w$ by TD.
- Reducing variance using a baseline: we subtract a baseline function $b(s)$ from $Q^{\pi_\theta}(s, a)$ in the policy gradient; this will not change the expectation, since $\sum_a \nabla_\theta \pi_\theta(a \mid s)\, b(s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0.$
Letting $b(s) = V^{\pi_\theta}(s)$ (so the gradient weights become the advantage $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$) will decrease the variance of sampling.
Lecture 10
- Monte Carlo Tree Search (MCTS) (for a deterministic environment): closed-loop planning:
- Selection: start at the root, recursively select a child based on the Tree Policy and descend until an expandable node is reached.
- Expansion: add one or more child nodes to the tree.
- Simulation: simulate the remaining part of the game based on the default policy (or rollout policy) and get an estimated future reward.
- Backpropagation: propagate the estimated reward back up the selected path, updating the statistics of each visited node.
- Upper Confidence Bound on Trees (UCT): a tree policy. For a current node visited $N$ times, a child with average value $\bar{V}$ and visit count $n$ has UCT value $\bar{V} + c\sqrt{\frac{\ln N}{n}}$. Adjust $c$ to change the exploration vs. exploitation tradeoff.
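A sketch of the UCT child-selection rule, assuming each child stores its total value and visit count (the node representation and the value of $c$ are mine, not from the lecture):

```python
import math

def uct_select(children, N, c=1.4):
    """Pick the child maximizing mean value + c * sqrt(ln(N) / n).

    children: list of dicts with keys 'total_value' and 'visits';
    N: visit count of the current (parent) node.
    """
    def uct(child):
        if child['visits'] == 0:
            return float('inf')               # always try unvisited children first
        mean = child['total_value'] / child['visits']
        return mean + c * math.sqrt(math.log(N) / child['visits'])
    return max(children, key=uct)

children = [
    {'total_value': 3.0, 'visits': 5},
    {'total_value': 1.0, 'visits': 1},
    {'total_value': 0.0, 'visits': 0},
]
print(uct_select(children, N=6))    # the unvisited child is selected first
```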
Lecture 11
- Minimax search: a type of adversarial search; $V(s) = \max_a V(\mathrm{Succ}(s, a))$ at the maximizer's nodes, $V(s) = \min_a V(\mathrm{Succ}(s, a))$ at the minimizer's nodes, and $V(s)$ equals the terminal utility at terminal states.
- Alpha-beta search: an improved version of minimax search that prunes branches which cannot affect the final decision; with perfect ordering the number of explored nodes can drop from $O(b^m)$ to $O(b^{m/2})$.
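A sketch of minimax with alpha-beta pruning on a small made-up game tree given as nested lists, where leaves are terminal utilities:

```python
def alphabeta(node, maximizing, alpha=float('-inf'), beta=float('inf')):
    """Minimax value of `node`; prune branches that cannot affect the result."""
    if not isinstance(node, list):           # leaf: terminal utility
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                        # beta cutoff
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break                        # alpha cutoff
        return value

# Classic textbook-style tree: a max node over three min nodes.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, maximizing=True))      # 3
```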
- Heuristic Minimax: depth-limited search; replace terminal utilities with a heuristic evaluation function.
Lecture 12
- A probability distribution over a random variable associates a probability with each value. For a random variable $X$, we must have $P(X = x) \ge 0$ for every $x$, and $\sum_x P(X = x) = 1$.
- Joint distribution: a distribution over a set of random variables.
- Marginalization (summing out): combine collapsed rows by adding ($P(X = x) = \sum_y P(X = x, Y = y)$).
- Conditional probability: $P(a \mid b) = \frac{P(a, b)}{P(b)}$, where $P(a \mid b)$ stands for the probability of $a$ if it is known that $b$ is true.
- Bayes' rule: $P(a, b) = P(a \mid b)\, P(b) = P(b \mid a)\, P(a)$, so $P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)}$.
- Independence: two events $a, b$ are independent if and only if $P(a, b) = P(a)\, P(b)$, or alternatively $P(a \mid b) = P(a)$.
- Conditional independence: two events $a, b$ are conditionally independent given $c$ if and only if $P(a, b \mid c) = P(a \mid c)\, P(b \mid c)$, or alternatively $P(a \mid b, c) = P(a \mid c)$.
- Bayes' net: a technique for describing complex joint distributions (models) using simple, local distributions (conditional probabilities). The semantics of a Bayes' net are:
- A set of nodes, one per random variable $X$.
- A directed, acyclic graph.
- A conditional distribution $P(X \mid \mathrm{Parents}(X))$ for each node.
- Given a full assignment $(x_1, \dots, x_n)$ of a Bayes' Net, $P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i \mid \mathrm{parents}(X_i))$.
- Independence in a BN: given evidence variables $Z$, variables $X, Y$ are dependent (D-connected) if and only if there exists a path between $X$ and $Y$ consisting of active triples, where whether a triple is active or not depends on its middle node (and on whether it is in the evidence).
- BN inference: to calculate $P(Q \mid e)$, we repeatedly pick a hidden variable $H$, join all factors mentioning $H$, and eliminate (sum out) $H$, until there are no hidden variables left (a hidden variable is a variable not in $Q$ or $e$).
- BN inference is NP-hard.
Lecture 13
Approximate inference of Bayesian nets:
- Prior Sampling: ignore the evidence and sample from the joint distribution, then count the fraction of samples consistent with the evidence (and the query).
- Rejection Sampling: sample as in prior sampling, but reject a sample as soon as a newly sampled variable is not consistent with the evidence (see the sketch after this list).
- Likelihood Weighting: fix the evidence variables, sample only the non-evidence variables, and weigh each sample by the likelihood it accords the evidence.
- Gibbs Sampling (a Markov Chain Monte Carlo (MCMC) method): first fix the evidence and randomly initialize the non-evidence variables, then repeatedly iterate over every non-evidence variable $X_i$ and resample it according to $P(X_i \mid \text{all other variables}, e)$.
The posterior $P(\cdot \mid e)$ is stationary under resampling $X_i$: assume $x = (x_i, x_{-i})$ is distributed as $P(\cdot \mid e)$ and $x' = (x_i', x_{-i})$ is the assignment we obtain after resampling $X_i$; then
$P(x') = \sum_{x_i} P(x_i, x_{-i} \mid e)\, P(x_i' \mid x_{-i}, e) = P(x_{-i} \mid e)\, P(x_i' \mid x_{-i}, e) = P(x_i', x_{-i} \mid e),$
where the factor $P(x_i' \mid x_{-i}, e)$ appears because we sample $x_i'$ according to that distribution, and $x_{-i}$ is unchanged because we directly keep the other variables fixed.
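A sketch of the rejection-sampling idea from the list above on a tiny two-node network Rain → WetGrass with made-up CPTs, estimating $P(\text{Rain} = \text{true} \mid \text{WetGrass} = \text{true})$:

```python
import random

# Made-up CPTs for the two-node network Rain -> WetGrass.
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}

def sample_once():
    """Sample (rain, wet) from the joint distribution, parents first."""
    rain = random.random() < P_RAIN
    wet = random.random() < P_WET_GIVEN_RAIN[rain]
    return rain, wet

accepted, rainy = 0, 0
for _ in range(100000):
    rain, wet = sample_once()
    if not wet:
        continue                  # reject: inconsistent with evidence WetGrass = true
    accepted += 1
    rainy += rain

print(rainy / accepted)           # ~ 0.9*0.2 / (0.9*0.2 + 0.1*0.8) = 0.692...
```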
Lecture 14
- Hidden Markov Models: an underlying Markov chain over hidden states $X_t$; the agent observes outputs (effects) $E_t$ at each time step.
- Dynamic Bayes Nets (DBNs): HMMs can be seen as BNs with a specific structure; we can generalize this to Dynamic Bayes Nets. To obtain the filtering distribution $P(X_t \mid e_{1:t})$, we can apply variable elimination.
- MLE (most likely explanation) queries ($\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$) for HMMs: the Viterbi algorithm. $m_t[x_t] = \max_{x_{1:t-1}} P(x_{1:t-1}, x_t, e_{1:t}) = P(e_t \mid x_t)\, \max_{x_{t-1}} P(x_t \mid x_{t-1})\, m_{t-1}[x_{t-1}]$.
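A sketch of the Viterbi recursion above with dictionary-based prior, transition, and emission tables; the toy weather/umbrella numbers are made up:

```python
def viterbi(obs, states, prior, trans, emit):
    """Most likely hidden state sequence x_{1:T} given observations e_{1:T}."""
    # m[x] = max over prefixes of P(x_{1:t-1}, x_t = x, e_{1:t}); back stores argmaxes.
    m = {x: prior[x] * emit[x][obs[0]] for x in states}
    back = []
    for e in obs[1:]:
        prev_m, m, ptr = m, {}, {}
        for x in states:
            best_prev = max(states, key=lambda xp: prev_m[xp] * trans[xp][x])
            m[x] = emit[x][e] * prev_m[best_prev] * trans[best_prev][x]
            ptr[x] = best_prev
        back.append(ptr)
    # Trace the best final state back through the stored pointers.
    x = max(states, key=lambda s: m[s])
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return list(reversed(path))

states = ['rain', 'sun']
prior = {'rain': 0.5, 'sun': 0.5}
trans = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
emit = {'rain': {'umbrella': 0.9, 'none': 0.1}, 'sun': {'umbrella': 0.2, 'none': 0.8}}
print(viterbi(['umbrella', 'umbrella', 'none'], states, prior, trans, emit))
```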