Study Notes for 人工智能:原理与技术 (Artificial Intelligence: Principles and Techniques)
Lecture 2
- Supervised learning: regression, classification, ...
- Unsupervised learning: clustering, dimensionality reduction, ...
- The canonical machine learning problem: given a set of training data $\{(x^{(i)}, y^{(i)})\}_{i=1}^m$ and a loss function $\ell$, find the parameters $\theta$ that minimize the sum of losses $\sum_{i=1}^m \ell(h_\theta(x^{(i)}), y^{(i)})$.
- Linear least squares problem: linear hypothesis function $h_\theta(x) = \theta^T x$, and squared loss $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$.
Solutions to the linear least squares problem:
- Gradient descent: repeat $\theta \leftarrow \theta - \alpha \sum_{i=1}^m (\theta^T x^{(i)} - y^{(i)})\, x^{(i)}$.
- Analytical method: let $X$ be the matrix whose rows are the $x^{(i)T}$ and $y$ the vector of targets; then the objective is $\|X\theta - y\|_2^2$ with gradient $2X^T(X\theta - y)$. Solving the equation $X^T X \theta = X^T y$, we get $\theta = (X^T X)^{-1} X^T y$.
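A minimal NumPy sketch of both solutions, assuming the rows of `X` are the feature vectors $x^{(i)}$ and `y` holds the targets (the synthetic data and step size below are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # 100 examples, 3 features
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

# Analytical method: solve the normal equations X^T X theta = X^T y.
theta_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared loss.
theta, alpha = np.zeros(3), 0.1
for _ in range(1000):
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= alpha * grad

print(theta_exact)   # close to theta_true
print(theta)         # close to theta_exact
```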
- Linear regression: the hypothesis function used to fit the training data is linear in the parameters, given a particular choice of input features.
- Linear classification (for two classes): the hypothesis function is linear in the parameters given a particular choice of input features, and the output class is predicted by the sign of the hypothesis function ($y \in \{-1, +1\}$, $\hat{y} = \mathrm{sign}(h_\theta(x))$).
- Loss functions for classification:
- 0/1 loss $\ell_{0/1}(h_\theta(x), y) = \mathbf{1}\{y \cdot h_\theta(x) \le 0\}$ (NP-hard to minimize directly).
- Surrogate losses such as the hinge loss $\max\{1 - y \cdot h_\theta(x), 0\}$ and the logistic loss $\log(1 + e^{-y \cdot h_\theta(x)})$.
Typically there is no closed-form solution, but these surrogates are solvable by gradient descent.
- Support vector machine: solves the canonical machine learning optimization problem using the hinge loss and a linear hypothesis, i.e. $\min_\theta \sum_{i=1}^m \max\{1 - y^{(i)} \cdot \theta^T x^{(i)}, 0\}$.
- Logistic regression: solves the canonical machine learning optimization problem using the logistic loss and a linear hypothesis, i.e. $\min_\theta \sum_{i=1}^m \log\left(1 + e^{-y^{(i)} \cdot \theta^T x^{(i)}}\right)$.
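A small gradient-descent sketch for the logistic loss above, with labels in $\{-1, +1\}$; the data, step size, and iteration count are made up for illustration:

```python
import numpy as np

def logistic_loss_grad(theta, X, y):
    """Gradient of sum_i log(1 + exp(-y_i * theta^T x_i)), labels y_i in {-1, +1}."""
    z = y * (X @ theta)
    s = 0.5 * (1.0 - np.tanh(z / 2.0))   # equals sigmoid(-z), numerically stable
    return -(X.T @ (y * s))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 2 * X[:, 1] > 0, 1.0, -1.0)   # made-up linearly separable labels

theta, alpha = np.zeros(2), 0.01
for _ in range(2000):
    theta -= alpha * logistic_loss_grad(theta, X, y)

print("training accuracy:", np.mean(np.sign(X @ theta) == y))
```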
- Multiclass classification: build $k$ different classifiers $h_{\theta_1}, \dots, h_{\theta_k}$ and output the prediction $\hat{y} = \arg\max_i h_{\theta_i}(x)$. The loss function is defined as $\ell(h_\theta(x), y) = \log\left(\sum_{i=1}^k e^{h_{\theta_i}(x)}\right) - h_{\theta_y}(x)$ (called the softmax loss or cross-entropy loss).
- Overfitting: as the model becomes more complex, the training loss always decreases, while the generalization loss decreases up to a point and then starts to increase.
- Cross-validation: divide the data set into a training set and a holdout, use the training set to determine the parameters, and use the holdout/validation set to determine the hyperparameters (degree of the polynomial, $\lambda$ in regularization, ...).
- Regularization: add a term $\lambda \|\theta\|_2^2$ to the loss function, i.e. minimize $\sum_{i=1}^m \ell(h_\theta(x^{(i)}), y^{(i)}) + \lambda \|\theta\|_2^2$, where $\lambda$ is a hyperparameter (when the model becomes complex, the parameters tend to become large in order to overfit the training data).
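For the least squares case, the regularized objective still has a closed-form solution (ridge regression), and $\lambda$ can be chosen on the holdout set as described above. A sketch, with an illustrative split and candidate $\lambda$ values:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=60)

X_tr, y_tr = X[:40], y[:40]              # training set: fit the parameters
X_ho, y_ho = X[40:], y[40:]              # holdout set: pick the hyperparameter

def ridge_fit(X, y, lam):
    """Minimize ||X theta - y||^2 + lam * ||theta||^2: solve (X^T X + lam I) theta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

scores = {lam: np.mean((X_ho @ ridge_fit(X_tr, y_tr, lam) - y_ho) ** 2)
          for lam in [0.0, 0.01, 0.1, 1.0, 10.0]}
print(min(scores, key=scores.get), scores)   # lambda with the lowest holdout MSE
```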
- Even though we used a training/holdout split to fit the parameters, we are still effectively fitting the hyperparameters to the holdout set. Use a separate test set to evaluate the performance. The best solutions are: evaluate your system “in the wild” as often as possible, recollect data if you suspect overfitting to the present data, ...
Lecture 3
- Neural network: a composition of non-linear functions.
- Elements of a neural network: weights $W_i$, biases $b_i$, activation function $f$ (each layer computes $z_{i+1} = f(W_i z_i + b_i)$).
- Common activation functions:
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$.
- Rectified linear unit (ReLU): $\mathrm{ReLU}(x) = \max\{x, 0\}$.
- Hyperbolic tangent: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$.
- Stochastic Gradient Descent (SGD): adjust the parameters based upon just one random sample (or a small random collection of samples, called a batch), i.e. $\theta \leftarrow \theta - \alpha \nabla_\theta \ell(h_\theta(x^{(i)}), y^{(i)})$ for some random $i$.
- Backpropagation algorithm: use the chain rule to recursively calculate the partial derivative of the loss function with respect to every parameter, from the last layer to the first.
- Momentum: $v \leftarrow \beta v + \nabla_\theta \ell(\theta)$ and $\theta \leftarrow \theta - \alpha v$. Usually $\beta \approx 0.9$.
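A sketch of mini-batch SGD with momentum on the least squares objective from Lecture 2, assuming the standard form $v \leftarrow \beta v + g$, $\theta \leftarrow \theta - \alpha v$ (the data and hyperparameters are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.1 * rng.normal(size=500)

theta, v = np.zeros(5), np.zeros(5)          # parameters and momentum buffer
alpha, beta, batch = 0.05, 0.9, 32

for step in range(2000):
    idx = rng.integers(0, len(y), size=batch)            # random mini-batch
    grad = X[idx].T @ (X[idx] @ theta - y[idx]) / batch  # mini-batch gradient
    v = beta * v + grad                                  # v <- beta * v + g
    theta -= alpha * v                                   # theta <- theta - alpha * v

print(np.max(np.abs(theta - theta_true)))    # small, up to mini-batch noise
```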
Lecture 4
- Problems with fully connected networks: they need a very large number of parameters and are very likely to overfit the data; a generic deep network also does not capture the “natural” invariances we expect in images (translation, scale).
- Convolutional Neural Networks have 4 types of layers:
- Convolution: require that activations between layers only occur in a “local” manner; require that all activations share the same weights.
- Non-linearity: Rectified Linear Unit (ReLU). Advantages: 1. fast to compute; 2. no cancellation problem; 3. a sparser activation volume; 4. helps with the vanishing gradient problem.
- Pooling (or downsampling), sketched in code after this list:
- Pick a window size.
- Pick a stride.
- Walk the window across the image.
- For each window, take the maximum value.
- Fully Connected Layer
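A sketch of the max-pooling procedure from the list above for a single 2D activation map, assuming a square window and no padding (pure NumPy, just to make the walk-the-window loop concrete):

```python
import numpy as np

def max_pool2d(x, window=2, stride=2):
    """Slide a window over a 2D array and take the max in each window."""
    h, w = x.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = x[r:r + window, c:c + window].max()
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))        # [[ 5.  7.] [13. 15.]]
```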
- Recurrent Neural Network: a type of neural network with memory; the outputs of the hidden layer are stored in the memory, and the memory can be considered as another input. It is used to deal with sequential data.
- Problems with Vanilla RNNs:
- Exploding gradients → gradient clipping.
- Vanishing gradients → change the RNN architecture (e.g. LSTM).
Lecture 5
- A search problem consists of:
- State space
- Start state
- Possible actions $A(s)$
- Successor function $\mathrm{Succ}(s, a)$
- Action cost $c(s, a)$
- Goal test $\mathrm{IsGoal}(s)$.
A solution is a sequence of actions (a plan) which transforms the start state to a goal state.
- Search heuristic: a function $h(s)$ that estimates how close a state is to a goal.
- Uninformed search (e.g. uniform cost search, which expands nodes in order of the cost so far $g(s)$): cannot tell whether one non-goal state is more promising than another. Informed search: can know whether one non-goal state is more promising than another (via a heuristic).
- Admissible heuristic: $0 \le h(s) \le h^*(s)$, where $h^*(s)$ is the true cost from $s$ to the nearest goal. If $h$ is admissible, A* is optimal (it finds a min-cost solution) and also optimally efficient (it expands the minimum number of nodes, which means that if you expand fewer nodes than A* expands, you can't ensure your solution is optimal).
- A* tree search: expand the available nodes in increasing order of $f(s) = g(s) + h(s)$, where $g(s)$ is the least cost from the starting state to $s$. The algorithm is optimal if the heuristic is admissible. But the time complexity of A* tree search can be exponential.
- Consistent heuristic: $h(s) \le c(s, a) + h(\mathrm{Succ}(s, a))$ for every state $s$ and action $a$. If $h$ is consistent, then the first time we expand some state, we have already obtained the shortest path to that state. So every state is expanded at most once.
- A* graph search: similar to A* tree search, but guarantees that each node is expanded at most once. The algorithm is optimal if the heuristic is consistent.
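A sketch of A* graph search with an explicit closed set, assuming the graph is given as an adjacency dict of (neighbor, cost) pairs and `h` is a consistent heuristic; the toy graph and heuristic values are made up:

```python
import heapq

def a_star(graph, start, goal, h):
    """A* graph search: expand each state at most once, in order of f = g + h."""
    frontier = [(h(start), 0, start, [start])]      # (f, g, state, path)
    closed = set()
    while frontier:
        f, g, s, path = heapq.heappop(frontier)
        if s in closed:
            continue
        closed.add(s)
        if s == goal:
            return g, path
        for nxt, cost in graph.get(s, []):
            if nxt not in closed:
                g2 = g + cost
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Toy graph; h is admissible and consistent for these costs.
graph = {
    'A': [('B', 1), ('C', 4)],
    'B': [('C', 2), ('D', 5)],
    'C': [('D', 1)],
}
h = {'A': 3, 'B': 2, 'C': 1, 'D': 0}.get
print(a_star(graph, 'A', 'D', h))    # (4, ['A', 'B', 'C', 'D'])
```

Because the heuristic here is consistent, each state is popped from the frontier at most once, matching the guarantee stated above.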
Lecture 6
- A Markov Decision Problem (MDP) consists of: a state space $S$, a start state, actions $A$, a transition function $P(s' \mid s, a)$, a reward function $R(s)$ (sometimes $R(s, a)$ or $R(s, a, s')$), and (maybe) a terminal state.
- A solution to an MDP is a policy:
- Non-stationary policy: a function from states and times to actions.
- Stationary policy: a mapping from states to actions.
- Both non-stationary and stationary policies satisfy the following properties:
- Full observability of the state
- History-independence
- Deterministic action choice
- Infinite horizon discounted value: $V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \,\middle|\, s_0 = s, \pi\right]$. It follows the Bellman equation: $V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')$.
- Value iteration: repeatedly apply the Bellman backup operator $B$: $[BV](s) = R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a)\, V(s')$. The operator $B$ is a contraction: $\|BV - BV'\|_\infty \le \gamma \|V - V'\|_\infty$ for any $V, V'$. So there is a unique fixed point $V^*$, with the property $BV^* = V^*$.
So $V_k = B^k V_0 \to V^*$ if we start from any $V_0$. The greedy policy $\pi^*(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^*(s')$ is optimal: by induction we can prove that $V^{\pi^*} = V^* \ge V^\pi$ for every policy $\pi$.
- The convergence rate of value iteration: assume rewards are in $[0, R_{\max}]$; then $V^*(s) \le \frac{R_{\max}}{1 - \gamma}$, so we can prove that $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty \le \gamma^k \frac{R_{\max}}{1 - \gamma}$ by induction. In other words, we have linear convergence to the optimal value function.
- Stopping condition: since $\|V_{k+1} - V^*\|_\infty \le \frac{\gamma}{1 - \gamma} \|V_{k+1} - V_k\|_\infty$, we can continue the iteration until $\|V_{k+1} - V_k\|_\infty \le \epsilon$ for some small constant $\epsilon$.
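A sketch of value iteration with this stopping condition on a tiny made-up MDP, using the $R(s)$ reward convention from above (the MDP, $\gamma$, and $\epsilon$ are illustrative):

```python
# A tiny 3-state MDP (made up): action 0 is "safe", action 1 is "risky".
states, actions, gamma = [0, 1, 2], [0, 1], 0.9
R = {0: 0.0, 1: 0.0, 2: 1.0}
P = {
    0: {0: [(1.0, 1)], 1: [(0.5, 2), (0.5, 0)]},
    1: {0: [(1.0, 2)], 1: [(1.0, 0)]},
    2: {0: [(1.0, 2)], 1: [(1.0, 2)]},       # state 2 is absorbing
}

V, eps = {s: 0.0 for s in states}, 1e-6
while True:
    # Bellman backup: [BV](s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
    V_new = {
        s: R[s] + gamma * max(sum(p * V[s2] for p, s2 in P[s][a]) for a in actions)
        for s in states
    }
    done = max(abs(V_new[s] - V[s]) for s in states) <= eps
    V = V_new
    if done:
        break

policy = {s: max(actions, key=lambda a: sum(p * V[s2] for p, s2 in P[s][a]))
          for s in states}
print(V, policy)
```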
- Policy iteration: alternate policy evaluation of $V^{\pi_k}$ and policy improvement $\pi_{k+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^{\pi_k}(s')$; $V^{\pi_k}$ can be calculated by solving a linear system.
- Modified policy iteration: use the Bellman update for the fixed policy, repeated a bounded number of times, for policy evaluation (instead of solving a linear system).
Lecture 7
- Reinforcement Learning: we don't know the transition function and reward function of the MDP.
- Model-based approach to RL: learn the MDP model, or an approximation of it, and use it for policy evaluation or to find the optimal policy.
Model-free approach to RL: derive the optimal policy without explicitly learning the model.
- Passive learning: the agent has a fixed policy $\pi$ and tries to learn the utilities of states by observing the world go by.
- Approach 1 (model-free): Monte-Carlo direct estimation: directly estimate $V^\pi(s)$ as the average total reward of episodes containing $s$ (reward accumulated from $s$ onward).
- Approach 2 (model-based): Adaptive Dynamic Programming (ADP): estimate $P(s' \mid s, a)$ and $R(s)$ by sampling, then use the estimated model to compute the utility of the policy.
- Approach 3 (model-free): Temporal Difference Learning (TD). After we take an action from $s$ to $s'$, update $V^\pi(s) \leftarrow V^\pi(s) + \alpha\left(R(s) + \gamma V^\pi(s') - V^\pi(s)\right)$. Intuition: move the estimate toward the sampled target $R(s) + \gamma V^\pi(s')$. If we use a decreasing learning rate (e.g. $\alpha_k = 1/k$) the estimate will converge to the true value.
- Active learning: the agent does not have a fixed policy.
- ADP-based RL: start with an initial model, solve for the optimal policy given the current model (using value or policy iteration), take an action according to an exploration/exploitation policy, and update the estimated model based on an observed transition.
- TD-based RL: start with an initial value function, take an action from an exploration/exploitation policy giving a new state $s'$ (the policy should converge to the greedy policy), update the estimated model, and perform the TD update $V(s) \leftarrow V(s) + \alpha\left(R(s) + \gamma V(s') - V(s)\right)$.
- Q-learning ($Q(s, a)$ is the expected value of taking action $a$ in state $s$ and then following the optimal policy thereafter): start with an initial Q-function, take an action from an exploration/exploitation policy giving a new state $s'$, and perform the TD update $Q(s, a) \leftarrow Q(s, a) + \alpha\left(R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)$. A sketch follows this list.
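A tabular Q-learning sketch with an $\epsilon$-greedy exploration policy on a small made-up chain environment (the environment and hyperparameters are illustrative, not from the lecture):

```python
import random

# Made-up 5-state chain: actions 0 = left, 1 = right; reward 1 at the right end.
N_STATES, ACTIONS, GAMMA, ALPHA, EPS = 5, [0, 1], 0.9, 0.1, 0.1

def step(s, a):
    """Return (reward, next_state); reaching the right end terminates the episode."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    if s2 == N_STATES - 1:
        return 1.0, None
    return 0.0, s2

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(2000):
    s = 0
    while s is not None:
        # epsilon-greedy exploration/exploitation policy
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        r, s2 = step(s, a)
        target = r if s2 is None else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

print({s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES)})
```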
- Exploration policy: we would like an exploration policy that is greedy in the limit of infinite exploration (GLIE).
- GLIE Policy 1 ($\epsilon$-greedy): act randomly with probability $\epsilon_k$ and greedily with probability $1 - \epsilon_k$, with $\epsilon_k \to 0$ as $k \to \infty$ (e.g. $\epsilon_k = 1/k$).
- GLIE Policy 2: Boltzmann Exploration. $\Pr(a \mid s) = \frac{e^{Q(s, a)/T}}{\sum_{a'} e^{Q(s, a')/T}}$, where $T$ is the temperature. Large $T$ means that each action has about the same probability; small $T$ leads to more greedy behavior. Typically start with a large $T$ and decrease it with time.
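A small sketch of Boltzmann action selection for one state's Q-values (the Q-values and temperatures are made up); it also shows how the temperature controls the spread of the distribution:

```python
import math
import random

def boltzmann_action(q_values, T):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    m = max(q_values)                                  # subtract max for stability
    weights = [math.exp((q - m) / T) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

q = [1.0, 1.2, 0.5]
for T in [10.0, 1.0, 0.01]:
    picks = [boltzmann_action(q, T) for _ in range(10000)]
    print(T, [picks.count(a) / len(picks) for a in range(len(q))])
# Large T: roughly uniform; small T: almost always the greedy action (index 1).
```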
Lecture 8
- Linear function approximation: define a set of state features $f_1(s), \dots, f_n(s)$, and let $\hat{V}_\theta(s) = \theta^T f(s)$. We can learn the parameters $\theta$ by gradient descent. Since the loss function is written in the form $\frac{1}{2}\left(\hat{V}_\theta(s) - V^\pi(s)\right)^2$, we can use the TD target $R(s) + \gamma \hat{V}_\theta(s')$ in place of $V^\pi(s)$ and perform $\theta \leftarrow \theta + \alpha\left(R(s) + \gamma \hat{V}_\theta(s') - \hat{V}_\theta(s)\right) f(s)$ to update $\theta$.
- Q-function approximation: let $\hat{Q}_\theta(s, a) = \theta^T f(s, a)$. Similar to Q-learning, perform $\theta \leftarrow \theta + \alpha\left(R(s) + \gamma \max_{a'} \hat{Q}_\theta(s', a') - \hat{Q}_\theta(s, a)\right) f(s, a)$ after taking an action chosen by a GLIE policy.
- Deep Q Learning (DQN): use a neural network to represent the Q-function and learn the parameters of the neural network. But we still need some methods to improve stability (experience replay, a fixed target network, reward range clipping).
- Double Q learning: train two action-value functions, $Q_1$ and $Q_2$, and do Q-learning on both, but
- never on the same time steps;
- pick $Q_1$ or $Q_2$ at random to be updated on each step;
- when we update $Q_1$, use $Q_2$ for the value of the next state, i.e., $Q_1(s, a) \leftarrow Q_1(s, a) + \alpha\left(R(s) + \gamma Q_2\!\left(s', \arg\max_{a'} Q_1(s', a')\right) - Q_1(s, a)\right)$ (and symmetrically when updating $Q_2$).
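A sketch of a single double Q-learning update step as described above: pick which table to update at random and use the other table to evaluate the argmax action (the tables and the transition tuple are placeholders; terminal transitions, where the target is just the reward, are omitted):

```python
import random

ACTIONS, GAMMA, ALPHA = [0, 1], 0.9, 0.1
Q1 = {(s, a): 0.0 for s in range(3) for a in ACTIONS}
Q2 = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def double_q_update(s, a, r, s2):
    """One double Q-learning update for the observed transition (s, a, r, s2)."""
    if random.random() < 0.5:
        A, B = Q1, Q2            # update Q1, evaluate with Q2
    else:
        A, B = Q2, Q1            # update Q2, evaluate with Q1
    a_star = max(ACTIONS, key=lambda b: A[(s2, b)])   # argmax under the updated table
    target = r + GAMMA * B[(s2, a_star)]              # value taken from the other table
    A[(s, a)] += ALPHA * (target - A[(s, a)])

double_q_update(s=0, a=1, r=0.5, s2=1)   # example transition (made up)
print(Q1, Q2)
```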
- Maximization bias of Q-learning: after Q-learning, we only have an estimate $\hat{Q}$ of the real $Q$. If we simply see $\hat{Q}(s, a)$ as a random variable satisfying $\mathbb{E}[\hat{Q}(s, a)] = Q(s, a)$, then we will have $\mathbb{E}\left[\max_a \hat{Q}(s, a)\right] \ge \max_a \mathbb{E}[\hat{Q}(s, a)] = \max_a Q(s, a)$, that is, we have an overestimation of $\max_a Q(s, a)$.
- Double Q learning solves the maximization bias: if $\hat{Q}_1, \hat{Q}_2$ are two independent estimates of $Q$ such that $\mathbb{E}[\hat{Q}_i(s, a)] = Q(s, a)$, and we estimate the maximum by $\hat{Q}_2\!\left(s, \arg\max_a \hat{Q}_1(s, a)\right)$, then $\mathbb{E}\left[\hat{Q}_2\!\left(s, \arg\max_a \hat{Q}_1(s, a)\right)\right] \le \max_a Q(s, a)$.
Lecture 9
- Policy learning: directly learn the action distribution $\pi_\theta(a \mid s)$ for each state $s$. View the expected return $J(\theta)$ as a function of $\theta$ and do gradient ascent on it (equivalently, gradient descent on $-J(\theta)$).
- Actually we can calculate the policy gradient analytically:
$\nabla_\theta V^{\pi_\theta}(s) = \sum_a \left[\nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a) + \pi_\theta(a \mid s)\, \nabla_\theta Q^{\pi_\theta}(s, a)\right].$
By continuing to expand $\nabla_\theta Q^{\pi_\theta}$ for infinitely many layers, we may obtain
$\nabla_\theta J(\theta) \propto \sum_s d^{\pi_\theta}(s) \sum_a Q^{\pi_\theta}(s, a)\, \nabla_\theta \pi_\theta(a \mid s) = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\!\left[Q^{\pi_\theta}(s, a)\, \nabla_\theta \log \pi_\theta(a \mid s)\right],$
where $\Pr(s_0 \to s, k, \pi_\theta)$ is the probability that the agent reaches $s$ from $s_0$ in $k$ steps, and $d^{\pi_\theta}(s) \propto \sum_{k=0}^{\infty} \gamma^k \Pr(s_0 \to s, k, \pi_\theta)$ is the stationary (discounted state-visitation) distribution.
According to this, we may sample some random trajectories under $\pi_\theta$ and estimate the gradient by averaging $Q^{\pi_\theta}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ over the sampled steps (see the sketch below).
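A sketch of this Monte-Carlo policy-gradient idea (REINFORCE-style) for a softmax policy over two actions in a one-step bandit, using the sampled return in place of $Q^{\pi_\theta}$; the reward setup and hyperparameters are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
true_reward = np.array([1.0, 2.0])             # made-up expected reward per action

theta, alpha = np.zeros(2), 0.05               # softmax policy parameters, step size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(3000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                    # sample an action from the policy
    G = true_reward[a] + rng.normal(0, 0.5)    # sampled (noisy) return
    grad_log_pi = -pi                          # grad of log pi(a): one-hot(a) - pi
    grad_log_pi[a] += 1.0
    theta += alpha * G * grad_log_pi           # gradient ascent on expected return

print(softmax(theta))   # should put most probability on the better action (index 1)
```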
- Q Actor-Critic: in order to decrease the variance of sampling, we use a critic (a new parameter vector $w$) to estimate the Q-function $Q_w(s, a)$; in every step, we update $\theta$ by a policy-gradient step using $Q_w$, and update $w$ by TD.
- Reducing variance using a baseline: we subtract a baseline function $b(s)$ from $Q^{\pi_\theta}(s, a)$ in the policy gradient; this will not change the expectation, since $\sum_a \nabla_\theta \pi_\theta(a \mid s)\, b(s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0.$
Letting $b(s) = V^{\pi_\theta}(s)$ (so the gradient weights become the advantage $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$) will decrease the variance of sampling.
Lecture 10
- Monte Carlo Tree Search (MCTS) (for a deterministic environment): closed-loop planning:
- Selection: start at the root, recursively select a child based on the Tree Policy and descend until an expandable node is reached.
- Expansion: add one or more child nodes to the tree.
- Simulation: simulate the remaining part of the game based on the default policy (or rollout policy) and get an estimated future reward.
- Backpropagation: propagate the estimated reward back up the selected path, updating the statistics of each visited node.
- Upper Confidence Bound on Trees (UCT): a tree policy. For a current node visited $N$ times, a child with average value $\bar{V}$ and visit count $n$ has UCT value $\bar{V} + c\sqrt{\frac{\ln N}{n}}$. Adjust $c$ to change the exploration vs. exploitation tradeoff.
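A sketch of the UCT child-selection rule, assuming each child stores its total value and visit count (the node representation and the value of $c$ are mine, not from the lecture):

```python
import math

def uct_select(children, N, c=1.4):
    """Pick the child maximizing mean value + c * sqrt(ln(N) / n).

    children: list of dicts with keys 'total_value' and 'visits';
    N: visit count of the current (parent) node.
    """
    def uct(child):
        if child['visits'] == 0:
            return float('inf')               # always try unvisited children first
        mean = child['total_value'] / child['visits']
        return mean + c * math.sqrt(math.log(N) / child['visits'])
    return max(children, key=uct)

children = [
    {'total_value': 3.0, 'visits': 5},
    {'total_value': 1.0, 'visits': 1},
    {'total_value': 0.0, 'visits': 0},
]
print(uct_select(children, N=6))    # the unvisited child is selected first
```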
Lecture 11
- Minimax search: a type of adversarial search; $V(s) = \max_a V(\mathrm{Succ}(s, a))$ at the maximizer's nodes, $V(s) = \min_a V(\mathrm{Succ}(s, a))$ at the minimizer's nodes, and $V(s)$ equals the terminal utility at terminal states.
- Alpha-beta search: an improved version of minimax search that prunes branches which cannot affect the final decision; with perfect ordering the number of explored nodes can drop from $O(b^m)$ to $O(b^{m/2})$.
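A sketch of minimax with alpha-beta pruning on a small made-up game tree given as nested lists, where leaves are terminal utilities:

```python
def alphabeta(node, maximizing, alpha=float('-inf'), beta=float('inf')):
    """Minimax value of `node`; prune branches that cannot affect the result."""
    if not isinstance(node, list):           # leaf: terminal utility
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                        # beta cutoff
        return value
    else:
        value = float('inf')
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break                        # alpha cutoff
        return value

# Classic textbook-style tree: a max node over three min nodes.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, maximizing=True))      # 3
```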
- Heuristic Minimax: depth-limited search; replace terminal utilities with a heuristic evaluation function.
Lecture 12
- A probability distribution over a random variable associates a probability with each value. For a random variable $X$, we must have $P(X = x) \ge 0$ for every $x$, and $\sum_x P(X = x) = 1$.
- Joint distribution: a distribution over a set of random variables.
- Marginalization (summing out): combine collapsed rows by adding ($P(X = x) = \sum_y P(X = x, Y = y)$).
- Conditional probability: $P(a \mid b) = \frac{P(a, b)}{P(b)}$, where $P(a \mid b)$ stands for the probability of $a$ if it is known that $b$ is true.
- Bayes' rule: $P(a, b) = P(a \mid b)\, P(b) = P(b \mid a)\, P(a)$, so $P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)}$.
- Independence: two events $a, b$ are independent if and only if $P(a, b) = P(a)\, P(b)$, or alternatively $P(a \mid b) = P(a)$.
- Conditional independence: two events $a, b$ are conditionally independent given $c$ if and only if $P(a, b \mid c) = P(a \mid c)\, P(b \mid c)$, or alternatively $P(a \mid b, c) = P(a \mid c)$.
- Bayes' net: a technique for describing complex joint distributions (models) using simple, local distributions (conditional probabilities). The semantics of a Bayes' net are:
- A set of nodes, one per random variable $X$.
- A directed, acyclic graph.
- A conditional distribution $P(X \mid \mathrm{Parents}(X))$ for each node.
- Given a full assignment $(x_1, \dots, x_n)$ of a Bayes' Net, $P(x_1, \dots, x_n) = \prod_{i=1}^n P(x_i \mid \mathrm{parents}(X_i))$.
- Independence in a BN: given evidence variables $Z$, variables $X, Y$ are dependent (D-connected) if and only if there exists a path between $X$ and $Y$ consisting of active triples, where whether a triple is active or not depends on its middle node (and on whether it is in the evidence).
- BN inference: to calculate $P(Q \mid e)$, we repeatedly pick a hidden variable $H$, join all factors mentioning $H$, and eliminate (sum out) $H$, until there are no hidden variables left (a hidden variable is a variable not in $Q$ or $e$).
- BN inference is NP-hard.
Lecture 13
Approximate inference of Bayesian nets:
- Prior Sampling: ignore the evidence and sample from the joint distribution, then count the fraction of samples consistent with the evidence (and the query).
- Rejection Sampling: sample as in prior sampling, but reject a sample as soon as a newly sampled variable is not consistent with the evidence (see the sketch after this list).
- Likelihood Weighting: fix the evidence variables, sample only the non-evidence variables, and weigh each sample by the likelihood it accords the evidence.
- Gibbs Sampling (a Markov Chain Monte Carlo (MCMC) method): first fix the evidence and randomly initialize the non-evidence variables, then repeatedly iterate over every non-evidence variable $X_i$ and resample it according to $P(X_i \mid \text{all other variables}, e)$.
The posterior $P(\cdot \mid e)$ is stationary under resampling $X_i$: assume $x = (x_i, x_{-i})$ is distributed as $P(\cdot \mid e)$ and $x' = (x_i', x_{-i})$ is the assignment we obtain after resampling $X_i$; then
$P(x') = \sum_{x_i} P(x_i, x_{-i} \mid e)\, P(x_i' \mid x_{-i}, e) = P(x_{-i} \mid e)\, P(x_i' \mid x_{-i}, e) = P(x_i', x_{-i} \mid e),$
where the factor $P(x_i' \mid x_{-i}, e)$ appears because we sample $x_i'$ according to that distribution, and $x_{-i}$ is unchanged because we directly keep the other variables fixed.
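A sketch of the rejection-sampling idea from the list above on a tiny two-node network Rain → WetGrass with made-up CPTs, estimating $P(\text{Rain} = \text{true} \mid \text{WetGrass} = \text{true})$:

```python
import random

# Made-up CPTs for the two-node network Rain -> WetGrass.
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}

def sample_once():
    """Sample (rain, wet) from the joint distribution, parents first."""
    rain = random.random() < P_RAIN
    wet = random.random() < P_WET_GIVEN_RAIN[rain]
    return rain, wet

accepted, rainy = 0, 0
for _ in range(100000):
    rain, wet = sample_once()
    if not wet:
        continue                  # reject: inconsistent with evidence WetGrass = true
    accepted += 1
    rainy += rain

print(rainy / accepted)           # ~ 0.9*0.2 / (0.9*0.2 + 0.1*0.8) = 0.692...
```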
Lecture 14
- Hidden Markov Models: an underlying Markov chain over hidden states $X_t$; the agent observes outputs (effects) $E_t$ at each time step.
- Dynamic Bayes Nets (DBNs): HMMs can be seen as BNs with a specific structure; we can generalize this to Dynamic Bayes Nets. To obtain the filtering distribution $P(X_t \mid e_{1:t})$, we can apply variable elimination.
- MLE (most likely explanation) queries ($\arg\max_{x_{1:t}} P(x_{1:t} \mid e_{1:t})$) for HMMs: the Viterbi algorithm. $m_t[x_t] = \max_{x_{1:t-1}} P(x_{1:t-1}, x_t, e_{1:t}) = P(e_t \mid x_t)\, \max_{x_{t-1}} P(x_t \mid x_{t-1})\, m_{t-1}[x_{t-1}]$.
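A sketch of the Viterbi recursion above with dictionary-based prior, transition, and emission tables; the toy weather/umbrella numbers are made up:

```python
def viterbi(obs, states, prior, trans, emit):
    """Most likely hidden state sequence x_{1:T} given observations e_{1:T}."""
    # m[x] = max over prefixes of P(x_{1:t-1}, x_t = x, e_{1:t}); back stores argmaxes.
    m = {x: prior[x] * emit[x][obs[0]] for x in states}
    back = []
    for e in obs[1:]:
        prev_m, m, ptr = m, {}, {}
        for x in states:
            best_prev = max(states, key=lambda xp: prev_m[xp] * trans[xp][x])
            m[x] = emit[x][e] * prev_m[best_prev] * trans[best_prev][x]
            ptr[x] = best_prev
        back.append(ptr)
    # Trace the best final state back through the stored pointers.
    x = max(states, key=lambda s: m[s])
    path = [x]
    for ptr in reversed(back):
        x = ptr[x]
        path.append(x)
    return list(reversed(path))

states = ['rain', 'sun']
prior = {'rain': 0.5, 'sun': 0.5}
trans = {'rain': {'rain': 0.7, 'sun': 0.3}, 'sun': {'rain': 0.3, 'sun': 0.7}}
emit = {'rain': {'umbrella': 0.9, 'none': 0.1}, 'sun': {'umbrella': 0.2, 'none': 0.8}}
print(viterbi(['umbrella', 'umbrella', 'none'], states, prior, trans, emit))
```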