[RL] Lecture 6: Stochastic Approximation and Stochastic Gradient Descent
6.1 Motivating examples
Mean Estimation
Revisit the mean estimation problem:
- Consider a random variable $X$.
- Our aim is to estimate $\mathbb{E}[X]$.
- Suppose that we have collected a sequence of iid samples $\{x_i\}_{i=1}^{N}$.
- The expectation of $X$ can be approximated by
$$\mathbb{E}[X] \approx \bar{x} := \frac{1}{N}\sum_{i=1}^{N} x_i.$$
That is, sample $N$ times, collect all the data, and take the average.
We already know from the last lecture:
- This approximation is the basic idea of Monte Carlo estimation.
- We know that $\bar{x} \to \mathbb{E}[X]$ as $N \to \infty$, so the estimate gradually approaches the true value.
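A quick numerical check of this convergence (a minimal sketch; the Gaussian distribution and the sample sizes below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 5.0  # E[X] for X ~ Normal(5, 2^2); an illustrative choice

for N in (10, 100, 10_000, 1_000_000):
    samples = rng.normal(loc=true_mean, scale=2.0, size=N)
    print(f"N = {N:>9,}  sample mean = {samples.mean():.4f}")
# The sample average drifts toward E[X] = 5.0 as N grows.
```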
Why do we care about mean estimation so much? Many values in RL, such as state and action values, are defined as means, and these means must be estimated from data.
Computing the mean incrementally

Can we compute the mean in an incremental and iterative manner? Then an estimate is available as soon as the first samples arrive, which is more efficient than waiting for all the data.

Suppose
$$w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i, \qquad w_k = \frac{1}{k-1}\sum_{i=1}^{k-1} x_i.$$

Then we can obtain
$$w_{k+1} = w_k - \frac{1}{k}\left(w_k - x_k\right).$$
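A minimal sketch of this incremental update in Python (the Gaussian data stream is an illustrative assumption; any iid samples would do):

```python
import numpy as np

rng = np.random.default_rng(0)

w = 0.0  # w_1: the initial value is irrelevant, since w_2 = x_1 regardless
for k, x_k in enumerate(rng.normal(loc=5.0, scale=2.0, size=10_000), start=1):
    w = w - (1.0 / k) * (w - x_k)  # w_{k+1} = w_k - (1/k)(w_k - x_k)
print(w)  # close to E[X] = 5.0, computed without storing past samples
```

Each sample is processed once and discarded, so an estimate is available at every step rather than only after all $N$ samples have been collected.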
6.2 Robbins-Monro algorithm
Stochastic approximation (SA)
- SA refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.
- Compared to many other root-finding algorithms such as gradient-based methods, SA is powerful in the sense that it does not require knowledge of the expression of the objective function or its derivative.
Problem statement
Suppose we would like to find the root of the equation
$$g(w) = 0,$$
where $w \in \mathbb{R}$ is the variable to be solved and $g: \mathbb{R} \to \mathbb{R}$ is a function.
- Many problems can eventually be converted to this root-finding problem. For example, suppose $J(w)$ is an objective function to be minimized. Then the optimization problem can be converted to
$$g(w) = \nabla_w J(w) = 0,$$
that is, setting the gradient to zero.
- Note that an equation like $g(w) = c$ with $c$ a constant can also be converted to the above form by treating $g(w) - c$ as a new function.
The RM algorithm

The problem to solve: find the root of $g(w) = 0$.

The Robbins-Monro (RM) algorithm can solve this problem:
$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \qquad k = 1, 2, 3, \dots$$
where
- $w_k$ is the $k$th estimate of the root,
- $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$th noisy observation,
- $a_k$ is a positive coefficient.

The function $g(w)$ is a black box! The algorithm relies only on data:
- Input sequence: $\{w_k\}$
- Noisy output sequence: $\{\tilde{g}(w_k, \eta_k)\}$
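A minimal sketch of the RM iteration, assuming a hypothetical black box $g(w) = \tanh(w - 1)$ (root $w^* = 1$) observed with additive Gaussian noise, and step sizes $a_k = 1/k$:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_g(w):
    # Noisy observation g~(w, eta) = g(w) + eta.
    # Both g(w) = tanh(w - 1) and the noise model are illustrative assumptions;
    # the algorithm itself never looks inside this function.
    return np.tanh(w - 1.0) + rng.normal(scale=0.1)

w = 3.0  # initial guess
for k in range(1, 10_001):
    a_k = 1.0 / k              # satisfies sum a_k = inf and sum a_k^2 < inf
    w = w - a_k * noisy_g(w)   # RM update: w_{k+1} = w_k - a_k * g~(w_k, eta_k)
print(w)  # approaches the root w* = 1
```

The step-size conditions $\sum_k a_k = \infty$ and $\sum_k a_k^2 < \infty$, which $a_k = 1/k$ satisfies, are part of the classic Robbins-Monro convergence conditions.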
Stochastic gradient descent (SGD) algorithms
Suppose we aim to solve the following optimization problem:
$$\min_{w} \; J(w) = \mathbb{E}[f(w, X)].$$

Method 1: gradient descent (GD)
$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)].$$
Drawback: the expected value is difficult to obtain, since the distribution of $X$ is usually unknown.

Method 2: batch gradient descent (BGD)
$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$
Drawback: it requires many samples in each iteration for each $w_k$.

Method 3: stochastic gradient descent (SGD)
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k).$$
- Compared to the gradient descent method: replace the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ by the stochastic gradient $\nabla_w f(w_k, x_k)$.
- Compared to the batch gradient descent method: let $n = 1$.
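A minimal sketch of SGD, assuming the mean-estimation objective $f(w, X) = \frac{1}{2}(w - X)^2$, so that the minimizer of $J(w)$ is $\mathbb{E}[X]$ and the stochastic gradient is $w - x_k$:

```python
import numpy as np

rng = np.random.default_rng(0)

w = 0.0  # initial estimate
for k in range(1, 10_001):
    x_k = rng.normal(loc=5.0, scale=2.0)  # one sample per iteration (n = 1)
    grad = w - x_k                        # stochastic gradient of 0.5*(w - x)^2
    w = w - (1.0 / k) * grad              # SGD update with alpha_k = 1/k
print(w)  # approaches the minimizer E[X] = 5.0
```

Note that with $\alpha_k = 1/k$ this is exactly the incremental mean-estimation update from Section 6.1, which can thus be viewed as a special case of SGD.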