

6.1 Motivating examples

Mean Estimation

Revisit the mean estimation problem:

  • Consider a random variable \(X\).
  • Our aim is to estimate \(\mathbb{E}[X]\).
  • Suppose that we collected a sequence of iid samples \(\left\{x_i\right\}_{i=1}^N\).
  • The expectation of \(X\) can be approximated by

\[\mathbb{E}[X] \approx \bar{x}:=\frac{1}{N} \sum_{i=1}^N x_i . \]


We already know from the last lecture:

  • This approximation is the basic idea of Monte Carlo estimation.
  • We know that \(\bar{x} \rightarrow \mathbb{E}[X]\) as \(N \rightarrow \infty\); that is, \(\bar{x}\) gradually approaches the true value.

Why do we care about mean estimation so much?

  • Many values in RL, such as state/action values, are defined as means. These means must be estimated from data.


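Can the mean be computed in an incremental and iterative manner? That is, can we update the estimate with however many samples have arrived so far, instead of waiting until all of them are collected? Doing so is more efficient.

Define

\[w_{k+1}=\frac{1}{k} \sum_{i=1}^k x_i, \quad k=1,2, \ldots \]

so that \(w_k=\frac{1}{k-1} \sum_{i=1}^{k-1} x_i\) for \(k \geq 2\). Then \(w_{k+1}\) can be expressed in terms of \(w_k\):

\[w_{k+1}=\frac{1}{k}\left(\sum_{i=1}^{k-1} x_i+x_k\right)=\frac{1}{k}\left((k-1) w_k+x_k\right)=w_k-\frac{1}{k}\left(w_k-x_k\right) \]

Since \(w_{k+1}\) is the average of the first \(k\) samples, we have

\[w_k \rightarrow \mathbb{E}[X] \text { as } k \rightarrow \infty \]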
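The incremental update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `incremental_mean` and the Gaussian test data are assumptions made for the example.

```python
import random

def incremental_mean(samples):
    """Yield w_2, w_3, ... using w_{k+1} = w_k - (1/k) * (w_k - x_k)."""
    w = 0.0  # w_1 is arbitrary: the first update (k = 1) overwrites it with x_1
    for k, x in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x)
        yield w

random.seed(0)
# Hypothetical data: 10,000 iid samples of X ~ N(5, 2^2), so E[X] = 5
samples = [random.gauss(5.0, 2.0) for _ in range(10_000)]
estimates = list(incremental_mean(samples))

# The last incremental estimate should match the ordinary batch average
batch_mean = sum(samples) / len(samples)
print(estimates[-1], batch_mean)
```

Each step needs only the previous estimate and the newest sample, so an estimate is available as soon as the first sample arrives and no sample has to be stored.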

6.2 Robbins-Monro algorithm

Stochastic approximation (SA)

  • SA refers to a broad class of stochastic iterative algorithms for solving root-finding or optimization problems.
  • Compared to many other root-finding algorithms, such as gradient-based methods, SA is powerful in the sense that it requires neither the expression of the objective function nor its derivative.

Problem statement

Suppose we would like to find the root of the equation

\[g(w)=0, \]

where \(w \in \mathbb{R}\) is the variable to be solved and \(g: \mathbb{R} \rightarrow \mathbb{R}\) is a function.

  • Many problems can eventually be converted to this root-finding problem. For example, suppose \(J(w)\) is an objective function to be minimized. Then, the optimization problem can be converted to

\[g(w)=\nabla_w J(w)=0 \]


  • Note that an equation like \(g(w)=c\) with \(c\) as a constant can also be converted to the above equation by rewriting \(g(w)-c\) as a new function.
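
As a small worked example of this conversion (an illustration chosen here, assuming the gradient and the expectation can be interchanged), take \(J(w)=\frac{1}{2} \mathbb{E}\left[(w-X)^2\right]\). Then

\[g(w)=\nabla_w J(w)=\mathbb{E}[w-X]=w-\mathbb{E}[X], \]

so the root of \(g(w)=0\) is \(w^*=\mathbb{E}[X]\). In this case, root finding is the mean estimation problem again.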



The Robbins-Monro (RM) algorithm can solve this problem:

\[w_{k+1}=w_k-a_k \tilde{g}\left(w_k, \eta_k\right), \quad k=1,2,3, \ldots \]


  • \(w_k\) is the \(k\)th estimate of the root.
  • \(\tilde{g}\left(w_k, \eta_k\right)=g\left(w_k\right)+\eta_k\) is the \(k\)th noisy observation.
  • \(a_k\) is a positive coefficient.

The function \(g(w)\) is a black box! This algorithm relies on data:
  • Input sequence: \(\left\{w_k\right\}\)
  • Noisy output sequence: \(\left\{\tilde{g}\left(w_k, \eta_k\right)\right\}\)
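
The RM iteration can be sketched as follows. This is a minimal illustration under stated assumptions: the black box \(g(w)=\tanh(w-1)\) (true root \(w=1\)), the Gaussian noise, and the step size \(a_k=1/k\) are all choices made for this example, and the names `robbins_monro` and `noisy_g` are hypothetical.

```python
import math
import random

def robbins_monro(noisy_g, w1, num_iters):
    """Run w_{k+1} = w_k - a_k * g~(w_k, eta_k) with the assumed step size a_k = 1/k."""
    w = w1
    for k in range(1, num_iters + 1):
        a_k = 1.0 / k
        w = w - a_k * noisy_g(w)  # the algorithm only sees the noisy output
    return w

random.seed(0)

def noisy_g(w):
    # Hypothetical black box: g(w) = tanh(w - 1) plus zero-mean noise eta_k
    return math.tanh(w - 1.0) + random.gauss(0.0, 0.1)

root_estimate = robbins_monro(noisy_g, w1=3.0, num_iters=5000)
print(root_estimate)  # should be close to the true root w = 1
```

Note that `robbins_monro` never inspects the expression of \(g\) or its derivative; it only feeds in \(w_k\) and reads back a noisy output, which matches the input/output view above.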

Stochastic gradient descent (SGD) algorithms

Suppose we aim to solve the following optimization problem:

\[\min _w \quad J(w)=\mathbb{E}[f(w, X)] \]

Method 1: gradient descent (GD)

\[w_{k+1}=w_k-\alpha_k \nabla_w \mathbb{E}\left[f\left(w_k, X\right)\right]=w_k-\alpha_k \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \]

Drawback: the expected value is difficult to obtain.

Method 2: batch gradient descent (BGD)

\[\begin{gathered} \mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) \\ w_{k+1}=w_k-\alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f\left(w_k, x_i\right) \end{gathered} \]

Drawback: it requires many samples in each iteration for each \(w_k\).

Method 3: stochastic gradient descent (SGD)

\[w_{k+1}=w_k-\alpha_k \nabla_w f\left(w_k, x_k\right) \]

  • Compared to the gradient descent method: Replace the true gradient \(\mathbb{E}\left[\nabla_w f\left(w_k, X\right)\right]\) by the stochastic gradient \(\nabla_w f\left(w_k, x_k\right)\).
  • Compared to the batch gradient descent method: let \(n=1\).
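
The difference between BGD and SGD can be sketched on a toy problem. This is an illustration only: the choice \(f(w, x)=\frac{1}{2}(w-x)^2\), whose minimizer of \(J(w)\) is \(\mathbb{E}[X]\), the step sizes, and the function names are assumptions made for the example.

```python
import random

def grad_f(w, x):
    # Assumed example: f(w, x) = 0.5 * (w - x)^2, so grad_w f(w, x) = w - x
    return w - x

def bgd(samples, w1, num_iters, alpha):
    """Batch gradient descent: average the gradient over all n samples at every step."""
    w = w1
    for _ in range(num_iters):
        avg_grad = sum(grad_f(w, x) for x in samples) / len(samples)
        w = w - alpha * avg_grad
    return w

def sgd(samples, w1):
    """Stochastic gradient descent: one sample x_k per step, with alpha_k = 1/k."""
    w = w1
    for k, x_k in enumerate(samples, start=1):
        w = w - (1.0 / k) * grad_f(w, x_k)
    return w

random.seed(0)
# Hypothetical data: iid samples of X ~ N(5, 2^2), so the minimizer is E[X] = 5
samples = [random.gauss(5.0, 2.0) for _ in range(2000)]
w_bgd = bgd(samples, w1=0.0, num_iters=50, alpha=0.5)
w_sgd = sgd(samples, w1=0.0)
print(w_bgd, w_sgd)
```

BGD touches all \(n\) samples in every iteration, while SGD uses each sample once. For this particular \(f\) and \(\alpha_k=1/k\), the SGD update reduces to \(w_{k+1}=w_k-\frac{1}{k}\left(w_k-x_k\right)\), the incremental mean estimation algorithm from Section 6.1.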
