1 Introduction

  • 2006: Deep Belief Networks, trained by layer-wise RBM pre-training followed by gradient-descent fine-tuning
  • energy-based model; an MRF whose graph encodes the dependency structure between random variables
  • used for classification and representation learning

2 Classical Restricted Boltzmann Machines

  • a generative stochastic network: a layer of visible units, a layer of hidden units, and the parameters coupling them
  • visible vector \(v=(v_1,\dots,v_D)\) with units \(v_i\); hidden vector \(h=(h_1,\dots,h_J)\) with units \(h_j\); data distribution \(P_{data}(v)\), model distribution \(P_{model}(v;\theta)\)
    • \(\theta = \{W,b,c\}\): weights, hidden biases, visible biases
    • \(E(v,h;\theta) = -\sum_i\sum_j v_iW_{ij}h_j - \sum_jb_jh_j - \sum_ic_iv_i\): \(W_{ij}\) couples the \(i\)-th visible unit to the \(j\)-th hidden unit
    • \(P(v,h;\theta) = \frac{\exp(-E(v,h;\theta))}{Z(\theta)}\), \(P(v;\theta) = \sum_h P(v,h;\theta)\), with partition function \(Z(\theta) = \sum_v\sum_h \exp(-E(v,h;\theta))\)
    • e.g. what is the probability that \(v_i\) is activated (\(v_i=1\)) given \(h\)?
      • \(P(v_i = 1|h;\theta) = \frac{\sum_{v:v_i=1} \exp(\sum_i\sum_jv_iW_{ij}h_j+\sum_jb_jh_j+\sum_ic_iv_i)}{\sum_v \exp(\sum_i\sum_jv_iW_{ij}h_j+\sum_jb_jh_j+\sum_ic_iv_i)}\); the \(\sum_jb_jh_j\) term is common to numerator and denominator and cancels
      • PoE (Products of Experts): the sum over \(v\) factorizes into a product over visible units
        in fact, let \(D=2\) (two visible units) and drop the \(\sum_jb_jh_j\) term, which cancels between numerator and denominator; summing over \(v\in\{(0,0),(1,0),(0,1),(1,1)\}\):
        \(\sum_v \exp(\sum_i\sum_jv_iW_{ij}h_j+\sum_i c_i v_i)\)
        \(=1 + \exp(\sum_j W_{1j}h_j + c_1) + \exp(\sum_j W_{2j}h_j + c_2) + \exp(\sum_j W_{1j}h_j + c_1+\sum_j W_{2j}h_j + c_2)\)
        \(=(1+\exp(\sum_j W_{1j} h_j + c_1))(1+\exp(\sum_j W_{2j}h_j+c_2))\)
      • applying the same factorization to the numerator (where \(v_i\) is fixed at 1), all factors except the \(i\)-th cancel, giving \(P(v_i=1|h)=\sigma(\sum_j W_{ij}h_j+c_i)\), where \(\sigma(x)=\frac{1}{1+e^{-x}}\)
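The derivation above can be checked numerically: enumerate all visible configurations, compute the conditional by brute force, and compare it against the sigmoid form. A minimal sketch (the dimensions and random parameters below are arbitrary choices, not from the text):

```python
import numpy as np

# Brute-force check that the RBM conditional P(v_i = 1 | h)
# equals sigmoid(sum_j W[i, j] h[j] + c[i]).
rng = np.random.default_rng(0)
D, J = 2, 3                      # visible and hidden dimensions (assumed)
W = rng.normal(size=(D, J))      # weights W_ij
b = rng.normal(size=J)           # hidden biases b_j
c = rng.normal(size=D)           # visible biases c_i
h = rng.integers(0, 2, size=J)   # one fixed binary hidden configuration

def unnorm(v):
    # exp(-E(v, h)) with E = -(v W h + b.h + c.v)
    return np.exp(v @ W @ h + b @ h + c @ v)

# enumerate all 2^D binary visible vectors
vs = [np.array([(k >> i) & 1 for i in range(D)]) for k in range(2 ** D)]
Z_h = sum(unnorm(v) for v in vs)  # denominator for this fixed h

for i in range(D):
    p_brute = sum(unnorm(v) for v in vs if v[i] == 1) / Z_h
    p_sigmoid = 1.0 / (1.0 + np.exp(-(W[i] @ h + c[i])))
    assert np.isclose(p_brute, p_sigmoid)
```

Note that the \(b\cdot h\) term appears in every `unnorm(v)` and therefore cancels in the ratio, exactly as in the derivation.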