Table of Contents
- Predicting Structured Data
- 2006
- paper by Yann LeCun et al.
- http://yann.lecun.com/exdb/publis/orig/lecun-06.pdf
- A tutorial on energy-based learning, which provides a common theoretical framework for many models.
1 Introduction: Energy-Based Models
- assigns a scalar energy to each configuration of the variables
- inference: clamp the observed variables, find values of the remaining variables that minimize the energy (see the sketch after this list)
- learning: find an energy function that gives low energies to correct values and higher energies to incorrect ones
- loss functional: measures the quality of candidate energy functions
- covers both probabilistic and non-probabilistic approaches
- no normalization constant required, so more flexibility in designing the energy function
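
A minimal sketch of the inference step, assuming a discrete answer set and a toy energy function (both `toy_energy` and the label set are made up for illustration):

```python
import numpy as np

def infer(energy, x, labels):
    """Energy-based inference: clamp the observed X, return the
    label Y with the lowest energy E(Y, X)."""
    energies = np.array([energy(y, x) for y in labels])
    return labels[int(np.argmin(energies))]

# Hypothetical energy: low when the label agrees with the sign
# of the input's mean.
def toy_energy(y, x):
    return -y * np.mean(x)

x = np.array([0.3, 1.2, -0.1])
print(infer(toy_energy, x, labels=[-1, +1]))  # -> 1 (lower energy)
```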
1.1 Energy-Based Inference
- example: image pixels \(X\) -> object label \(Y\)
- the energy \(E(Y,X)\) is also called a contrast function, value function, or negative log-likelihood function
- \(Y\) and \(X\) can be discrete or continuous, of any dimension or structure
- inference can use whatever optimization technique suits \(\mathcal Y\): exhaustive search, gradient-based methods, dynamic programming, etc. (sketch below)
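
When \(Y\) is continuous, one of those optimization techniques is plain gradient descent on the energy; a sketch under a hypothetical quadratic energy \(E(Y,X)=\|Y-AX\|^2\):

```python
import numpy as np

def infer_continuous(grad_E, x, y0, lr=0.1, steps=100):
    """Minimize E(y, x) over a continuous y by gradient descent."""
    y = y0.astype(float).copy()
    for _ in range(steps):
        y -= lr * grad_E(y, x)
    return y

# Hypothetical energy E(y, x) = ||y - A x||^2; its gradient in y
# is 2 (y - A x), so the minimizer is y* = A x.
A = np.array([[1.0, 0.5], [0.0, 2.0]])
grad_E = lambda y, x: 2.0 * (y - A @ x)

x = np.array([1.0, -1.0])
print(infer_continuous(grad_E, x, y0=np.zeros(2)))  # ~ A @ x = [0.5, -2.0]
```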
1.2 What Questions Can a Model Answer?
- "What is the Y that is most compatible with this X?" -> prediction, classification, decision-making
- ranking (compare the energies of two answers), detection (compare an answer's energy to a threshold), conditional probabilities (needed when the output is passed to a human or to another system); a sketch follows this list
- \(X\) high-dimensional, \(Y\) low-dimensional: the common case; the converse: image restoration, computer graphics, generation; both high-dimensional: hard! e.g. super-resolution
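
Ranking and detection reduce to simple comparisons of energies; a minimal sketch (the `energy` argument and `threshold` value are whatever the model and application provide):

```python
def rank(energy, x, y1, y2):
    """Ranking: return the answer more compatible with x,
    i.e. the one with lower energy."""
    return y1 if energy(y1, x) < energy(y2, x) else y2

def detect(energy, x, y, threshold):
    """Detection: accept y only if its energy falls below a
    calibrated threshold."""
    return energy(y, x) < threshold
```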
1.3 Decision Making versus Probabilistic Modeling
- energies are uncalibrated and not commensurate across models; to obtain probabilities, use the Gibbs distribution \(P(Y|X) = \frac{e^{-\beta E(Y,X)}}{\int_{y\in\mathcal Y} e^{-\beta E(y,X)}}\), with \(\beta\) an inverse temperature and the denominator the partition function (terms borrowed from statistical physics)
- the integral must converge, which restricts the energy functions that can be used; the partition function is often intractable to compute
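
A sketch of converting energies to probabilities with the Gibbs distribution, in the easy case where \(\mathcal Y\) is a small discrete set so the partition function is a tractable sum:

```python
import numpy as np

def gibbs(energies, beta=1.0):
    """P(y|x) = exp(-beta * E(y,x)) / Z over a discrete answer set,
    where Z (the partition function) is the normalizing sum."""
    logits = -beta * np.asarray(energies, dtype=float)
    logits -= logits.max()        # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()            # divide by the partition function Z

print(gibbs([0.1, 1.0, 3.0]))            # low energy -> high probability
print(gibbs([0.1, 1.0, 3.0], beta=10.0)) # larger beta -> sharper distribution
```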
2 Energy-Based Training: Architecture and Loss Function
- a family of energy functions indexed by a parameter \(W\)
- architecture, the internal structure of the parameterized function \(E(W, Y, X)\)
- e.g. for real-vector inputs, a linear combination of basis functions (as in kernel methods)
- or a neural network
- training samples \(\mathcal S\) plus prior knowledge select the best energy function via a loss functional \(\mathcal L(E,\mathcal S)\); since the family is indexed by \(W\), this becomes a loss function \(\mathcal L(W,\mathcal S)\) (written out in the sketch after this list)
- \(W^* = \arg\min_{W\in \mathcal W}\mathcal L(W,\mathcal S)\)
- \(\mathcal L(E,\mathcal S) = \frac 1P\sum_{i=1}^P L(Y^i, E(W,\mathcal Y, X^i)) + R(W)\)
- \(Y^i\): the desired answer for sample \(i\), fixed; \(\mathcal Y\): the set of answers over which the energy is evaluated as \(Y\) varies
- \(R(W)\): regularizer, encodes prior knowledge about which energy functions are preferable
- standard results from statistical learning theory apply
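
The loss functional above, written out for a discrete answer set; everything here (the linear energy, the regularizer, the data) is a hypothetical stand-in:

```python
import numpy as np

def loss_functional(W, samples, answers, energy, per_sample_loss, R):
    """L(W, S) = (1/P) * sum_i L(Y^i, E(W, ., X^i)) + R(W)."""
    total = 0.0
    for x, y_true in samples:
        # The per-sample loss may inspect the energies of *all*
        # candidate answers, not just the desired one.
        energies = {y: energy(W, y, x) for y in answers}
        total += per_sample_loss(y_true, energies)
    return total / len(samples) + R(W)

# Hypothetical pieces: a linear energy, the "energy loss" of
# section 2.2 (just reads off E(W, Y^i, X^i)), an L2 regularizer.
energy = lambda W, y, x: -y * float(W @ x)
energy_loss = lambda y_true, energies: energies[y_true]
R = lambda W: 0.01 * float(W @ W)

W = np.array([0.5, -0.2])
samples = [(np.array([1.0, 2.0]), +1), (np.array([-1.0, 0.5]), -1)]
print(loss_functional(W, samples, [-1, +1], energy, energy_loss, R))
```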
2.1 Designing a Loss Functional
- the loss shapes the energy surface: push down the energy of correct answers, pull up the energies of incorrect ones (sketch after this list)
- four elements: the architecture (model), the loss function, the learning algorithm (these three are shared with conventional ML), plus the inference algorithm
- prior knowledge enters through the architecture and through the loss function (e.g. the regularizer)
- a loss must be both effective (produces the desired energy surface) and efficient (tractable to minimize)
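
A sketch of one push/pull update, perceptron-style: push down the energy of the correct answer and pull up the energy of the model's current best guess (the linear energy is a hypothetical stand-in):

```python
import numpy as np

def push_pull_step(W, x, y_true, answers, energy, grad_W, lr=0.1):
    """One gradient step that lowers E(W, y_true, x) and raises the
    energy of the most offending incorrect answer."""
    y_hat = min(answers, key=lambda y: energy(W, y, x))  # inference
    if y_hat == y_true:
        return W  # nothing to pull up
    return W - lr * (grad_W(W, y_true, x) - grad_W(W, y_hat, x))

# Hypothetical linear energy E(W, y, x) = -y * W.x and its W-gradient.
energy = lambda W, y, x: -y * float(W @ x)
grad_W = lambda W, y, x: -y * x

W = np.zeros(2)
W = push_pull_step(W, np.array([1.0, 2.0]), +1, [-1, +1], energy, grad_W)
print(W)  # [0.2, 0.4]: energy of +1 pushed down, of -1 pulled up
```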
2.2 Examples of Loss Functions
- concentrate on the data-dependent part \(L(Y^i, E(W,\mathcal Y, X^i))\)
- discuss which standard loss functions are 'good' and which are 'bad'
- energy loss: \(L_{\text{energy}}(Y^i, E(W,\mathcal Y, X^i)) = E(W, Y^i, X^i)\); it only pushes down on the correct answer and never pulls up on incorrect ones, so the energy surface can collapse (e.g. become zero everywhere)
- it works only when the architecture pulls up other energies automatically, e.g. \(E(W, Y, X)=\|Y-G(W,X)\|^2\): lowering the energy at \(Y^i\) raises it elsewhere; inference is trivial (\(Y^* = G(W,X)\)) and the energy loss reduces to MSE (sketch below)
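
A minimal sketch of the case where the energy loss does work: \(E(W,Y,X)=\|Y-G(W,X)\|^2\) with a hypothetical linear \(G\), so minimizing the energy loss is plain MSE regression and inference is just \(Y^*=G(W,X)\):

```python
import numpy as np

# G(W, x) = W @ x, so E(W, y, x) = (y - W.x)^2 and the energy loss
# over the training set is the mean squared error.
def energy_loss_grad(W, x, y):
    """Gradient in W of E(W, y, x) = (y - W.x)^2."""
    return -2.0 * (y - W @ x) * x

rng = np.random.default_rng(0)
W_true = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
Y = X @ W_true

W = np.zeros(2)
for x, y in zip(X, Y):              # SGD on the energy loss (= MSE)
    W -= 0.05 * energy_loss_grad(W, x, y)
print(W)                            # approaches W_true = [2, -1]

x_new = np.array([1.0, 1.0])
print(W @ x_new)                    # inference is trivial: Y* = G(W, x_new)
```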