What happened at NIPS 2015 Deep Learning Session?
My notes:
Four key ingredients for ML towards AI:
- Lots of data
- Very flexible models
- Enough computing power
- Powerful priors that can defeat the curse of dimensionality (not shared by traditional kernel methods)
In deep learning we can use two other priors, aside from smoothness, to help:
- Distributed representations / embeddings (feature learning)
- Deep models
Concepts represented by patterns of activations.
Exponential advantage of distributed representations. With non-distributed representations, the complexity of the function grows exponentially with the number of features. If you apply distributed representations to image recognition, the activation units turn out to be meaningful and discover things they weren't trained for. This implies these networks discover features that are meaningful and can be learned independently of each other.
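A toy illustration of the counting argument, with a made-up number of features k (not a figure from the talk): k binary features in a distributed code can distinguish 2^k configurations, while a non-distributed one-of-k code distinguishes only k.

```python
import itertools

# k binary features (distributed) vs a one-of-k template code (non-distributed)
k = 4
distributed_codes = list(itertools.product([0, 1], repeat=k))              # 2**k patterns
one_hot_codes = [tuple(int(i == j) for j in range(k)) for i in range(k)]   # k patterns

print(len(distributed_codes))  # 16
print(len(one_hot_codes))      # 4
```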
Another view is that deep learning is automated feature discovery. People thought a single-hidden-layer net could represent anything, but there are functions that a shallow network needs an exponential number of units to represent.
Backpropagation: the system figures out how to compute the gradient with respect to all parameters in the system. Rectified linear units (ReLU) are not exactly differentiable, but close enough. Trained with stochastic gradient descent: compute the average gradient over a small batch of samples (a mini-batch).
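A minimal sketch of mini-batch SGD on a one-hidden-layer ReLU network; the toy data, layer sizes, learning rate and step count below are made up, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = np.sin(X[:, :1])                      # toy regression target

W1 = rng.normal(scale=0.1, size=(10, 32))
W2 = rng.normal(scale=0.1, size=(32, 1))
lr, batch = 0.01, 16

for step in range(200):
    idx = rng.choice(len(X), size=batch, replace=False)   # sample a mini-batch
    xb, yb = X[idx], y[idx]
    h = np.maximum(0, xb @ W1)            # ReLU: not differentiable at 0, close enough
    pred = h @ W2
    err = pred - yb
    # backpropagation: average the gradient over the mini-batch
    gW2 = h.T @ err / batch
    gh = err @ W2.T
    gh[h <= 0] = 0                        # ReLU sub-gradient
    gW1 = xb.T @ gh / batch
    W1 -= lr * gW1
    W2 -= lr * gW2
```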
Convolutional networks: applied when data comes as an array in which local variables are highly correlated, useful features can appear anywhere, and there is shift invariance. Convnet feature detectors can be replicated over a large image very easily.
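A minimal sketch of the weight sharing behind a convolution: one small kernel is slid across the whole image, so the same detector fires wherever the feature appears. The conv2d helper and the hand-made kernel below are mine, not from the talk.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: the same kernel weights are applied at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, -1.0],     # a hand-made vertical-edge detector
                        [1.0, -1.0]])
print(conv2d(image, edge_kernel).shape)  # (7, 7)
```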
Recurrent neural networks: used for learning dynamical systems. Can represent a fully connected directed generative model. Maximum likelihood = "teacher forcing". Long-term dependencies: the state at time t is a composition of many applications of the dynamical system across all previous times. The singular values of the Jacobians can make gradients explode or vanish; for stable learning the singular values need to be < 1. Gradient norm clipping: if the norm of the gradient is above a threshold, rescale it to the threshold. Special architecture: LSTM. Can use multiple timescales.
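A minimal sketch of gradient norm clipping as described above; the threshold value is arbitrary.

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """If the gradient norm exceeds the threshold, rescale it back to the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```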
Backpropagation in practice: large neural nets converge to one of many local minima, but most of them are equivalent in quality (convexity is not needed).
Stochastic neurons as a regulariser. Dropout trick: multiply each neuron's output by a random bit during training (p = 0.5); works very well on large networks. Batch normalisation: standardise activations across the mini-batch. Hyperparameter selection: the common approach is manual tuning plus grid search; random sampling of hyperparameters (random search) is simple and efficient; Gaussian processes can be used to approximate the response surface.
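A minimal sketch of the dropout trick with the standard test-time scaling; the function name and array shapes are my own.

```python
import numpy as np

def dropout(activations, p=0.5, train=True):
    """Training: multiply each unit by an independent random bit kept with prob. p.
    Test: keep all units and scale by p so expected activations match training."""
    if train:
        mask = (np.random.rand(*activations.shape) < p).astype(activations.dtype)
        return activations * mask
    return activations * p
```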
Applications: vision. Facebook handles 700 million photos per day; each goes through two convnets, one for object recognition and one for face recognition. Hardware companies are tuning hardware for convnets. Convnets for pedestrian detection. Application to 2D images: scene parsing/labelling (i.e. can you label a building as a building, etc.). Dimensionality reduction by learning an invariant mapping. Applications of RNNs: pose reconstruction; start with a graphical model and train the whole thing at once. Convnets can also produce images; there are variants with autoencoders that learn abstract features which then produce images.
Speech recognition: deep learning dramatically improved speech recognition. End-to-end training with search: neural nets + HMMs.
Natural language representations: Bengio considers language one of the most interesting areas from a deep learning point of view. The work began in the early 80s with Geoff Hinton's ideas. Train a neural net in which the first layer maps symbols into vectors (word embeddings or word vectors); the output layer is then a distribution over the full vocabulary. Neural word embeddings: visualisation shows that directions in the space correspond to learned attributes. Mikolov et al. (ICLR) showed you can do arithmetic with word embeddings (word2vec), e.g. king - queen ≈ man - woman and paris - france + italy ≈ rome. This means there are directions in the vector space that correspond to different semantic attributes (e.g. male/female).
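A toy sketch of the word-vector arithmetic above. The vocabulary and embeddings here are random placeholders, not trained word2vec vectors, so the nearest-neighbour lookup only returns the expected answer with real trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["paris", "france", "italy", "rome", "king", "queen"]
emb = {w: rng.normal(size=50) for w in vocab}   # placeholder embeddings

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding has highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude), key=lambda w: cos(emb[w], vec))

query = emb["paris"] - emb["france"] + emb["italy"]
print(nearest(query, exclude={"paris", "france", "italy"}))  # "rome" with trained embeddings
```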
To do machine translation, find some intermediate representation that works for any language and then map it out to the target language's representation.
How do humans generalise from very few examples? They transfer knowledge from previous learning: representations and explanatory factors. This requires representation learning. Prior: shared underlying explanatory factors, i.e. p(x) is related to p(y|x). Unsupervised and transfer learning challenge: do unsupervised learning on the data, then do supervised learning for the new task.
Multi-task learning: share the lower layers of the net (the underlying factors are common across tasks), then have more task-specific networks on top; this lets the model generalise very quickly. Google image search: joint embedding, sharing representations between multiple modalities.
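A minimal sketch of a shared lower layer feeding two task-specific heads; the layer sizes are made up and only the forward pass is shown (no training loop).

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(scale=0.1, size=(10, 64))    # shared lower layer
W_task_a = rng.normal(scale=0.1, size=(64, 3))     # task-specific head A
W_task_b = rng.normal(scale=0.1, size=(64, 5))     # task-specific head B

def forward(x):
    h = np.maximum(0, x @ W_shared)                # shared representation
    return h @ W_task_a, h @ W_task_b              # per-task outputs

out_a, out_b = forward(rng.normal(size=(4, 10)))
print(out_a.shape, out_b.shape)                    # (4, 3) (4, 5)
```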
Unsupervised representation learning. Potential benefits:
- Exploit tons of unlabelled data
- Answer new questions about the variables
- Regularizer
- Transfer learning
- Domain adaptation
- Joint (structured) outputs
Why latent factors? Causality. Depending on the direction of causality between x and y, you're in for a good time or you're in for trouble. Example: consider a mixture of three Gaussians: just observing the density of x reveals the cause y (the cluster ID).
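A small sketch of the mixture-of-three-Gaussians example; the means and noise scale are made up. The marginal histogram of x alone shows three modes, one per latent cluster.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([-5.0, 0.0, 5.0])
y = rng.integers(0, 3, size=10000)           # latent cause: cluster ID
x = rng.normal(loc=means[y], scale=0.5)      # observed variable

hist, edges = np.histogram(x, bins=60, density=True)
# hist has three well-separated modes, so the x-density alone reveals the clusters
```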
Invariant features - which invariances? Learning to keep all explanatory factors in the representation. Good disentangling -> avoid curse of dimensionality.
Unsupervised neural nets: Boltzmann machines, where the probability is a normalized exponential of an 'energy'. In order to sample from these you have to use an iterative sampling scheme: stochastic relaxation, MCMC. Predictive sparse decomposition.
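A tiny numerical illustration of the "normalized exponential of energy" over a handful of made-up discrete states.

```python
import numpy as np

# p(x) = exp(-E(x)) / Z, with Z = sum over states of exp(-E(x))
energies = np.array([1.0, 0.5, 2.0, 0.1])   # arbitrary made-up energies E(x)
p = np.exp(-energies)
p /= p.sum()                                # divide by the partition function Z
print(p, p.sum())                           # a valid distribution summing to 1
```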
Probabilistic interpretation of auto-encoders: there are manifold and probabilistic interpretations. With particular ways of training auto-encoders you actually capture the data distribution: the model estimates the derivative of the log density with respect to the input, i.e. it learns a vector field that points from corrupted data back towards the manifold. If you run an encode-decode-encode-decode Markov chain, its stationary distribution is an estimate of the data distribution.
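A rough sketch of that encode-decode Markov chain. The encode/decode maps here are untrained placeholder linear layers of my own, just to show the structure of the chain; with a trained denoising autoencoder the samples would approximate the data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(20, 5))     # placeholder (untrained) autoencoder weights

def encode(x):
    return np.maximum(0, x @ W)

def decode(h):
    return h @ W.T

x = rng.normal(size=20)
for _ in range(100):
    x_corrupted = x + rng.normal(scale=0.1, size=x.shape)   # corruption step
    x = decode(encode(x_corrupted))                          # reconstruction step
# collecting x along the chain gives samples; with a trained denoising
# autoencoder their stationary distribution estimates the data distribution
```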
More recently: Helmholtz machines and variational auto-encoders (VAEs). Simultaneously train an encoder and a decoder and inject noise at all levels to obtain, as the objective function, a lower bound on the log-likelihood.
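A minimal sketch of that lower bound for a diagonal-Gaussian encoder and a unit-variance Gaussian decoder; the elbo function and placeholder inputs are mine, not from the talk.

```python
import numpy as np

def elbo(x, x_recon, mu, logvar):
    """Lower bound on log-likelihood: reconstruction term minus KL to the prior."""
    # log p(x | z) up to a constant, for a unit-variance Gaussian decoder
    recon = -0.5 * np.sum((x - x_recon) ** 2)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))
    return recon - kl

x = np.random.rand(20)
print(elbo(x, x_recon=np.zeros(20), mu=np.zeros(4), logvar=np.zeros(4)))
```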
Final remarks. There are unsolved problems whose solutions would bring about a leap in what we can do in ML. Example problem: show a segment of video and ask the model to draw the next frame; neural nets don't work well here, because the world is unpredictable and you can't just use the observations. Deep learning should let us generalise to higher-level abstractions. Open questions: how do you evaluate unsupervised learning? Natural language understanding and reasoning? Bridging the gap to biology. Deep reinforcement learning.