Deep Learning Flower Book 2
Probability and Information Theory
Overview
Probability theory gives us the tools to handle the areas we are uncertain about or unfamiliar with.
Uncertainty
- Inherent stochasticity in the system being modeled.
- Incomplete observability.
- Incomplete modeling.
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.
- Frequentist Probability is directly related to the rates at which repeatable events occur.
- Bayesian Probability is related to qualitative levels of certainty.
Expectation, Variance and Covariance
- Expectation: the average value that a function f(x) takes on when x is drawn from its probability distribution P(x).
- Variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution.
- Covariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables. (The standard formulas for all three are sketched after this list.)
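In the book's notation, for the continuous case (a sum over x replaces the integral for discrete variables):

```latex
% Expectation of f(x) when x is drawn from p(x)
\mathbb{E}_{x \sim p}[f(x)] = \int p(x)\, f(x)\, dx

% Variance: expected squared deviation of f(x) from its mean
\operatorname{Var}(f(x)) = \mathbb{E}\!\left[\big(f(x) - \mathbb{E}[f(x)]\big)^{2}\right]

% Covariance of two functions of random variables
\operatorname{Cov}(f(x), g(y)) = \mathbb{E}\!\left[\big(f(x) - \mathbb{E}[f(x)]\big)\big(g(y) - \mathbb{E}[g(y)]\big)\right]
```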
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value when the other takes on a relatively low value, and vice versa.
The covariance matrix:
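For a random vector x in R^n it is an n×n matrix, and its diagonal elements give the variances:

```latex
\operatorname{Cov}(\mathbf{x})_{i,j} = \operatorname{Cov}(x_i, x_j),
\qquad
\operatorname{Cov}(\mathbf{x})_{i,i} = \operatorname{Var}(x_i)
```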
Gaussian Distribution (Normal Distribution)
The normal distribution is a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons:
- The central limit theorem shows that the sum of many independent random variables is approximately normally distributed.
- Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. This means the normal distribution inserts the least amount of prior knowledge into a model.
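For reference, the density of the univariate normal distribution, controlled by the mean μ and the variance σ²:

```latex
\mathcal{N}(x; \mu, \sigma^{2}) =
  \sqrt{\frac{1}{2\pi\sigma^{2}}}\,
  \exp\!\left(-\frac{1}{2\sigma^{2}}\,(x - \mu)^{2}\right)
```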
Multivariate Normal Distribution
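The density generalizes to R^n with a mean vector μ and a positive definite covariance matrix Σ:

```latex
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) =
  \sqrt{\frac{1}{(2\pi)^{n} \det(\boldsymbol{\Sigma})}}\,
  \exp\!\left(-\frac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^{\top}
              \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)
```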
Mixtures of Distributions
I am confused here...
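For reference, a mixture distribution combines several component distributions; on each trial a latent categorical variable c picks which component generates the sample:

```latex
P(x) = \sum_{i} P(c = i)\, P(x \mid c = i)
```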
Useful Properties of Common Functions
Logistic sigmoid function: σ(x) = 1 / (1 + exp(−x))
This is commonly used to produce the φ parameter of a Bernoulli distribution, because its range is (0, 1).
Softplus function: ζ(x) = log(1 + exp(x))
It can be useful for producing the β or σ parameter of a normal distribution, because its range is (0, ∞).
Useful properties:
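A few of the identities the book lists (σ is the logistic sigmoid, ζ the softplus):

```latex
\sigma(x) = \frac{\exp(x)}{\exp(x) + \exp(0)}
\qquad
\frac{d}{dx}\sigma(x) = \sigma(x)\,\big(1 - \sigma(x)\big)
\qquad
1 - \sigma(x) = \sigma(-x)

\frac{d}{dx}\zeta(x) = \sigma(x)
\qquad
\zeta(x) = \int_{-\infty}^{x} \sigma(y)\, dy
\qquad
\zeta(x) - \zeta(-x) = x

% Inverses: the logit and the inverse softplus
\sigma^{-1}(x) = \log\!\left(\frac{x}{1 - x}\right), \quad x \in (0, 1)
\qquad
\zeta^{-1}(x) = \log\!\big(\exp(x) - 1\big), \quad x > 0
```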
Information Theory
This is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.
The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred.
- Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
- Less likely events should have higher information content.
- Independent events should have additive information.
To satisfy all three of these properties, we define the self-information of an event:
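In the book's notation (using the natural logarithm, so the unit is nats):

```latex
I(x) = -\log P(x)
```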
We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy.
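The Shannon entropy is the expected self-information of an event drawn from the distribution:

```latex
H(x) = \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)]
```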
If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the KL divergence:
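In the book's notation (it is non-negative, zero only when P and Q are the same distribution, and not symmetric in P and Q):

```latex
D_{\mathrm{KL}}(P \,\|\, Q)
  = \mathbb{E}_{x \sim P}\!\left[\log \frac{P(x)}{Q(x)}\right]
  = \mathbb{E}_{x \sim P}\big[\log P(x) - \log Q(x)\big]
```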
In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.
Cross-entropy: closely related to the KL divergence; minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not appear in the omitted term.
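Both forms from the book:

```latex
H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q)
        = -\mathbb{E}_{x \sim P}[\log Q(x)]
```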
Structured Probabilistic Models
Example: suppose that a influences b and b influences c, but a and c are independent given b. Then the joint distribution factorizes as sketched below.
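A minimal sketch of the factorization, in the book's notation:

```latex
p(a, b, c) = p(a)\, p(b \mid a)\, p(c \mid b)
```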
Directed Graph:
A directed model contains one factor for every random variable in the distribution, and that factor is the conditional distribution over the variable given its parents in the graph, denoted Pa_G(x_i).
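The joint distribution is then the product of these conditional factors:

```latex
p(\mathbf{x}) = \prod_{i} p\!\left(x_i \mid \mathrm{Pa}_{\mathcal{G}}(x_i)\right)
```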
Undirected Graph:
Still confused here.
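For reference, the book's undirected factorization uses a non-negative potential function φ^(i) for each clique C^(i) of the graph, together with a normalizing constant Z (the partition function):

```latex
p(\mathbf{x}) = \frac{1}{Z} \prod_{i} \phi^{(i)}\!\left(\mathcal{C}^{(i)}\right)
```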