### Background

• ###### Entropy

Entropy is the expected amount of information when an event is drawn from a distribution. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. Shannon entropy is given by: The expectation in this case is probability-weighted average on all possible k states.

• ###### Information Gain

Information gain is indirectly proportional to the probability of the event. An event having a probability of occurrence 1 has no information at all (it is certain to happen). An event having 0 probability of occurrence contains the most information. Information gained when an unfair coin is tossed is less than information gained on a fair coin, since unfair coin’s outcome is the unfair side most of the times.

• ###### Surprise Factor

The surprise factor is directly proportional to information gain. The most probable event will have the least surprise factor (eg: sun rises in the east). Least probable event will have the most surprise factor (eg: sudden death of xyz)

### Cross Entropy

The expected amount of information gained when a scheme optimised for one distribution is applied to another distribution is quantified by cross-entropy.

Amount of information gained when you think I’m tossing a fair coin but secretly, I’m tossing an unfair coin is given by On the other hand, amount of information gained when you think I’m tossing an unfair coin but secretly, I’m tossing a fair coin is given by In any scenario , because whenever the unfair coin comes up with anything other than the unfair side, you’re pretty surprised. But when I toss the fair coin, it comes up something other than unfair side most of the time – so if you think I’m tossing the unfair coin but I’m not, you’re pretty surprised most of the time!

### Kullback - Leiber divergence

The penalty charged when one optimization scheme is used on other distribution is quantified by KL divergence In other words, the extra information gained when I toss a fair coin but you mistakenly believe I’m tossing an unfair coin than if I toss the fair coin and you correctly believe I’m doing so.

2. It is widely used in variational inference, where an optimization problem is constructed that aims at minimizing the KL-divergence between the intractable target distribution P and a sought element Q from a class of tractable distributions.