• Entropy

    Entropy is the expected amount of information when an event is drawn from a distribution. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. Shannon entropy is given by:

    The expectation in this case is probability-weighted average on all possible k states.

  • Information Gain

    Information gain is indirectly proportional to the probability of the event. An event having a probability of occurrence 1 has no information at all (it is certain to happen). An event having 0 probability of occurrence contains the most information.

    Information gained when an unfair coin is tossed is less than information gained on a fair coin, since unfair coin’s outcome is the unfair side most of the times.

  • Surprise Factor

    The surprise factor is directly proportional to information gain. The most probable event will have the least surprise factor (eg: sun rises in the east). Least probable event will have the most surprise factor (eg: sudden death of xyz)

Cross Entropy

The expected amount of information gained when a scheme optimised for one distribution is applied to another distribution is quantified by cross-entropy.

Amount of information gained when you think I’m tossing a fair coin but secretly, I’m tossing an unfair coin is given by

On the other hand, amount of information gained when you think I’m tossing an unfair coin but secretly, I’m tossing a fair coin is given by

In any scenario , because whenever the unfair coin comes up with anything other than the unfair side, you’re pretty surprised. But when I toss the fair coin, it comes up something other than unfair side most of the time – so if you think I’m tossing the unfair coin but I’m not, you’re pretty surprised most of the time!

Kullback - Leiber divergence

The penalty charged when one optimization scheme is used on other distribution is quantified by KL divergence

In other words, the extra information gained when I toss a fair coin but you mistakenly believe I’m tossing an unfair coin than if I toss the fair coin and you correctly believe I’m doing so.

Facts about KL divergence

  • KL divergence is non-negative
  • KL divergence is 0 if and only if P and Q are of the same distribution (incase of discrete variables), or equal “almost everywhere” (incase of continuous variables)
  • It is often conceptualized as measuring distance between distributions, but it is not actually a distance measure since it doesn’t follow the triangle law (not commutative)

Usage in Machine Learning

  1. KL divergence is used in ML to measure the information loss in the fitted model relative to that in the reference model
  2. It is widely used in variational inference, where an optimization problem is constructed that aims at minimizing the KL-divergence between the intractable target distribution P and a sought element Q from a class of tractable distributions.