(Part of Math and Wellness Month.)
Say you have a biased coin that comes up Heads 80% of the time. (I like to imagine that the Heads side has a portrait of Bernoulli.) Flip it 100 times. The naïve way to report the outcome—just report the sequence of Headses and Tailses—costs 100 bits. But maybe you don't have 100 bits. What to do?
One thing to notice is that because it was a biased coin, some bit sequences are vastly more probable than others: "all Tails" has probability 0.2¹⁰⁰ ≈ 1.268 · 10⁻⁷⁰, whereas "all Heads" has probability 0.8¹⁰⁰ ≈ 2.037 · 10⁻¹⁰, differing by sixty orders of magnitude!
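You can check those magnitudes yourself (a quick sketch in Python; the numbers are just 0.2 and 0.8 raised to the 100th power):

```python
import math

# Probability of each of the two extreme sequences from a
# Heads-probability-0.8 coin flipped 100 times.
p_all_tails = 0.2 ** 100  # ≈ 1.27e-70
p_all_heads = 0.8 ** 100  # ≈ 2.04e-10

# "All Heads" is about sixty orders of magnitude more probable.
print(math.log10(p_all_heads / p_all_tails))  # ≈ 60.2
```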
Even though "all Heads" is the uniquely most probable sequence, you'd still be pretty surprised to see it—there's only one such possible outcome, and it only happens a 2.037 · 10⁻¹⁰th of the time. You probably expect to get a sequence with about twenty Tails in it, and there are lots of those, even though each individual one is less probable than "all Heads."
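To put numbers on that intuition (a sketch using the standard library's `math.comb`; twenty is just the expected number of Tails in 100 flips):

```python
import math

# Any one particular sequence with exactly 20 Tails is far less
# probable than "all Heads"...
p_one_such = 0.8 ** 80 * 0.2 ** 20  # ≈ 1.85e-22

# ...but there are a lot of such sequences.
count = math.comb(100, 20)          # ≈ 5.36e20 of them

# Collectively they carry about a tenth of the total probability
# mass, versus ≈ 2e-10 for the lone all-Heads sequence.
print(count * p_one_such)           # ≈ 0.099
```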
Call the number of times we flip our Bernoulli coin N, and call the entropy of the coinflip H. (For the 80/20 biased coin, H is ⅕ lg 5 + ⅘ lg (5/4) ≈ 0.7219.)
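Numerically (a quick sketch; `math.log2` is lg, and the formula is just the usual −p lg p − (1−p) lg (1−p)):

```python
import math

p = 0.8  # probability of Heads
# Entropy of a single flip of the 80/20 coin, in bits.
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(H)  # ≈ 0.7219
```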
It turns out that for sufficiently large N (I know, one of those theorems, right?), almost all of the probability mass is going to live in a subset of 2ᴺᴴ outcomes, each of which has probability close to 2⁻ᴺᴴ (and you'll notice that 2ᴺᴴ · 2⁻ᴺᴴ = 1).
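For our coin, you can see both halves of that claim concretely (a sketch; note that a sequence with exactly the expected twenty Tails has probability 0.8⁸⁰ · 0.2²⁰, which works out to exactly 2 to the −100H):

```python
import math

N = 100
p = 0.8  # probability of Heads
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # ≈ 0.7219

# The typical set has about 2**(N*H) members -- a tiny sliver of
# all 2**N possible sequences.
print(2 ** (N * H))  # ≈ 5.4e21, versus 2**100 ≈ 1.27e30

# And a sequence with exactly 20 Tails (the expected count) has
# probability exactly 2**(-N*H).
assert math.isclose(0.8 ** 80 * 0.2 ** 20, 2 ** (-N * H))
```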