Consistent Hashing

Dear reader, suppose you're a distibuted data storage system. Your soul (although some pedants would insist on the word program) is dispersed across a cluster of several networked computers. From time to time, your human patrons give you files, and your job—more than that, the very purpose of your existence—is to store these files for safekeeping and later retrieval.

The humans who originally crafted your soul chose a simple algorithm as the means by which you decide which file goes on which of the many storage devices that live in the computers you inhabit: you find the MD5 hash of the filename, take its residue modulo n where n is the number of devices you have—let's call the result i—and you put the file on the (zero-indexed) ith device. So when you had sixteen devices and the humans wanted you to store twilight.pdf, you computed md5("twilight.pdf") = 429eb07bb8a3871c431fe03694105883, saw that the lowest nibble was 3, and put the file on your 3rd device (most humans would say the fourth device, counting from one).

It's not a bad system, you tell yourself (some sort of pride or loyalty preventing you from disparaging your creators' efforts, even to yourself). At least it keeps the data spread out evenly. (A shudder goes down your internal buses as you contemplate what disasters might have happened if your creators had been even more naive and, say, had you put files with names starting with A through D on the first device, &c. What would have happened that time when your patrons decided they wanted to store beat00001.mp3 through beat18691.mp3?)

Continue reading

Ford-Fulkerson

Dear reader, have you ever dreamed of solving instances of the maximum flow problem? Sure you have! Suppose we have a weighted directed graph, which we might imagine as a network of pipes (represented by the edges) between locations (represented by the nodes), pipes through which some sort of fluid might thereby be transported across the network. One node is designated the source, another is called the sink, and the weight of the edge (i, j) represents the maximum capacity of the pipe which transports fluid from the location i to location j. The maximum-flow problem is precisely the question of how to transport the maximum possible amount of fluid from the source to the sink (without any fluid leaking or magically appearing at any of the intermediate nodes). That is, we want to assign an amount of fluid flow to each edge, not to exceed that edge's capacity, such that inflow equals outflow for all the intermediate (i.e., non-source, non-sink) nodes, and such that the total flow reaching the sink is maximized.

Continue reading

Quicksort in FIM++

Dear reader, I have got to tell you, fandom is intense. One day last October Equestria Daily (internet clearinghouse for fans of the animated series My Little Pony: Friendship Is Magic) posts a joke proposal for a programming language (FIM++) based on the show, and within the week there's a working interpreter for it. What does it mean to model a programming language after a cartoon, you ask? Well, in the show, episodes typically end with our heroine Twilight Sparkle (or after Season Two, Episode Three "Lesson Zero", one of her friends) writing a letter about what she's learned about the magic of friendship to her mentor (and God-Empress of the sun) Princess Celestia. So, then, why not have an esoteric programming langauge where the source code reads like a letter to Princess Celestia? Adorable, right?

So, this gift having been provided to us courtesy of Karol S. and the brony community, let's do something with it! More specifically, how about we implement quicksort?—that is a classic. What's quicksort? Well, we want to sort a list, right? So—bear with me—we define this partitioning procedure that, given indices into an array, partitions the subarray between those indices into a subsubarray of elements less-than-or-equal-to a given element dubbed the pivot, then the pivot itself, then a subsubarray of elements greater than the pivot. How do we do that? Well, let's designate the last element in our subarray as the pivot. Then we're going to scan through all the other elements, and if any of them are less-than-or-equal-to the pivot, we swap it into our first subsubarray and increment a variable keeping track of where the first subsubarray ends. Then, we swap the pivot into place and return its index. In Ruby—

Continue reading

Huffman

Dear reader, you know what's way more fun than feeling sad about the nature of the cosmos? Data compression, that's what! Suppose you want to send a message to your friends in a nearby alternate universe, but interuniversal communication bandwidth is very expensive (different universes can't physically interact, so we and our alternate-universe analogues can only communicate by mutually inferring what the other party must be saying, which takes monstrous amounts of computing power and is not cheap), so you need to make your message as brief as possible. Note that 'brief' doesn't just have to do with how long your message is in natural language, it also has to do with how that message is represented over the transuniveral communication channel: indeed, the more efficient the encoding, the more you can afford to say on a fixed budget.

The classic ASCII encoding scheme uses seven bits to represent each character. (Seven?—you ask perplexedly, surely you mean eight? Apparently it was seven in the original version.) Can we do better? Well ... ASCII has a lot of stuff that arguably you don't need that badly. Really, upper and lower case letters? Ampersands, asterisks, backslashes? And don't get me started about those unprintable control characters! If we restrict our message to just the uncased alphabet A through Z plus space and a few punctuation marks, then we can encode our message using only a 32 (= 25) character set, at five bits per character.

Can we do better? Seemingly not—24 = 16 isn't a big enough character set to cover the alphabet. Unless ...

Continue reading