1. Arbitrary Incrementer in PHP

    On several recent occasions I had a need for an incrementer that uses an arbitrary character set and I thought I'd share my code with you.

    I've used this code in the GPL Virus that I wrote to poke fun at the Wordpress/Thesis/GPL debacle, as well as in some clean up I'm doing for the extremely useful JS Bin project.

    The most important application, however, was in creating a URL shortening system for the as-yet-unannounced startup project that I'm working on.

    I wanted the URL shortener to make the shortest possible URLs. To keep the number of characters in a URL short, I had to increase the set of characters that could comprise a key.

    To illustrate this, consider a hexadecimal number versus its decimal equivalent:

    $num = 32323232321;
    echo $num . "\n";
    echo dechex($num) . "\n";

    This outputs:

    32323232321
    7869d6241

    As you can see, the second number is two characters shorter than the first number. The reason for this is that every digit of a decimal number is represented by one of 0123456789 (10 unique characters), while ever digit of the hexadecimal number is represented by one of 0123456789abcdef (16 unique characters). This means that we can pack more information into each digit, making the overall length of the key shorter.

    PHP has a base_convert() function that allows any sequential base up to 36 (the number of letters in the alphabet (26) plus the 10 numeric digits). We can further compress the above example by increasing the base from 16 (hexadecimal) to 36:

    $num = 32323232321;
    echo $num . "\n";
    echo base_convert($num, 10, 16) . "\n";
    echo base_convert($num, 10, 36) . "\n";

    Using the full spectrum saves us 4 characters:

    32323232321
    7869d6241
    eukf1oh

    Unfortunately, base_convert() does not take the base beyond 36. I wanted to increase the information density (and thus decrease the length of the tokens) even further. URLs are case-sensitive, so why not use both uppercase and lowercase letters? We might as well throw in a few extra characters (- and _).

    Additionally, I wanted to be able to increment the sequence, based on the current maximum value. PHP offers no facility as simple as base_convert for this (and the $a = "zzz"; echo ++$a; trick doesn't quite do what I need).

    After a bit of code wrangling, I came up with the following algorithm that allows an arbitrary character set, and increments over it, recursively.

    function inc($n, $pos=0)
    {
        static $set = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_';
        static $setmax = 61;
     
        if (strlen($n) == 0) {
            // no string
            return $set[0];
        }
     
        $nindex = strlen($n) - 1 - $pos;
        if ($nindex < 0) {
            // add a new digit to the front of the number
            return $set[0] . $n;
        }
     
        $char = $n[$nindex];
        $setindex = strpos($set, $char);
     
        if ($setindex == $setmax) {
            $n[$nindex] = $set[0];
            return inc($n, $pos+1);
        } else {
            $n[$nindex] = $set[$setindex + 1];
            return $n;
        }
    }

    To change the set, simply alter the $set variable, and adjust the $setmax accordingly. I hope you find this as useful as I have.

    After writing this piece, but before publishing it, I stumbled upon some similar code that they use at Flickr to do arbitrary base conversion, so take a peek over there to see how they handle this.

    0 Responses

    Feed for this Entry