Recent Happenings

I've got a bunch of stuff that I haven't found/made time to blog about, so just dropping some quick notes here:

  • I've been invited to speak at PHP Quebec 2009. I've been to this conference a few times (but not for a couple years, now), and I'm really looking forward to getting back into the conference circuit (as a speaker, not an organizer... think of all the free time I'll have! (-; ). Anyway, I'll be giving a talk entitled "Stupid Browser Tricks" in which I'll talk (at a high level) about Firebug and Selenium IDE, and possibly a few other things like granular browser security, Komodo macros/extensions (like a browser!) and maybe Greasemonkey.
  • This year, I was once again invited back to the Microsoft Web Developers Summit (couldn't think of a better URL). This is a yearly event where Microsoft invites selected members of the PHP community to Redmond for a discussion on PHP and Microsoft's offerings. This year was definitely the best one yet, as it was better organized, and it felt much less like they were trying to sell us things. Their candor was especially appreciated this year, as I think many of the attendees felt like Microsoft was asking us for our opinions instead of trying to give them to us. I wrote about this last year, and I think what I wrote still rings true, today. Thanks to the organizers... we got some great information, made our opinions clear, and had a LOT of fun (great people!).
  • I tweeted about this, but never posted it on my blog. My colleague Luke Welling is a funny guy.
  • Over the holiday weekend (I got days off, but in Canadia, we celebrate Thanksgiving in October), I found some time to work on a bunch of pet projects, including fale.ca, which is nothing special, but kind of fun. See?
  • Today, I was extended an invitation to join the Habari Cabal, which I quickly accepted. So, if you use Habari and your blog breaks in the future, it's probably my fault.
  • ... and last, but not least, Chris and I—with the help of many other people—managed to almost get the 2008 PHP Advent calendar launched in time. Word on the street is that Jon Tan is going to show the design some love, and we have a feed. The 2007 edition was a success, but was a lot of work, so I offered to pitch in this year. Thanks to everyone who's already submitted... and the rest of you slackers: get to it! (-;

S

UTF: WTF?

Note: This article first ran in php|architect in March 2008, while I still worked at MTA. Marco (the publisher, and my former colleague) has graciously agreed to allow me to republish this in a more public forum. I've wanted to link a few people to it in the past few months and until now that was only possible if they were php|architect subscribers. That said, if you're into PHP, you really should subscribe to php|a.

As you might know, one of my roles at php|architect is to organize and manage speakers (and their talks) for our PHP conferences.

A while back, PHP 6's main proponent, Andrei Zmievski, submitted a talk that we accepted, entitled "I ♥ Unicode, You ♥ Unicode." When we selected the talk and invited Andrei to attend the conference, he accepted and humorously suggested that we pay special attention to the talk's heart characters when publishing details on the conference website and in other promotional materials. I took his suggestion as wise advice, and double checked the site before releasing it to the public—it worked perfectly.

Within a few hours of publication, Andrei dropped me a note indicating that I hadn't heeded his warning, and that the ♥s weren't showing up properly. The problem turned out to be a bug in a specific version of Firefox, and I believe we resolved it by employing the corresponding HTML entity. This ordeal, while minor, was my first taste of how bad things would become.

If I had to guess, I would estimate that I've spent somewhere in the range of 40 hours wrangling UTF-8 in the past 3 months, which is not only expensive for my employer, but also disheartening as a developer who's got real work to do. Admittedly, this number is inflated, due to the heavy development cycle we completed with the launch of our new site. As time goes on, though, I don't see this situation improving in the short term (though, if we were to glimpse much further into the future, I'm sure we'll eventually consider this a solved problem).

The main problem with using Unicode, today, is that it's partially supported by some parts of any given tool chain. Sometimes it works great, and other times—due to a given piece of software's lack of implementation (or worse, a partial implementation), human error, or full-on bugs—the chain's weakest link shatters in a non-spectacular way.

As any experienced developer knows, having the weak point of a process collapse is a normal part of building complex systems. We're used to it, and we usually manage this by making the systems less complex, by eliminating the parts that are prone to collapse, or by fixing the broken parts. When implementing a system that may contain Unicode data, today, we're challenged with many potential points of failure that are often difficult to identify, and nearly impossible to replace.

To illustrate, consider an overly simplified web development work—and content delivery—flow: developer creates a file, developer edits file, developer uploads the files to the web server, httpd receives a request from a browser, httpd passes the request to PHP, PHP delivers content back to httpd, httpd delivers content to the visitor's browser. If a single part of this flow fails to handle Unicode properly, a snowball effect causes the rest of the chain to fail.

A more typical flow for me (and our code) goes something like this: create file, edit file, commit file to svn, other developers edit file, others commit to svn, release is rolled from svn, visitor browser requests page, httpd parses request, httpd delivers request to PHP, PHP processes request, PHP (client) calls service to fulfill back-end portions of request (encodes the request in an envelope—we use JSON most of the time), PHP (service) receives request, service retrieves and/or stores data in database, service returns data to PHP client, PHP client processes returned data and in turn delivers it to httpd, httpd returns data to browser.

If you'll bear with me for one last list in this article, that means that any (one or more!) of the following could fail when handling Unicode: developers' editors, developers' transport (either upload or version control), user's browser, user's http proxy, client-side httpd, client-side PHP, client-side encoder (JSON), service-side httpd (especially HTTP headers), service-side decoder, service-side PHP, service-side database client, database protocol character set imbalance, database table charset, database server, service-side encoder, client-side decoder, client-side PHP (again), client-side httpd (including HTTP headers, again), user's proxy (again), and user's browser (again). I've probably even left some out.

As you can see, there are so many points of failure here, that determining the source of an invalid UTF-8 character is torturous, at best.

Recently, I had to wrestle UTF-8 monsters. It was a combination of user (me) error and an actual bug in PHP, but it was so non-obvious that it caused most of my day to melt away, trying to resolve the issue. I had decided to split a file that contained UTF-8 characters into two files. By default, my editor of choice creates new files using my system character encoding, which happened to be Mac-Roman because I hadn't changed it from Leopard's default. The original file was UTF-8, and the characters displayed normally in the new Mac-Roman file. However, when the data was passed to PHP's json_encode function, the string was arbitrarily truncated, due to a PHP bug.

Because the script that triggered the bug pulled the data from a database, and the data was inserted by another script—the one with the broken encoding/characters—it took me entirely too long to trace it back to the change I'd made to that now-split file. For a while, I even thought that MySQL was storing the data poorly because we'd had problems with that before, and also because the database client I was using that day was reporting the characters improperly, due to its own encoding issues. I believe my blood pressure skyrocketed to dangerous levels, that afternoon.
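In hindsight, a cheap guard at the boundary might have saved me the afternoon. Here's a minimal sketch (not what we actually ran) that checks whether a string is valid UTF-8 before it goes anywhere near json_encode. The helper name isValidUtf8 is made up for illustration. It leans on the fact that preg_match with the /u modifier refuses subjects that aren't valid UTF-8. The sample bytes assume "é", which is the single byte 0x8E in Mac-Roman and the two bytes 0xC3 0xA9 in UTF-8.

```php
<?php
// Hypothetical helper, for illustration only: returns true when $s is
// well-formed UTF-8. With the /u modifier, preg_match() fails (returns
// false) on subjects that are not valid UTF-8, and the empty pattern
// matches any valid subject.
function isValidUtf8($s) {
    return preg_match('//u', $s) === 1;
}

$utf8     = "caf\xC3\xA9"; // "café" encoded as UTF-8
$macRoman = "caf\x8E";     // "café" as saved by a Mac-Roman editor

var_dump(isValidUtf8($utf8));     // valid UTF-8
var_dump(isValidUtf8($macRoman)); // stray 0x8E byte, not valid UTF-8
```

What json_encode does with the invalid string depends on the PHP version, so checking the input first (and failing loudly) is easier to debug than chasing a truncated or missing value three layers later.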

Universal Unicode support is going to be a long uphill battle. I'm not sure I'm ready for it, but I hope it's worth it, nonetheless.

PHP-Aware Diff

UPDATE (and intentionally reinserted into the feed):

I've made a bunch of changes to this code, and updated it.

It's quite a bit slower, but I really don't care (-:

It uses my new pet project, the tokalizer.

You'll probably want to grab the newly-compiled diff-php as this is the one I'll be "maintaining" (ie, when someone complains, or when it breaks for me).

(end update)

I've told a few people that I'd blog about this "soon" and that was a while ago, so I figured I'd better get on the ball.

I tweeted this almost two weeks ago:

Derick responded saying that diff -p does this for C. I tried it with PHP, and it gave me the outermost block where the change occurred (ie, the class, not the function). The interesting thing, though, is that it changed the @@ line:

@@ -32,7 +32,7 @@ class Foo2 {

Almost what I was looking for, but not quite. I really wanted a php-aware diff that could tell me context.
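If you haven't stared at one of those @@ lines before: it's the unified-diff hunk header, giving the starting line and line count on the left and right sides, optionally followed by whatever context the diff tool wants to show. Here's a rough sketch of pulling one apart with a regular expression. The function name parseHunkHeader and the array keys are invented for this example, and it isn't the code diff-php itself uses.

```php
<?php
// Hypothetical helper (not part of diff-php): split a unified-diff hunk
// header such as "@@ -32,7 +32,7 @@ class Foo2 {" into its parts.
// Returns null when the line is not a hunk header.
function parseHunkHeader($line) {
    $pattern = '/^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$/';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    return array(
        'leftStart'  => (int) $m[1],
        // the count is omitted in the header when it is 1
        'leftCount'  => $m[2] === '' ? 1 : (int) $m[2],
        'rightStart' => (int) $m[3],
        'rightCount' => $m[4] === '' ? 1 : (int) $m[4],
        // anything after the closing @@ (empty string if absent)
        'context'    => $m[5],
    );
}

print_r(parseHunkHeader('@@ -32,7 +32,7 @@ class Foo2 {'));
```

The trailing context is the only part a tool is free to fill in as it likes, which is why it's the natural place to hang class/function information.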

So, what's a developer with almost no spare time on his hands (but an idea of how to actually accomplish this pet project) to do? Write it himself, of course! (-:

So, I did. Here's an example of the output:

--- tmp/left.php
+++ tmp/right.php
@@ -1,7 +1,7 @@ (root)
 <?php
 class Foo {
     function bar() {
-        // baz!
+        // bax!
     }
 }
 
@@ -32,7 +32,7 @@ (root):Foo2(class)
 // k
 // l
     function bar2() {
-        // baz2!
+        // bax2!
     }
 }
 
@@ -63,7 +63,7 @@ (root):Foo3(class):bar3(function)
 // k
 // l
         $test = "foo {$test}";
-        // baz2!
+        // bax2!
     }
 
     function bar4() {
@@ -93,7 +93,7 @@ (root):Foo3(class):bar4(function):bar5(function)
 // k
 // l
             $test = "foo {$test}";
-            //baz5
+            //bax5
 // a
 // b
 // c

Here's the code for my php-aware diff. I use it as my default svn diff command now (see comments). Hope you find it useful, I sure do.

#!/usr/bin/php

<?php

/// PHP-Aware diff



/// Copyright 2008, Sean Coates

///   Usage of the works is permitted provided that this instrument is retained

///   with the works, so that any entity that uses the works is notified of this

///   instrument.

///   DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.

/// (Fair License - http://www.opensource.org/licenses/fair.php )

/// Short license: do whatever you like with this.





//// save this file as diff-php

////    and make sure /path/to/diff-php is chmod +x



//// TO USE from cli:

////    /path/to/diff-php leftfile rightfile   # (compares files, as diff does)



////

//// TO USE from svn:

////    in ~/.subversion/config, add: diff-cmd = /path/to/diff-php



//// You might need to adjust DIFF_PATH, below



// the tokenizer scares me a bit (-:



class DiffPHP {

   

    const DEBUG_SYNTAX = false; // set to true to get syntax error data (== broken diffs)

   

    const DIFF_PATH = '/usr/bin/diff';

    const DIFF_OPTS = '-u';

   

    /**

     * The "left" file, as passed by svn (or cli)

     */


    protected $left;



    /**

     * The "right" file, as passed by svn (or cli)

     */


    protected $right;



    /**

     * A "nice" version of the left file.

     *

     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php

     */


    protected $niceLeft;



    /**

     * A "nice" version of the right file.

     *

     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php

     */


    protected $niceRight;



    /**

     * Captured file contents (prevents reading the file twice + diff)

     */


    protected $fileContents;

   

    /**

     * The output from the diff executable

     */


    protected $diff;

   

    /**

     * Each chunk of the diff goes in here (begins with a @@ identifier line)

     */


    protected $chunks;

   

    /**

     * Array of tokens from the Left file

     */


    protected $tokens;

   

    /**

     * Mapping of source lines to source class/functions

     */


    protected $lineMap;

   

    /**

     * Current context (used to construct line map)

     */


    protected $context;

   

    /**

     * Brace depth (used to determine if we're still in the current context)

     */


    protected $braceDepth;

   

    /**

     * Bool flag to indicate that syntax is somehow broken

     */


    protected $isBroken;

   

    /**

     * Object-wide index to keep track of the current token number

     */


    protected $tokenIndex;

   

    /**

     * Currently parsing token value

     */


    protected $currentValue;

   

    /**

     * Constructor. The magic happens here. Once instantiated, the entire

     * process runs

     */


    public function __construct() {

        $this->parseArgs();

       

        $this->fileContents = file_get_contents($this->left);



        $this->doDiff();

       

        // subject (probably) IS a PHP file:

        if (!isset($_ENV['NODIFFPHP']) && stripos($this->fileContents, '<?') !== false) {

            $this->splitDiff();

            $this->determineHierarchy();

            $this->reconstructDiff();

        } else {

            // not a PHP file; return regular diff:

            echo $this->diff;

        }

    }

   

    /**

     * Parses the passed arguments.

     *

     * Determines if it's svn (7 args) or cli (2 args), and stores the parsed

     * arguments.

     */


    protected function parseArgs() {

        // if this is being called from svn, we'll get 7 arguments

        //   (8th is argv 0 == this script)

        if (8 == $_SERVER['argc']) {

            $this->niceLeft = $_SERVER['argv'][3];

            $this->niceRight = $_SERVER['argv'][5];

            $this->left = $_SERVER['argv'][6];

            $this->right = $_SERVER['argv'][7];

        } else if (3 == $_SERVER['argc']) {

            // 2 arguments means a regular diff

            $this->niceLeft = $_SERVER['argv'][1];

            $this->niceRight = $_SERVER['argv'][2];

            $this->left = $this->niceLeft;

            $this->right = $this->niceRight;

        } else {

            die("See " . __FILE__ . " for details on how to use this script\n");

        }

    }

   

    /**

     * Calls the external diff program to get the base diff

     */


    protected function doDiff() {

        if (is_readable($this->left) && is_readable($this->right)) {

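            // escape the paths so spaces or shell metacharacters in file names are safe
            $diffCmd = self::DIFF_PATH . ' ' . self::DIFF_OPTS . ' ' . escapeshellarg($this->left) . ' ' . escapeshellarg($this->right);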

            $this->diff = `$diffCmd`;

        } else {

            die("{$this->left} or {$this->right} is not readable\n");

        }

    }

   

    /**

     * Takes an identifier line (looks like: @@ -30,23 +30,79 @@) and returns

     * the begin line number

     */


    protected function parseLineNum($identifier) {

        list(,$from) = explode(" ", $identifier);

        list($from) = explode(',', $from);

        return (int) substr($from, 1);

    }

   

    /**

     * Sanitizes CRLF or CR into just LF

     */


    protected function sanitizeLineEndings($data) {

        // first, sanitize line endings:

        $data = str_replace("\r\n", "\n", $data);

        $data = str_replace("\r",   "\n", $data);

        return $data;

    }    

   

    /**

     * Actually splits the diff into chunks and stores chunks + line numbers

     */


    protected function splitDiff() {

        // now split:

        $this->diff = explode("\n", $this->sanitizeLineEndings($this->diff));

       

        // array to return:

        $this->chunks = array();

       

        // line counter

        $line = 0;

       

        // outer loop: file(s)

        $maxLine = count($this->diff);

   

        // skip first 2 lines as left, right files

        $line += 2;

   

        // descend into data chunks

        while ($line < $maxLine) {

            // next line is the chunk identifier

            $dataChunk = array();

            $dataChunk['identifier'] = $this->diff[$line++];

            $dataChunk['line'] = $this->parseLineNum($dataChunk['identifier']);

            $dataChunk['data'] = array();

            while ($line < $maxLine && !(substr($this->diff[$line], 0, 2) == '@@' && substr($this->diff[$line], -2) == '@@')) {

                $dataChunk['data'][] = $this->diff[$line++];

            }

            $this->chunks[] = $dataChunk;

        }

    }

   

    /**

     * Reconstructs the diff (with adjusted identifier lines, and outputs the

     * result)

     */


    protected function reconstructDiff() {

        $out = "--- {$this->niceLeft}\n+++ {$this->niceRight}\n";

        foreach ($this->chunks as $chunk) {

            $out .= $chunk['identifier'] . "\n";

            $out .= implode("\n", $chunk['data']) ."\n";

        }

        echo $out;

    }

   

    /**

     * Descends into a deeper context

     *

     * @param string $type friendly name, either class or function

     */


    protected function enterContext($type) {

        // next comes whitespace:

        if (is_array($this->tokens[++$this->tokenIndex])) {

            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

        } else {

            $token = null;

            $this->currentValue = $this->tokens[$this->tokenIndex];

        }

        if ($token != T_WHITESPACE) {

            // syntax is broken, let's get out of here

            if (self::DEBUG_SYNTAX) {

                die("Syntax broken in whitespace assertion, " . $this->context[count($this->context) - 1] . "\n");

            }

            $this->isBroken = true;

            return;

        }

        $this->checkLineBreak();

       

        // next comes the name:

        if (is_array($this->tokens[++$this->tokenIndex])) {

            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

        } else {

            $token = null;

            $this->currentValue = $this->tokens[$this->tokenIndex];

        }

        $this->context[] = $this->currentValue . "({$type})";

       

        // chew through the next few tokens until we get a "{"

        while ($this->currentValue != '{' && $this->tokenIndex < count($this->tokens)) {

            if (is_array($this->tokens[++$this->tokenIndex])) {

                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

            } else {

                $token = null;

                $this->currentValue = $this->tokens[$this->tokenIndex];

            }

            $this->checkLineBreak();

            switch ($token) {

                // these are all valid before the brace:

                case null:

                case T_WHITESPACE:

                case T_VARIABLE:

                case T_EXTENDS:

                case T_IMPLEMENTS:

                case T_STRING:

                case T_ARRAY:

                case T_CONSTANT_ENCAPSED_STRING:

                case T_LNUMBER:

                case '=':

                    break;

               

                // if another token is found, then there's a syntax error

                // (this was added to prevent really deep looping)

                default:

                    if (self::DEBUG_SYNTAX) {

                        die("Syntax broken in token assertion, " . $this->context[count($this->context) - 1] . "," . token_name($token) . "\n");

                    }

                    $this->isBroken = true;

                    return;

            }

        }

       

        // found the starting brace

        $this->braceDepth[count($this->context) - 1] = 1;

    }    

   

    /**

     * Tokenizes the code and creates a line map

     */


    protected function tokenizeHierarchy() {

        $this->context = array('(root)');

        $this->lineMap = array('');

        $this->tokens = token_get_all($this->sanitizeLineEndings($this->fileContents));

        $this->isBroken = false;

        for ($this->tokenIndex=0; $this->tokenIndex<count($this->tokens); $this->tokenIndex++) {

            if ($this->isBroken) {

                // syntax is somehow broken; return progress, but don't go further

                return;

            }

            if (is_array($this->tokens[$this->tokenIndex])) {

                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

            } else {

                $token = null;

                $this->currentValue = $this->tokens[$this->tokenIndex];

                //change here

            }

           

            switch ($token) {

                // check for class

                case T_CLASS:

                    // found "class"

                    $this->enterContext('class');

                    break;

               

                case T_FUNCTION:

                    // found "function"

                    $this->enterContext('function');

                    break;

               

                default:

                    $idx = count($this->context) - 1;

                    switch ($this->currentValue) {

                        case '{':

                        case T_CURLY_OPEN:

                        case T_DOLLAR_OPEN_CURLY_BRACES:

                            ++$this->braceDepth[$idx];

                            break;

                       

                        case '}':

                            --$this->braceDepth[$idx];

                            if ($this->braceDepth[$idx] == 0) {

                                // we're out of this context

                                array_pop($this->context);

                            } else if ($this->braceDepth[$idx] < 0) {

                                // bad stuff!

                                if (self::DEBUG_SYNTAX) {

                                    die("Syntax broken in brace close assertion, " . $this->context[count($this->context) - 1] . "\n");

                                }

                                $this->isBroken = true;

                            }

                            break;

                       

                        default:

                            $this->checkLineBreak();

                    }

            }

        }

    }

   

    /**

     * Determines if the currently processing token contains line breaks, and

     * if so, adjusts the lineMap accordingly

     */


    protected function checkLineBreak() {

        // check for new line:

        if (strpos($this->currentValue, "\n") !== false) {

            for ($j=1; $j<=substr_count($this->currentValue, "\n"); $j++) {

                $this->lineMap[] = implode(':', $this->context);

            }

        }

    }

   

    /**

     * Matches the chunk map to the line map

     */


    protected function determineHierarchy() {

        $this->tokenizeHierarchy();

        for ($chunknum=0; $chunknum < count($this->chunks); $chunknum++) {

            $this->chunks[$chunknum]['identifier'] .= ' ' . $this->lineMap[$this->chunks[$chunknum]['line']];

        }

    }

}



new DiffPHP;



// komode: le=unix language=php codepage=utf8 tab=4 notabs indent=4

The most up-to-date version of this file can also be found in my personal svn repository: https://svn.caedmon.net/svn/public/diff-php/diff-php.

Please let me know if you run into any bugs... I'm sure there are a few, but it works pretty well for me.

S

Is Pagination Still Necessary?

My first network connection device was a 2400 baud modem. Practically speaking, that would allow me to sustain downloads at a rate of less than 250 bytes per second. This was relatively fast at the time; I'd been using my buddy's 1200 baud modem to connect to local BBSs before that modem-netting birthday.

To put this into perspective, the Yahoo! homepage, all considered, is somewhere around 470kB. On my early-90s era modem, it would have taken a little over 30 minutes (half of one hour) to download (in perfect conditions, without protocol overhead (good ol' zmodem), and if my mom didn't happen to pick up the phone during transfer).

For the past few years, I've had a 10 megabit connection (downstream) into my home/office. Under perfect conditions, I can pull the entire Y! homepage, and all attached media in less than half of one second.
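For anyone who wants to check my math, here's a back-of-the-envelope sketch. The function name downloadSeconds is made up for this example. It assumes a 470kB page means 470,000 bytes, treats "2400 baud" as 2400 bits per second with 10 bits on the wire per byte (start and stop bits included), and ignores latency and protocol overhead entirely.

```php
<?php
// Hypothetical helper: ideal transfer time in seconds, ignoring latency
// and protocol overhead. $bitsOnWirePerByte is 10 for a serial modem link
// (8 data bits plus start/stop bits, an assumption) and 8 otherwise.
function downloadSeconds($bytes, $bitsPerSecond, $bitsOnWirePerByte = 8) {
    return ($bytes * $bitsOnWirePerByte) / $bitsPerSecond;
}

$pageBytes = 470 * 1000; // assumed size of the Y! homepage

$modem     = downloadSeconds($pageBytes, 2400, 10);
$broadband = downloadSeconds($pageBytes, 10 * 1000 * 1000);

printf("2400 bps modem: %.1f minutes\n", $modem / 60);
printf("10 Mbps line:   %.3f seconds\n", $broadband);
```

That works out to roughly 32.6 minutes on the modem versus about 0.38 seconds on the 10 megabit line.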

In the early 90s, the Y! homepage was obviously much smaller—all pages were smaller—but even with a smaller footprint, many pages took a long time to load. I remember browsing with many windows open (browsers didn't have tabs back then... in fact, we barely had browsers (-: ), loading up a dozen or so pages before alt-tabbing back to the first one I'd queued up a few minutes before (on my 14.4kbps modem, by this time), to see if it had finally finished loading.

To overcome low connection speeds, lack of resources on the client side, and other factors such as connection latency that lead to slow page loads, web pioneers came up with a model for allowing content to be delivered in reasonably sized chunks that is still in use today: pagination. Long lists of (say "100") pieces of data ("search results") were separated into smaller pages (of "10"), including widgets to allow skipping to the next, previous and often any page in the set.

Well... mostly still in use today.

Technologies have helped us hack around the idea of separating growing amounts of data into pages. Ajax, for example, allows the dynamic loading of the next set of results without forcing a page reload (often poorly... try bookmarking the result of many of these dynamic populations). Even Mobile Mail on the iPhone/iPod Touch allows something like this.

It seems to me, though, that web interface designers are stuck in this rut of showing end users a mere 10, 20 or even 100 items at a time. My 10Mb connection can handle a lot more traffic than you're sending; your server had better be able to deliver it (and usually, it can); my browser is allowed to allocate much more RAM; and I even like to think that I've microevolved the ability to parse much more data than I could a few years ago.

So, I ask you, fellow web professionals: is pagination still necessary? I obviously don't think so, but I'm not a User Experience guy, I'm a user (and also the guy who has to make the UX happen, and make sure your server can deliver the results mentioned above). Tell me what you think.

Moved My Blog (and my job)

Hi all.

Quick note to say that I've moved my blog from blog.phpdoc.info to seancoates.com, mostly because I want to shift toward a wider focus than phpdoc.info implies.

I've also changed my blogging software from the horribly slow (it became that way, I suspect partially due to database problems) Serendipity to the shiny and new Habari 0.5-dev.

I've been tinkering with Habari for a couple months now, and I'm really liking both the code and the community.

As you may or may not have already heard (and may or may not care), I've left php|architect and have started a new work-life at OmniTI. So far, I'm really liking it, especially the team of great people over here.

Anyway, I've been putting off blogging for months for various reasons. Hopefully now that one of those reasons (dying blog platform) is out of the way, I'll be able to get back on the wagon.

(BTW, I've taken care to redirect all of my old links to my new blog, but if you happen to find one that doesn't work as expected, please do let me know. Thanks!)

S
