PHP-Aware Diff

UPDATE (and intentionally reinserted into the feed):

I've made a bunch of changes to this code, and updated it.

It's quite a bit slower, but I really don't care (-:

It uses my new pet project, the tokalizer.

You'll probably want to grab the newly-compiled diff-php as this is the one I'll be "maintaining" (ie, when someone complains, or when it breaks for me).

(end update)

I've told a few people that I'd blog about this "soon" and that was a while ago, so I figured I'd better get on the ball.

I tweeted this almost two weeks ago:

Derick responded saying that diff -p does this for C. I tried it with PHP, and it gave me the outermost block where the change occurred (ie, the class, not the function). The interesting thing, though, is that it changed the @@ line:

@@ -32,7 +32,7 @@ class Foo2 {

Almost what I was looking for, but not quite. I really wanted a PHP-aware diff that could tell me context.
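
As a rough illustration of what sits on that @@ line, here is a minimal sketch that splits a unified-diff hunk header into its line numbers and trailing label. The function name `parseHunkHeader` is hypothetical (it is not part of diff-php), and the snippet assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper, not part of diff-php: split a unified-diff hunk
// header into its numbers and whatever label follows the second "@@".
function parseHunkHeader(string $header): ?array
{
    // A count may be omitted (e.g. "@@ -1 +1 @@"), which means one line.
    $pattern = '/^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@ ?(.*)$/';
    if (!preg_match($pattern, $header, $m)) {
        return null; // not a hunk header
    }
    return [
        'leftStart'  => (int) $m[1],
        'leftCount'  => $m[2] === '' ? 1 : (int) $m[2],
        'rightStart' => (int) $m[3],
        'rightCount' => $m[4] === '' ? 1 : (int) $m[4],
        'context'    => $m[5],
    ];
}

$hunk = parseHunkHeader('@@ -32,7 +32,7 @@ class Foo2 {');
echo $hunk['leftStart'], ' ', $hunk['context'], "\n";
```

The left-hand start line is the piece that matters for a context-aware diff, since it identifies where in the original file the change begins.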

So, what's a developer with almost no spare time on his hands (but an idea of how to actually accomplish this pet project) to do? Write it himself, of course! (-:

So, I did. Here's an example of the output:

--- tmp/left.php

+++ tmp/right.php

@@ -1,7 +1,7 @@ (root)

 <?php

 class Foo {

     function bar() {

-        // baz!

+        // bax!

     }

 }

 

@@ -32,7 +32,7 @@ (root):Foo2(class)

 // k

 // l

     function bar2() {

-        // baz2!

+        // bax2!

     }

 }

 

@@ -63,7 +63,7 @@ (root):Foo3(class):bar3(function)

 // k

 // l

         $test = "foo {$test}";

-        // baz2!

+        // bax2!

     }

 

     function bar4() {

@@ -93,7 +93,7 @@ (root):Foo3(class):bar4(function):bar5(function)

 // k

 // l

             $test = "foo {$test}";

-            //baz5

+            //bax5

 // a

 // b

 // c

Here's the code for my php-aware diff. I use it as my default svn diff command now (see comments). Hope you find it useful, I sure do.

#!/usr/bin/php

<?php

/// PHP-Aware diff



/// Copyright 2008, Sean Coates

///   Usage of the works is permitted provided that this instrument is retained

///   with the works, so that any entity that uses the works is notified of this

///   instrument.

///   DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.

/// (Fair License - http://www.opensource.org/licenses/fair.php )

/// Short license: do whatever you like with this.





//// save this file as diff-php

////    and make sure /path/to/diff-php is chmod +x



//// TO USE from cli:

////    /path/to/diff-php leftfile rightfile   # (compares files, as diff does)



////

//// TO USE from svn:

////    in ~/.subversion/config, add: diff-cmd = /path/to/diff-php



//// You might need to adjust DIFF_PATH, below



// the tokenizer scares me a bit (-:



class DiffPHP {

   

    const DEBUG_SYNTAX = false; // set to true to get syntax error data (== broken diffs)

   

    const DIFF_PATH = '/usr/bin/diff';

    const DIFF_OPTS = '-u';

   

    /**

     * The "left" file, as passed by svn (or cli)

     */


    protected $left;



    /**

     * The "right" file, as passed by svn (or cli)

     */


    protected $right;



    /**

     * A "nice" version of the left file.

     *

     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php

     */


    protected $niceLeft;



    /**

     * A "nice" version of the right file.

     *

     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php

     */


    protected $niceRight;



    /**

     * Captured file contents (prevents reading the file twice + diff)

     */


    protected $fileContents;

   

    /**

     * The output from the diff executable

     */


    protected $diff;

   

    /**

     * Each chunk of the diff goes in here (begins with a @@ identifier line)

     */


    protected $chunks;

   

    /**

     * Array of tokens from the Left file

     */


    protected $tokens;

   

    /**

     * Mapping of source lines to source class/functions

     */


    protected $lineMap;

   

    /**

     * Current context (used to construct line map)

     */


    protected $context;

   

    /**

     * Brace depth (used to determine if we're still in the current context)

     */


    protected $braceDepth;

   

    /**

     * Bool flag to indicate that syntax is somehow broken

     */


    protected $isBroken;

   

    /**

     * Object-wide index to keep track of the current token number

     */


    protected $tokenIndex;

   

    /**

     * Currently parsing token value

     */


    protected $currentValue;

   

    /**

     * Constructor. The magic happens here. Once instantiated, the entire

     * process runs

     */


    public function __construct() {

        $this->parseArgs();

       

        $this->fileContents = file_get_contents($this->left);



        $this->doDiff();

       

        // subject (probably) IS a PHP file:

        if (!isset($_ENV['NODIFFPHP']) && stripos($this->fileContents, '<?') !== false) {

            $this->splitDiff();

            $this->determineHierarchy();

            $this->reconstructDiff();

        } else {

            // not a PHP file; return regular diff:

            echo $this->diff;

        }

    }

   

    /**

     * Parses the passed arguments.

     *

     * Determines if it's svn (7 args) or cli (2 args), and stores the parsed

     * arguments.

     */


    protected function parseArgs() {

        // if this is being called from svn, we'll get 7 arguments

        //   (8th is argv 0 == this script)

        if (8 == $_SERVER['argc']) {

            $this->niceLeft = $_SERVER['argv'][3];

            $this->niceRight = $_SERVER['argv'][5];

            $this->left = $_SERVER['argv'][6];

            $this->right = $_SERVER['argv'][7];

        } else if (3 == $_SERVER['argc']) {

            // 2 arguments means a regular diff

            $this->niceLeft = $_SERVER['argv'][1];

            $this->niceRight = $_SERVER['argv'][2];

            $this->left = $this->niceLeft;

            $this->right = $this->niceRight;

        } else {

            die("See " . __FILE__ . " for details on how to use this script\n");

        }

    }

   

    /**

     * Calls the external diff program to get the base diff

     */


    protected function doDiff() {

        if (is_readable($this->left) && is_readable($this->right)) {

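            // escape the file names so spaces and shell metacharacters are safe
            $diffCmd = self::DIFF_PATH . ' ' . self::DIFF_OPTS . ' ' . escapeshellarg($this->left) . ' ' . escapeshellarg($this->right);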

            $this->diff = `$diffCmd`;

        } else {

            die("{$this->left} or {$this->right} is not readable\n");

        }

    }

   

    /**

     * Takes an identifier line (looks like: @@ -30,23 +30,79 @@) and returns

     * the begin line number

     */


    protected function parseLineNum($identifier) {

        list(,$from) = explode(" ", $identifier);

        list($from) = explode(',', $from);

        return (int) substr($from, 1);

    }

   

    /**

     * Sanitizes CRLF or CR into just LF

     */


    protected function sanitizeLineEndings($data) {

        // first, sanitize line endings:

        $data = str_replace("\r\n", "\n", $data);

        $data = str_replace("\r",   "\n", $data);

        return $data;

    }    

   

    /**

     * Actually splits the diff into chunks and stores chunks + line numbers

     */


    protected function splitDiff() {

        // now split:

        $this->diff = explode("\n", $this->sanitizeLineEndings($this->diff));

       

        // array to return:

        $this->chunks = array();

       

        // line counter

        $line = 0;

       

        // outer loop: file(s)

        $maxLine = count($this->diff);

   

        // skip first 2 lines as left, right files

        $line += 2;

   

        // descend into data chunks

        while ($line < $maxLine) {

            // next line is the chunk identifier

            $dataChunk = array();

            $dataChunk['identifier'] = $this->diff[$line++];

            $dataChunk['line'] = $this->parseLineNum($dataChunk['identifier']);

            $dataChunk['data'] = array();

            while ($line < $maxLine && !(substr($this->diff[$line], 0, 2) == '@@' && substr($this->diff[$line], -2) == '@@')) {

                $dataChunk['data'][] = $this->diff[$line++];

            }

            $this->chunks[] = $dataChunk;

        }

    }

   

    /**

     * Reconstructs the diff (with adjusted identifier lines, and outputs the

     * result)

     */


    protected function reconstructDiff() {

        $out = "--- {$this->niceLeft}\n+++ {$this->niceRight}\n";

        foreach ($this->chunks as $chunk) {

            $out .= $chunk['identifier'] . "\n";

            $out .= implode("\n", $chunk['data']) ."\n";

        }

        echo $out;

    }

   

    /**

     * Descends into a deeper context

     *

     * @param string $type friendly name, either class or function

     */


    protected function enterContext($type) {

        // next comes whitespace:

        if (is_array($this->tokens[++$this->tokenIndex])) {

            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

        } else {

            $token = null;

            $this->currentValue = $this->tokens[$this->tokenIndex];

        }

        if ($token != T_WHITESPACE) {

            // syntax is broken, let's get out of here

            if (self::DEBUG_SYNTAX) {

                die("Syntax broken in whitespace assertion, " . $this->context[count($this->context) - 1] . "\n");

            }

            $this->isBroken = true;

            return; // not inside a loop or switch here, so "break" would be a fatal error

        }

        $this->checkLineBreak();

       

        // next comes the name:

        if (is_array($this->tokens[++$this->tokenIndex])) {

            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

        } else {

            $token = null;

            $this->currentValue = $this->tokens[$this->tokenIndex];

        }

        $this->context[] = $this->currentValue . "({$type})";

       

        // chew through the next few tokens until we get a "{"

        while ($this->currentValue != '{' && $this->tokenIndex < count($this->tokens)) {

            if (is_array($this->tokens[++$this->tokenIndex])) {

                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

            } else {

                $token = null;

                $this->currentValue = $this->tokens[$this->tokenIndex];

            }

            $this->checkLineBreak();

            switch ($token) {

                // these are all valid before the brace:

                case null:

                case T_WHITESPACE:

                case T_VARIABLE:

                case T_EXTENDS:

                case T_IMPLEMENTS:

                case T_STRING:

                case T_ARRAY:

                case T_CONSTANT_ENCAPSED_STRING:

                case T_LNUMBER:

                case '=':

                    break;

               

                // if another token is found, then there's a syntax error

                // (this was added to prevent really deep looping)

                default:

                    if (self::DEBUG_SYNTAX) {

                        die("Syntax broken in token assertion, " . $this->context[count($this->context) - 1] . "," . token_name($token) . "\n");

                    }

                    $this->isBroken = true;

                    return;

            }

        }

       

        // found the starting brace

        $this->braceDepth[count($this->context) - 1] = 1;

    }    

   

    /**

     * Tokenizes the code and creates a line map

     */


    protected function tokenizeHierarchy() {

        $this->context = array('(root)');

        $this->lineMap = array('');

        $this->tokens = token_get_all($this->sanitizeLineEndings($this->fileContents));

        $this->isBroken = false;

        for ($this->tokenIndex=0; $this->tokenIndex<count($this->tokens); $this->tokenIndex++) {

            if ($this->isBroken) {

                // syntax is somehow broken; return progress, but don't go further

                return;

            }

            if (is_array($this->tokens[$this->tokenIndex])) {

                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];

            } else {

                $token = null;

                $this->currentValue = $this->tokens[$this->tokenIndex];

                //change here

            }

           

            switch ($token) {

                // check for class

                case T_CLASS:

                    // found "class"

                    $this->enterContext('class');

                    break;

               

                case T_FUNCTION:

                    // found "function"

                    $this->enterContext('function');

                    break;

               

                default:

                    $idx = count($this->context) - 1;

                    switch ($this->currentValue) {

                        case '{':

                        // (a T_CURLY_OPEN token has the value '{', so it is matched above)

                        case '${': // value of a T_DOLLAR_OPEN_CURLY_BRACES token

                            ++$this->braceDepth[$idx];

                            break;

                       

                        case '}':

                            --$this->braceDepth[$idx];

                            if ($this->braceDepth[$idx] == 0) {

                                // we're out of this context

                                array_pop($this->context);

                            } else if ($this->braceDepth[$idx] < 0) {

                                // bad stuff!

                                if (self::DEBUG_SYNTAX) {

                                    die("Syntax broken in brace close assertion, " . $this->context[count($this->context) - 1] . "\n");

                                }

                                $this->isBroken = true;

                            }

                            break;

                       

                        default:

                            $this->checkLineBreak();

                    }

            }

        }

    }

   

    /**

     * Determines if the currently processing token contains line breaks, and

     * if so, adjusts the lineMap accordingly

     */


    protected function checkLineBreak() {

        // check for new line:

        if (strpos($this->currentValue, "\n") !== false) {

            for ($j=1; $j<=substr_count($this->currentValue, "\n"); $j++) {

                $this->lineMap[] = implode(':', $this->context);

            }

        }

    }

   

    /**

     * Matches the chunk map to the line map

     */


    protected function determineHierarchy() {

        $this->tokenizeHierarchy();

        for ($chunknum=0; $chunknum < count($this->chunks); $chunknum++) {

            $this->chunks[$chunknum]['identifier'] .= ' ' . $this->lineMap[$this->chunks[$chunknum]['line']];

        }

    }

}



new DiffPHP;



// komode: le=unix language=php codepage=utf8 tab=4 notabs indent=4

The most up-to-date version of this file can also be found in my personal svn repository: https://svn.caedmon.net/svn/public/diff-php/diff-php.

Please let me know if you run into any bugs. I'm sure there are a few, but it works pretty well for me.
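
For readers who want the core idea without the full script, the line-to-context mapping can be sketched in a much smaller form. This is a simplified illustration, not the code above: `buildLineMap` is a hypothetical name, it assumes PHP 7.1 or later, and it deliberately ignores closures, anonymous classes and by-reference functions (their braces are still counted, but they get no label).

```php
<?php
// Simplified sketch (not the script above): label every source line with
// the class/function context that was open when that line began.
function buildLineMap(string $source): array
{
    $context = ['(root)']; // stack of enclosing class/function labels
    $openAt  = [];         // brace depth at which each pushed label opened
    $depth   = 0;          // current brace depth
    $pending = null;       // 'class' or 'function' keyword awaiting a name
    $label   = null;       // named context awaiting its opening brace
    $line    = 1;
    $map     = [1 => '(root)'];

    foreach (token_get_all($source) as $token) {
        [$id, $text] = is_array($token) ? $token : [null, $token];

        // The first non-whitespace token after the keyword is the name,
        // or else this is something we do not label (e.g. a closure).
        if ($pending !== null && $id !== T_WHITESPACE) {
            if ($id === T_STRING) {
                $label = "{$text}({$pending})";
            }
            $pending = null;
        }

        if ($id === T_CLASS || $id === T_FUNCTION) {
            $pending = ($id === T_CLASS) ? 'class' : 'function';
        } elseif ($id === null && $text === ';') {
            $label = null; // e.g. an abstract method: no body follows
        } elseif (($id === null && $text === '{')
            || $id === T_CURLY_OPEN
            || $id === T_DOLLAR_OPEN_CURLY_BRACES) {
            $depth++;
            if ($label !== null) {
                $context[] = $label;
                $openAt[]  = $depth;
                $label     = null;
            }
        } elseif ($id === null && $text === '}') {
            if ($openAt && end($openAt) === $depth) {
                array_pop($openAt);
                array_pop($context);
            }
            $depth--;
        }

        // Every newline starts a new line, labelled with the current context.
        for ($i = substr_count($text, "\n"); $i > 0; $i--) {
            $map[++$line] = implode(':', $context);
        }
    }
    return $map;
}
```

Looking up a hunk's starting line in the returned array gives a label in the same `(root):Foo(class):bar(function)` style as the sample output. As in the script, a line is labelled with the context in effect when the line starts, so the line holding `class Foo {` itself still reads `(root)`.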

S

Is Pagination Still Necessary?

My first network connection device was a 2400 baud modem. Practically speaking, that would allow me to sustain downloads at a rate of less than 250 bytes per second. This was relatively fast at the time; I'd been using my buddy's 1200 baud modem to connect to local BBSs before that modem-netting birthday.

To put this into perspective, the Yahoo! homepage, all considered, is somewhere around 470kB. On my early-90s era modem, it would have taken a little over 30 minutes (half of one hour) to download (in perfect conditions, without protocol overhead (good ol' zmodem), and if my mom didn't happen to pick up the phone during transfer).

For the past few years, I've had a 10 megabit connection (downstream) into my home/office. Under perfect conditions, I can pull the entire Y! homepage, and all attached media in less than half of one second.
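
The arithmetic behind those two figures can be checked with a quick sketch. It assumes roughly 10 bits on the wire per byte for the modem (consistent with the "less than 250 bytes per second" figure) and 8 bits per byte for the broadband line, and it ignores latency and protocol overhead; `transferSeconds` is just an illustrative name.

```php
<?php
// Back-of-envelope transfer time: size divided by sustained throughput.
function transferSeconds(int $bytes, float $bytesPerSecond): float
{
    return $bytes / $bytesPerSecond;
}

$page = 470 * 1000; // ~470 kB, the homepage size quoted above

// 2400 bps modem, assuming ~10 bits per byte on the wire: 240 bytes/s.
$modem = transferSeconds($page, 2400 / 10);

// 10 Mbit/s line at 8 bits per byte: 1,250,000 bytes/s.
$broadband = transferSeconds($page, 10000000 / 8);

printf("modem: %.1f minutes\n", $modem / 60);    // modem: 32.6 minutes
printf("broadband: %.2f seconds\n", $broadband); // broadband: 0.38 seconds
```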

In the early 90s, the Y! homepage was obviously much smaller—all pages were smaller—but even with a smaller footprint, many pages took a long time to load. I remember browsing with many windows open (browsers didn't have tabs back then... in fact, we barely had browsers (-: ), loading up a dozen or so pages before alt-tabbing back to the first one I'd queued up a few minutes before (on my 14.4kbps modem, by this time), to see if it had finally finished loading.

To overcome low connection speeds, lack of resources on the client side, and other factors such as connection latency that lead to slow page loads, web pioneers came up with a model for allowing content to be delivered in reasonably sized chunks that is still in use today: pagination. Long lists of (say "100") pieces of data ("search results") were separated into smaller pages (of "10"), including widgets to allow skipping to the next, previous and often any page in the set.
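
In code, the classic model amounts to slicing the full list and working out which neighbouring pages exist. A minimal sketch, assuming the whole result set is already in an array; `paginate` and its return keys are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: return one page of results plus the data a
// "previous / next / jump to page" widget would need.
function paginate(array $items, int $page, int $perPage = 10): array
{
    $pageCount = max(1, (int) ceil(count($items) / $perPage));
    $page      = min(max($page, 1), $pageCount); // clamp out-of-range requests

    return [
        'items'    => array_slice($items, ($page - 1) * $perPage, $perPage),
        'page'     => $page,
        'pages'    => $pageCount,
        'previous' => $page > 1 ? $page - 1 : null,
        'next'     => $page < $pageCount ? $page + 1 : null,
    ];
}

$results = range(1, 100);         // stand-in for 100 search results
$third   = paginate($results, 3); // items 21 through 30
```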

Well... mostly still in use today.

Technologies have helped us hack around the idea of separating growing amounts of data into pages. Ajax, for example, allows the dynamic loading of the next set of results without forcing a page reload (often poorly... try bookmarking the result of many of these dynamic populations). Even Mobile Mail on the iPhone/iPod Touch allows something like this.

It seems to me, though, that web interface designers are stuck in this rut of showing end users a mere 10, 20 or even 100 items at a time. My 10Mb connection can handle a lot more traffic than you're sending; your server had better be able to deliver it (and usually, it can); my browser is allowed to allocate much more RAM; and I even like to think that I've microevolved the ability to parse much more data than I could a few years ago.

So, I ask you, fellow web professionals: is pagination still necessary? I obviously don't think so, but I'm not a User Experience guy, I'm a user (and also the guy who has to make the UX happen, and make sure your server can deliver the results mentioned above). Tell me what you think.

Moved My Blog (and my job)

Hi all.

Quick note to say that I've moved my blog from blog.phpdoc.info to seancoates.com, mostly because I want to shift toward a wider focus than phpdoc.info implies.

I've also changed my blogging software from the horribly slow (it became that way, I suspect partially due to database problems) Serendipity to the shiny and new Habari 0.5-dev.

I've been tinkering with Habari for a couple months now, and I'm really liking both the code and the community.

As you may or may not have already heard (and may or may not care), I've left php|architect and have started a new work-life at OmniTI. So far, I'm really liking it, especially the team of great people over here.

Anyway, I've been putting off blogging for months for various reasons. Hopefully now that one of those reasons (dying blog platform) is out of the way, I'll be able to get back on the wagon.

(BTW, I've taken care to redirect all of my old links to my new blog, but if you happen to find one that doesn't work as expected, please do let me know. Thanks!)

S

A Weak Web of Trust

Every time I'm forced to waste small fractions of my life navigating (and re-navigating) the Air Canada web site, I run into new points of frustration. For example, this week, I couldn't check pricing on a trip because of a JavaScript error that prevented the multi-city page from allowing me to submit the form.

Errors (which have since been fixed) aside, I was finally able to complete my reservation, today, and was reminded of an issue of cross-site trust that I suspect will become more and more of a problem, as sites and businesses continue to deepen their level of cooperation. This type of collaboration can be good or bad for end users, and in this case, what seems beneficial is actually extremely problematic.

The fundamental source of this problem is two-fold: the end-user's inability to know who is receiving trusted information, and the same user's obligation to determine if the identified party should receive this information in the first place.

I've seen it happen in a few places in the past few weeks (my colleague Paul pointed out the Google tie-in that I mention below). I'll comment on these from least- to most-severe/dangerous.

Google

Let's first look at Google. Five years ago (2003), Google acquired Blogger, a blogging service site. Today, if you visit Blogger, you'll be invited to conveniently sign in using your Google Account:

So, what's the problem? It's simple: there's no easy way to tell that Google actually owns Blogger, and that blogger.com should be trusted with your Google credentials. Sure, I know that Blogger is part of the GOOG, and—being up-to-date on things-Web—you probably know... but does your mother? your friends? My wife didn't know.

Indeed, Blogger's main page does say "Copyright © 1999 – 2008 Google" but there's no real, hard link between the two. I could falsely put a similar notice on any of my domains, and it would allow me to steal accounts of anyone who thinks that this is a reasonable practice.

Fortunately, for Blogger users, your gmail account is a relatively low risk (we do use Google docs to plan certain business things that would be considered "confidential" but not necessarily "critically secret.")

Paypal

To step up to what I consider a much more problematic example of "convenient business relationship gone bad," our attention turns to eBay's purchase of Paypal (2002).

I like to browse eBay from time to time, especially to find reasonable prices on brewing stuff. I've won a couple auctions in the past couple months, and I've noticed a very peculiar and dangerous tie-in like the Blogger-Google connection above.

eBay's relationship with Paypal is certainly no secret. I would guess that most eBay regulars generally use Paypal to complete transactions, and many of those are aware that they are, in fact, the same people. Admittedly, this problem might be more or less serious than I'm about to explain, but the fundamental issue is the same—one of trust.

I can't grab a screen shot of this one because I'm unwilling to complete a transaction just for the sake of this blog entry, so you'll have to trust me for this example (or you may have already noticed for yourself). It used to be that when paying a seller via Paypal, you'd be shuttled off to the Paypal site, and returned to eBay upon transaction completion. This is how nearly all Paypal transactions work: merchant passes user off to Paypal to pay, and user is redirected to merchant.

Over the past few weeks (perhaps months, now), there has been a new branding scheme applied to eBay-specific Paypal transactions. When paying, buyers are still (re)directed to paypal.com, but instead of standard Paypal greetings, text, images and colours, users are asked to log into a page that is decorated with eBay's brand (logo, colours, language).

Business-conglomerate aside, this is a very dangerous precedent for Paypal to set. Paypal is understandably one of the biggest targets for phishing scams, and I think it would be in their best interest to keep their site very clearly labeled "Paypal" even if it is "just" eBay. They are quick to attempt to educate their users on the dangers of phishing, and their tips even indicate such now-ambiguous suggestions as "Don't use the same password for PayPal and other online services such as AOL, eBay, MSN, or Yahoo." (Emphasis mine.)

What about sites that LOOK like eBay, but are actually Paypal? Again, I bet that would easily confuse someone who's less Web-savvy.

Visa

Getting back to the problems I had with Air Canada, today, let's discuss the most idiotic and dangerous idea of them all: Verified by Visa.

Verified by Visa is a programme introduced by Visa, in 2001, to help reduce fraudulent credit transactions online by shifting part of the responsibility of preventing fraud from the merchant to the card's issuing bank. The idea is to insert a verification step into an online merchant's purchase process to have a bank essentially vouch for a given card. In this case, Air Canada is the merchant, and Royal Bank of Canada is my issuing bank.

Once again, on the surface, this sounds like a mild inconvenience to end users to create a significant increase in security. In most cases, I believe it does actually do this. Here's my problem: the verification step is inserted into the merchant's page via an iframe. The user is asked for his/her online banking password within this frame, which is actually the issuing bank's web site. I can verify this by loading and inspecting the source, determining that the iframe (probably(!!)) is actually coming from my bank's site (I say "probably" because there COULD be some hard-to-find, obfuscated JavaScript hiding somewhere that changes this URL and/or loads a different frame/source). One cannot reasonably expect casual users to have the necessary HTML-parsing abilities to determine that it's safe to give this page (that appears to actually be the merchant's site, according to the address bar of my browser, by the way) their online banking password. Again, I'm unwilling to purchase a multi-hundred dollar plane ticket to grab a screen shot to illustrate this point. Sorry (-:

Wrap-up

This whole idea of third-party verification without somehow allowing the user to easily intercept/inspect the process is dangerous and sounds like a ripe venue for increased phishing/social engineering exploits. "Reliably check and/or type the URL yourself (to ensure that it matches the site's content and your intent)" is probably the number-one rule for avoiding phishing scams, and the implementations above make it impossible for casual users to take even the most basic of precautions.

Some tips/rules (in my opinion):

  • You have a URL. It's secured by SSL. Use it. Don't split users off onto different sites. Don't allow login from third-party domains (instead direct the user to your main domain, and securely redirect them back to the main content).
  • Optionally use a system like OpenID (I'm looking at you, Blogger).
  • Don't embed critical information forms into a page hosted on a different domain than one that should be trusted with said information; instead, redirect as above.
  • It's bad form to brand your trusted domain with a different site's scheme; it's confusing and dangerous.
  • Make your intentions clear to users. Make the recipient of trusted information painfully obvious to the end user, and do so through a mechanism that the user is prone to actually trust. Read: use the URL/Address Bar, and not text like "don't worry, this form on thanksforthecreditcard.example.com actually submits to paypal.com; you're safe!"
  • NEVER expect casual users to know how to figure out where an iframe is sourcing from, or where a form submits.
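
The "securely redirect them back" part of the first tip could look something like the sketch below: the login page on the trusted domain only sends users back to hosts it already knows about. Everything named here is an assumption for illustration (the `example.com` hosts, the `safeReturnUrl` helper, the fallback URL), and `parse_url()` is lenient with malformed input, so a production check would need to be stricter.

```php
<?php
// Hypothetical allowlist of hosts the login page may send users back to.
const TRUSTED_RETURN_HOSTS = ['www.example.com', 'blog.example.com'];

// Return the requested URL only if it is https and points at a known host;
// otherwise fall back to the main site.
function safeReturnUrl(string $candidate, string $fallback = 'https://www.example.com/'): string
{
    $parts = parse_url($candidate);

    if ($parts === false
        || ($parts['scheme'] ?? '') !== 'https'
        || !in_array(strtolower($parts['host'] ?? ''), TRUSTED_RETURN_HOSTS, true)
        || isset($parts['user'])
        || isset($parts['pass'])) {
        return $fallback;
    }
    return $candidate;
}

// After a successful login on the trusted domain:
// header('Location: ' . safeReturnUrl($_GET['return'] ?? ''));
```

The point of the allowlist is that the address bar shows the trusted domain while credentials are entered, and the return trip cannot be hijacked to land the user on an arbitrary site.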

Google, Paypal, Visa: shame on you. You're violating some of the most fundamental social Web security rules.

How to record a podcast on OSX 10.5.2

I'm so frustrated. It seems that every time we sit down to record the podcast, lately, it all goes to crap, and I'm sick of recording the same thing over and over again only to have it fail (audio gets garbly; drops samples; garageband crashes; kernel panics; all around nasty stuff).

It all seems to stem from Apple seriously screwing up their USB drivers on 10.5.2. This is definitely the first time I've felt seriously let down by my operating system since switching from Linux (which has its own issues) last May.

So, to help all other would-be podcasters out there, I've come up with a chart that helps you choose the proper combination of hardware and software when recording podcasts on 10.5.2:

Seriously, though, if anyone has a real solution to this problem that doesn't involve an OS reinstall (and then not upgrading past 10.5.1), please PLEASE let me know. And no, switching from the left USB port to the right isn't a real solution.

*sob*

S
