PHP-Aware Diff

UPDATED (and intentionally reinserted into the feed):

I've made a bunch of changes to this code, and updated it.

It's quite a bit slower, but I really don't care (-:

It uses my new pet project, the tokalizer.

You'll probably want to grab the newly-compiled diff-php as this is the one I'll be "maintaining" (ie, when someone complains, or when it breaks for me).

(end update)

I've told a few people that I'd blog about this "soon" and that was a while ago, so I figured I'd better get on the ball.

I tweeted this almost two weeks ago:

Derick responded saying that diff -p does this for C. I tried it with PHP, and it gave me the outermost block where the change occurred (ie, the class, not the function). The interesting thing, though, is that it changed the @@ line:

@@ -32,7 +32,7 @@ class Foo2 {

Almost what I was looking for, not not quite. I really wanted a php-aware diff that could tell me context.

So, what's a developer with almost no spare time on his hands (but an idea of how to actually accomplish this pet project) to do? Write it himself, of course! (-:

So, I did. Here's an example of the output:

--- tmp/left.php
+++ tmp/right.php
@@ -1,7 +1,7 @@ (root)
 <?php
 class Foo {
     function bar() {
-        // baz!
+        // bax!
     }
 }
 
@@ -32,7 +32,7 @@ (root):Foo2(class)
 // k
 // l
     function bar2() {
-        // baz2!
+        // bax2!
     }
 }
 
@@ -63,7 +63,7 @@ (root):Foo3(class):bar3(function)
 // k
 // l
         $test = "foo {$test}";
-        // baz2!
+        // bax2!
     }
 
     function bar4() {
@@ -93,7 +93,7 @@ (root):Foo3(class):bar4(function):bar5(function)
 // k
 // l
             $test = "foo {$test}";
-            //baz5
+            //bax5
 // a
 // b
 // c

Here's the code for my php-aware diff. I use it as my default svn diff command now (see comments). Hope you find it useful, I sure do.

#!/usr/bin/php
<?php
/// PHP-Aware diff
 
/// Copyright 2008, Sean Coates
///   Usage of the works is permitted provided that this instrument is retained
///   with the works, so that any entity that uses the works is notified of this
///   instrument.
///   DISCLAIMER: THE WORKS ARE WITHOUT WARRANTY.
/// (Fair License - http://www.opensource.org/licenses/fair.php )
/// Short license: do whatever you like with this.
 
 
//// save this file as diff-php
////    and make sure /path/to/diff-php is chmod +x
 
//// TO USE from cli:
////    /path/to/diff-php leftfile rightfile   # (compares files, as diff does)
 
////
//// TO USE from svn:
////    in ~/.subversion/config, add: diff-cmd = /path/to/diff-php
 
//// You might need to adjust DIFF_PATH, below
 
// the tokenizer scares me a bit (-:
 
class DiffPHP {
 
    const DEBUG_SYNTAX = false; // set to true to get syntax error data (== broken diffs)
 
    const DIFF_PATH = '/usr/bin/diff';
    const DIFF_OPTS = '-u';
 
    /**
     * The "left" file, as passed by svn (or cli)
     */
    protected $left;
 
    /**
     * The "right" file, as passed by svn (or cli)
     */
    protected $right;
 
    /**
     * A "nice" version of the left file.
     *
     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php
     */
    protected $niceLeft;
 
    /**
     * A "nice" version of the right file.
     *
     * Instead of foo/bar/.svn/base/whatever.php, it would just be whatever.php
     */
    protected $niceRight;
 
    /**
     * Captured file contents (prevents reading the file twice + diff)
     */
    protected $fileContents;
 
    /**
     * The output from the diff executable
     */
    protected $diff;
 
    /**
     * Each chunk of the diff goes in here (begins with a @@ identifier line)
     */
    protected $chunks;
 
    /**
     * Array of tokens from the Left file
     */
    protected $tokens;
 
    /**
     * Mapping of source lines to source class/functions
     */
    protected $lineMap;
 
    /**
     * Current context (used to construct line map)
     */
    protected $context;
 
    /**
     * Brace depth (used to determine if we're still in the current context)
     */
    protected $braceDepth;
 
    /**
     * Bool flag to indicate that syntax is somehow broken
     */
    protected $isBroken;
 
    /**
     * Object-wide index to keep track of the current token number
     */
    protected $tokenIndex;
 
    /**
     * Currently parsing token value
     */
    protected $currentValue;
 
    /**
     * Constructor. The magic happens here. Once instantiated, the entire
     * process runs
     */
    public function __construct() {
        $this->parseArgs();
 
        $this->fileContents = file_get_contents($this->left);
 
        $this->doDiff();
 
        // subject (probably) IS a PHP file:
        if (!isset($_ENV['NODIFFPHP']) && stripos($this->fileContents, '<?') !== false) {
            $this->splitDiff();
            $this->determineHierarchy();
            $this->reconstructDiff();
        } else {
            // not a PHP file; return regular diff:
            echo $this->diff;
        }
    }
 
    /**
     * Parses the passed arguments.
     *
     * Determines if it's svn (7 args) or cli (2 args), and stores the parsed
     * arguments.
     */
    protected function parseArgs() {
        // if this is being called from svn, we'll get 4 arguments
        //   (8th is argv 0 == this script)
        if (8 == $_SERVER['argc']) {
            $this->niceLeft = $_SERVER['argv'][3];
            $this->niceRight = $_SERVER['argv'][5];
            $this->left = $_SERVER['argv'][6];
            $this->right = $_SERVER['argv'][7];
        } else if (3 == $_SERVER['argc']) {
            // 2 arguments means a regular diff
            $this->niceLeft = $_SERVER['argv'][1];
            $this->niceRight = $_SERVER['argv'][2];
            $this->left = $this->niceLeft;
            $this->right = $this->niceRight;
        } else {
            die("See " . __FILE__ . " for details on how to use this script\n");
        }
    }
 
    /**
     * Calls the external diff program to get the base diff
     */
    protected function doDiff() {
        if (is_readable($this->left) && is_readable($this->right)) {
            $diffCmd = self::DIFF_PATH . ' ' . self::DIFF_OPTS . " {$this->left} {$this->right}";
            $this->diff = `$diffCmd`;
        } else {
            die("{$this->left} or {$this->right} is not readable\n");
        }
    }
 
    /**
     * Takes an identifier line (looks like: @@ -30,23 +30,79 @@) and returns
     * the begin line number
     */
    protected function parseLineNum($identifier) {
        list(,$from) = explode(" ", $identifier);
        list($from) = explode(',', $from);
        return (int) substr($from, 1);
    }
 
    /**
     * Sanitizes CRLF or CR into just LF
     */
    protected function sanitizeLineEndings($data) {
        // first, sanitize line endings:
        $data = str_replace("\r\n", "\n", $data);
        $data = str_replace("\r",   "\n", $data);
        return $data;
    }    
 
    /**
     * Actually splits the diff into chunks and stores chunks + line numbers
     */
    protected function splitDiff() {
        // now split:
        $this->diff = explode("\n", $this->sanitizeLineEndings($this->diff));
 
        // array to return:
        $this->chunks = array();
 
        // line counter
        $line = 0;
 
        // outer loop: file(s)
        $maxLine = count($this->diff);
 
        // skip first 2 lines as left, right files
        $line += 2;
 
        // descend into data chunks
        while ($line < $maxLine) {
            // next line is the chunk identifier
            $dataChunk = array();
            $dataChunk['identifier'] = $this->diff[$line++];
            $dataChunk['line'] = $this->parseLineNum($dataChunk['identifier']);
            $dataChunk['data'] = array();
            while ($line < $maxLine && !(substr($this->diff[$line], 0, 2) == '@@' && substr($this->diff[$line], -2) == '@@')) {
                $dataChunk['data'][] = $this->diff[$line++];
            }
            $this->chunks[] = $dataChunk;
        }
    }
 
    /**
     * Reconstructs the diff (with adjusted identifier lines, and outputs the
     * result)
     */
    protected function reconstructDiff() {
        $out = "--- {$this->niceLeft}\n+++ {$this->niceRight}\n";
        foreach ($this->chunks as $chunk) {
            $out .= $chunk['identifier'] . "\n";
            $out .= implode("\n", $chunk['data']) ."\n";
        }
        echo $out;
    }
 
    /**
     * Descends into a deeper context
     *
     * @param string $type friendly name, either class or function
     */
    protected function enterContext($type) {
        // next comes whitespace:
        if (is_array($this->tokens[++$this->tokenIndex])) {
            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];
        } else {
            $token = null;
            $this->currentValue = $this->tokens[$this->tokenIndex];
        }
        if ($token != T_WHITESPACE) {
            // syntax is broken, let's get out of here
            if (self::DEBUG_SYNTAX) {
                die("Syntax broken in whitespace assertion, " . $this->context[count($this->context) - 1] . "\n");
            }
            $this->isBroken = true;
            break;
        }
        $this->checkLineBreak();
 
        // next comes the name:
        if (is_array($this->tokens[++$this->tokenIndex])) {
            list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];
        } else {
            $token = null;
            $this->currentValue = $this->tokens[$this->tokenIndex];
        }
        $this->context[] = $this->currentValue . "({$type})";
 
        // chew through the next few tokens until we get a "{"
        while ($this->currentValue != '{' && $this->tokenIndex < count($this->tokens)) {
            if (is_array($this->tokens[++$this->tokenIndex])) {
                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];
            } else {
                $token = null;
                $this->currentValue = $this->tokens[$this->tokenIndex];
            }
            $this->checkLineBreak();
            switch ($token) {
                // these are all valid before the brace:
                case null:
                case T_WHITESPACE:
                case T_VARIABLE:
                case T_EXTENDS:
                case T_IMPLEMENTS:
                case T_STRING:
                case T_ARRAY:
                case T_CONSTANT_ENCAPSED_STRING:
                case T_LNUMBER:
                case '=':
                    break;
 
                // if another token is found, then there's a syntax error
                // (this was added to prevent really deep looping)
                default:
                    if (self::DEBUG_SYNTAX) {
                        die("Syntax broken in token assertion, " . $this->context[count($this->context) - 1] . "," . token_name($token) . "\n");
                    }
                    $this->isBroken = true;
                    return;
            }
        }
 
        // found the starting brace
        $this->braceDepth[count($this->context) - 1] = 1;
    }    
 
    /**
     * Tokenizes the code and creates a line map
     */
    protected function tokenizeHierarchy() {
        $this->context = array('(root)');
        $this->lineMap = array('');
        $this->tokens = token_get_all($this->sanitizeLineEndings($this->fileContents));
        $this->isBroken = false;
        for ($this->tokenIndex=0; $this->tokenIndex<count($this->tokens); $this->tokenIndex++) {
            if ($this->isBroken) {
                // syntax is somehow broken; return progress, but don't go further
                return;
            }
            if (is_array($this->tokens[$this->tokenIndex])) {
                list($token, $this->currentValue) = $this->tokens[$this->tokenIndex];
            } else {
                $token = null;
                $this->currentValue = $this->tokens[$this->tokenIndex];
                //change here
            }
 
            switch ($token) {
                // check for class
                case T_CLASS:
                    // found "class"
                    $this->enterContext('class');
                    break;
 
                case T_FUNCTION:
                    // found "function"
                    $this->enterContext('function');
                    break;
 
                default:
                    $idx = count($this->context) - 1;
                    switch ($this->currentValue) {
                        case '{':
                        case T_CURLY_OPEN:
                        case T_DOLLAR_OPEN_CURLY_BRACES:
                            ++$this->braceDepth[$idx];
                            break;
 
                        case '}':
                            --$this->braceDepth[$idx];
                            if ($this->braceDepth[$idx] == 0) {
                                // we're out of this context
                                array_pop($this->context);
                            } else if ($this->braceDepth[$idx] < 0) {
                                // bad stuff!
                                if (self::DEBUG_SYNTAX) {
                                    die("Syntax broken in brace close assertion, " . $this->context[count($this->context) - 1] . "\n");
                                }
                                $this->isBroken = true;
                            }
                            break;
 
                        default:
                            $this->checkLineBreak();
                    }
            }
        }
    }
 
    /**
     * Determines if the currently processing token contains line breaks, and
     * if so, adjusts the lineMap accordingly
     */
    protected function checkLineBreak() {
        // check for new line:
        if (strpos($this->currentValue, "\n") !== false) {
            for ($j=1; $j<=substr_count($this->currentValue, "\n"); $j++) {
                $this->lineMap[] = implode(':', $this->context);
            }
        }
    }
 
    /**
     * Matches the chunk map to the line map
     */
    protected function determineHierarchy() {
        $this->tokenizeHierarchy();
        for ($chunknum=0; $chunknum < count($this->chunks); $chunknum++) {
            $this->chunks[$chunknum]['identifier'] .= ' ' . $this->lineMap[$this->chunks[$chunknum]['line']];
        }
    }
}
 
new DiffPhp;
 
// komode: le=unix language=php codepage=utf8 tab=4 notabs indent=4

The most up-to-date version of this file can also be found in my personal svn repostory: https://svn.caedmon.net/svn/public/diff-php/diff-php.

Please let me know if you run into any bugs.. I'm sure there are a few, but it works pretty well for me.

 1

User

You are logged in as Anonymous.

Want to log out?

My friend Paul has a cool service called Wonderproxy that lets you test and develop GeoIP-based apps without the normal headaches. If you need to simulate remote, international traffic, you should check it out.

Clicky Web Analytics