1. Fun with the tokenizer...

    I was reminded, this past week, of how cool the tokenizer is.

    One of the guys who works in the same office as I do had what seemed to be a simple problem: he had a PHP file that contained ~50 functions, and wanted to summarize the API without manually going through the file and cutting out the function declarations.

    We introduced him to inline phpDoc blocks (he works in the same office as a junior PHP developer, but for a different company, so he doesn't have to follow our coding standards, but I digress), but the 50-function library in question didn't have docblocks.

    Sure, he could (and did) pull up a list of function NAMES with get_defined_functions (I assume by using array_diff against a before-and-after capture), but this didn't give him the argument names, or even the number of arguments for a given function, so I broke out some old tokenizer code I'd written.
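
    That before-and-after capture might look something like the sketch below. I'm guessing at how he did it; the temporary file and the demo_* functions are made up so the example runs on its own, and a real run would include the library file instead.

    ```php
    <?php
    // Hypothetical stand-in for the library: write a tiny file with two functions.
    $libFile = tempnam(sys_get_temp_dir(), 'lib');
    file_put_contents(
        $libFile,
        '<?php function demo_add($a, $b) { return $a + $b; } '
        . 'function demo_greet($name, $greeting = "hi") { return "$greeting $name"; }'
    );

    // Capture the defined functions before and after the include.
    $before = get_defined_functions();
    include $libFile;
    $after = get_defined_functions();
    unlink($libFile);

    // Whatever is new in the 'user' list came from the included file.
    $newFuncs = array_values(array_diff($after['user'], $before['user']));
    sort($newFuncs);

    foreach ($newFuncs as $name) {
        echo "$name\n";
    }
    ```

    This gets you the names and nothing else, which is exactly the limitation he ran into.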

    In case you aren't familiar with the tokenizer, the PHP manual defines it as:

    “[an interface to let you write] your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.”

    The extension (which has been part of the PHP core distribution since 4.3.0) consists of only two functions, token_get_all and token_name, plus a boatload of constants.
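
    To get a feel for those two functions, here is a minimal sketch that dumps the tokens of a one-line script. The add() function in $source is invented for the example; token_get_all, token_name and T_WHITESPACE are the real thing.

    ```php
    <?php
    // A tiny snippet to tokenize; the function in it is made up for the example.
    $source = '<?php function add($a, $b) { return $a + $b; }';

    $names = array();
    foreach (token_get_all($source) as $token) {
        if (is_array($token)) {
            // Array tokens carry the token id in [0] and the matched text in [1].
            if ($token[0] == T_WHITESPACE) {
                continue;
            }
            $names[] = token_name($token[0]);
            printf("%-12s %s\n", token_name($token[0]), $token[1]);
        } else {
            // Single-character tokens such as ( ) { } ; come back as plain strings.
            $names[] = $token;
            printf("%-12s %s\n", '(char)', $token);
        }
    }
    ```

    The mix of arrays and plain strings is why any code that walks the token list checks is_array() before looking at $token[0].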

    Enough babble, though, let's get to the meat. I pulled out this code I'd written for PEARClops (on EFNet #PEAR) that parses PHP source files and figures out what classes, functions/methods and associated parameters are included.

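    [php]
    <?php
    // NOTE: the top of this listing, down to the first 'name' => line, was lost.
    // It is reconstructed from how the rest of the script uses its variables,
    // so treat that part as an assumption.
    function get_protos($file)
    {
        $tokens = token_get_all(file_get_contents($file));
        $funcs = array();
        $currClass = '';
        $classDepth = 0;

        for ($i = 0; $i < count($tokens); $i++) {
            if (is_array($tokens[$i]) && $tokens[$i][0] == T_CLASS) {
                $i += 2; // skip T_CLASS and the whitespace after it
                $currClass = $tokens[$i][1];
                while ($tokens[++$i] != '{') {
                    // skip ahead to the opening brace of the class body
                }
                ++$classDepth;
            } elseif (is_array($tokens[$i]) && $tokens[$i][0] == T_FUNCTION) {
                $nextByRef = FALSE;
                $thisFunc = array();

                // walk the declaration up to the closing parenthesis
                while ($tokens[++$i] != ')') {
                    if (is_array($tokens[$i]) && $tokens[$i][0] != T_WHITESPACE) {
                        if (!$thisFunc) {
                            // the first real token is the function name
                            $thisFunc = array(
                                'name' => $tokens[$i][1],
                                'class' => $currClass,
                            );
                        } else {
                            $thisFunc['params'][] = array(
                                'byRef' => $nextByRef,
                                'name' => $tokens[$i][1],
                            );
                            $nextByRef = FALSE;
                        }
                    } elseif ($tokens[$i] == '&') {
                        $nextByRef = TRUE;
                    } elseif ($tokens[$i] == '=') {
                        // skip whitespace, then record the default value's first token
                        while (!in_array($tokens[++$i], array(')', ','))) {
                            if ($tokens[$i][0] != T_WHITESPACE) {
                                break;
                            }
                        }
                        $thisFunc['params'][count($thisFunc['params']) - 1]['default'] = $tokens[$i][1];
                    }
                }
                $funcs[] = $thisFunc;
            } elseif ($tokens[$i] == '{') {
                ++$classDepth;
            } elseif ($tokens[$i] == '}') {
                --$classDepth;
            }

            // back at the top level, so we have left the class
            if ($classDepth == 0) {
                $currClass = '';
            }
        }

        return $funcs;
    }

    function parse_protos($funcs)
    {
        $protos = array();
        foreach ($funcs AS $funcData) {
            $proto = '';
            if ($funcData['class']) {
                $proto .= $funcData['class'];
                $proto .= '::';
            }
            $proto .= $funcData['name'];
            $proto .= '(';
            // 'params' is only set when the function takes arguments
            if (!empty($funcData['params'])) {
                $isFirst = TRUE;
                foreach ($funcData['params'] AS $param) {
                    if ($isFirst) {
                        $isFirst = FALSE;
                    } else {
                        $proto .= ', ';
                    }
                    if ($param['byRef']) {
                        $proto .= '&';
                    }
                    $proto .= $param['name'];
                }
            }
            $proto .= ")";
            $protos[] = $proto;
        }
        return $protos;
    }

    echo "Functions in {$_SERVER['argv'][1]}:\n";
    foreach (parse_protos(get_protos($_SERVER['argv'][1])) AS $proto) {
        echo " $proto\n";
    }
    ?>
    [/php]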

    Save it as "parse_funcs.php" (or whatever you like) and call it like so: php parse_funcs.php /path/to/php_file

    For instance:

    [code]
    sean@iconoclast:~/php/scripts$ php token_funcs_cli.php ~/php/cvs/Mail_Mime/mime.php
    Functions in /home/sean/php/cvs/Mail_Mime/mime.php:
     Mail_mime::Mail_mime($crlf)
     Mail_mime::__wakeup()
     Mail_mime::setTXTBody($data, $isfile, $append)
     Mail_mime::setHTMLBody($data, $isfile)
     Mail_mime::addHTMLImage($file, $c_type, $name, $isfilename)
     Mail_mime::addAttachment($file, $c_type, $name, $isfilename, $encoding)
     Mail_mime::_file2str(&$file_name)
     Mail_mime::_addTextPart(&$obj, $text)
     Mail_mime::_addHtmlPart(&$obj)
     Mail_mime::_addMixedPart()
     Mail_mime::_addAlternativePart(&$obj)
     Mail_mime::_addRelatedPart(&$obj)
     Mail_mime::_addHtmlImagePart(&$obj, $value)
     Mail_mime::_addAttachmentPart(&$obj, $value)
     Mail_mime::get(&$build_params)
     Mail_mime::headers(&$xtra_headers)
     Mail_mime::txtHeaders($xtra_headers)
     Mail_mime::setSubject($subject)
     Mail_mime::setFrom($email)
     Mail_mime::addCc($email)
     Mail_mime::addBcc($email)
     Mail_mime::_encodeHeaders($input)
     Mail_mime::_setEOL($eol)
    [/code]

    Not bad, huh?

    There are some not-so-obvious bugs (inheritance, mostly), but for a relatively short script, it does a pretty good job.

    4 Responses

    • I have done several projects with the Tokenizer now, and am quite familiar with it (PHP_CompatInfo and my PHP Documented Source script).

      Greg (Beaver) convinced me that the Reflection API in PHP5 is vastly superior and quicker. You should check it out if you haven't already.

      http://www.php.net/manual/en/language.oop5.reflection.php

      - Davey

    • Sean Coates

      2005 Mar 07 01:46

      I'm not overly familiar with the Reflection API. However, I forgot to mention that the developer in question is on PHP 4 only.

      Good tip, though.

      S

    • You need PHP5 only for the reflecting itself, so it should still be possible to parse the PHP4 code and create some nice output.

    • Hi,
      The script is missing T_CURLY_OPEN and T_DOLLAR_OPEN_CURLY_BRACES, which come instead of '{' for variables inside strings. The closing token is always '}'.
      à+
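
    The point in that last comment can be seen with a small sketch. The brace_depth() helper and the demo() function in $source are invented for the illustration; T_CURLY_OPEN and T_DOLLAR_OPEN_CURLY_BRACES are the tokenizer's own constants. Counting only the plain '{' and '}' tokens leaves the depth unbalanced as soon as a string contains "{$x}".

    ```php
    <?php
    // Made-up input: a function whose body interpolates {$x} inside a string.
    $source = '<?php function demo($x) { return "a {$x} b"; }';

    // Hypothetical helper: net brace depth after scanning the whole source.
    function brace_depth($source, $countStringBraces)
    {
        $depth = 0;
        foreach (token_get_all($source) as $token) {
            if ($token === '{') {
                ++$depth;
            } elseif ($token === '}') {
                // the closing brace is a plain '}' even inside a string
                --$depth;
            } elseif ($countStringBraces && is_array($token)
                && in_array($token[0], array(T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES))) {
                // opening braces inside strings arrive as array tokens
                ++$depth;
            }
        }
        return $depth;
    }

    echo brace_depth($source, false), "\n"; // prints -1: the string's '}' has no partner
    echo brace_depth($source, true), "\n";  // prints 0: balanced once both tokens count
    ```

    In the script above, the same imbalance would throw off $classDepth and, with it, the class name attached to later functions.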