Essential PHP Security

Quite a while ago, O'Reilly sent me a copy of my friend and colleague, Chris Shiflett's book, Essential PHP Security.

When I received it, I read through it quickly and knew it was a good book, but I didn't have much else to say about it, lest I join the ranks of the "me too!"ers; everyone was already saying it's a good book.

Today, I was wondering about session ID regeneration. I know it's important, but I was looking for a "best practice," or opinion on an appropriate level of session ID regeneration.

After a few quick Web searches, I remembered that I have a copy of the aforementioned book. I respect Chris' opinion on such matters, so I pulled it out of my pile.

A glance at the index shows:

session identifier
obtaining, 43
regenerating at session,  46
regenerating for change in privilege, 46
regenerating on every page, 47

Turns out page 47 contains exactly what I was looking for. It's too long to quote here, but the gist is: regenerate only on privilege escalation, not on every page. Regenerating on every page works for the most part, but it causes problems with the back/forward buttons, and needlessly annoys users.
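In code, the gist might look something like this (a sketch of my own, not from the book; the function name and session keys are invented for illustration):

```php
<?php
// "Regenerate on privilege escalation": swap in a fresh session ID at
// the moment a user logs in, so any ID an attacker captured (or
// fixated) before authentication is worthless afterwards.
function grant_login(int $userId, bool $isAdmin): void
{
    session_regenerate_id(true); // true: also delete the old session data
    $_SESSION['user_id']  = $userId;
    $_SESSION['is_admin'] = $isAdmin;
}
```

The same call belongs anywhere privileges change (login, sudo-style re-auth, role switch), and nowhere else.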

Thanks, Chris!

PHP Pie?

I've often had to manipulate large blobs of text—no, make that many files containing large blobs of text.

Of course, my IDE can usually handle simple search-and-replace operations, but on most occasions I appreciate the simplicity of the command-line interface.

That's one of the reasons I love working in a unixy environment, I think. There's a bunch of utilities that embrace the command line and take simple input and deliver equally simple output. I've employed sed and awk, in the past, and I still use them to perform some very simple parsing. For example, I can often be found doing something like ps auxwww | grep ssh | awk '{print $2}' to get a list of ssh process IDs.

But almost anyone who's ever been enlightened to perl pie delights in its power. In a nutshell, I can do something like perl -p -i -e 's/foo/oof/g' somefile from the command line, and perl will digest every line of somefile and perform the substitution. Perl is very well suited to this type of operation, what with its contextual variables and all.

I updated the code a little, below. You now must explicitly set $_.

Read on for my PHP-based solution (lest planet-php truncate my post). I've often found myself looking for a PHP equivalent. Not to do simple substitutions, of course, but complex ones. And since I'm most comfortable with PHP, and I have a huge library of snippets that I can dig out to quell a problem that I may have solved years ago, I've been meaning to fill this void for a while.

Tonight, I had to come home from a dinner party, early, because my daughter was sick. Too bad, it looked like it was going to be an amazing feast, but I digress. The home-on-a-Saturday-night time left me with a bit of free time to solve one of the problems that's been floating around in my head for who-knows-how-long.

Thus, I'm happy to present my—at least mostly—working PHP pie script.

#!/usr/bin/php
<?php

// Change the shebang line above to point at your actual PHP interpreter

$interpreter = array_shift($_SERVER['argv']);
$script = array_shift($_SERVER['argv']);
$files = array_filter($_SERVER['argv']);

if (!$script) {
	fwrite(STDERR, "Usage: $interpreter <script> [files]\n");
	fwrite(STDERR, "  Iterates script over every line of every file.\n");
	fwrite(STDERR, "  \$_ contains data from the current line.\n");
	fwrite(STDERR, "  If files are not provided, STDIN/STDOUT will be used.\n");
	fwrite(STDERR, "\n");
	fwrite(STDERR, "  Example: ./pie.php '\$_ = preg_replace(\"/foo/\",\"oof\",\$_);' testfile\n");
	fwrite(STDERR, "    Replaces every instance of 'foo' with 'oof' in testfile\n");
	fwrite(STDERR, "\n");
	exit(1);
}

// set up the line-processing function
// (create_function() was removed in PHP 8; an eval'ing closure does the same job)
$func = function ($_) use ($script) {
	eval($script . ';');
	return $_;
};

if (!$files) {
	// no files, use STDIN
	$buf = '';
	while (!feof(STDIN)) {
		$buf .= $func(fgets(STDIN));
	}
	echo $buf;
} else {
	foreach ($files as $f) {
		
		if (!is_file($f) || !is_writable($f)) {
			fwrite(STDERR, "Can't write to $f (or it's not a file)\n");
			continue;
		}
		
		$buf = '';
		foreach (file($f) as $l) {
			$buf .= $func($l);
		}
		file_put_contents($f, $buf);
	}
}

?>

Hope it helps someone out there.

Update: I've had some people ask me why I'm reinventing the wheel. I did cover this above—I have plenty of existing PHP code snippets, and almost no perl. I also am very comfortable in PHP, but it's been years since I've been comfortable in perl.

Here's an example of something I hacked up, today. I can (relatively) easily turn this:

dmesg | tail -n5

... which returns this:

[17214721.004000] sdc: assuming drive cache: write through
[17214721.004000]  sdc: sdc1
[17214721.024000] sd 7:0:0:0: Attached scsi disk sdc
[17214721.024000] sd 7:0:0:0: Attached scsi generic sg1 type 0
[17214722.464000] FAT: utf8 is not a recommended IO charset for FAT filesystems, filesystem will be case sensitive!

(the first field is the time since boot... useless for my feeble human brain)

into:

dmesg | ./pie.php 'static $prev = false; static $boot = false; if (!$boot) {
list($boot) = explode(" ", file_get_contents("/proc/uptime"));
$boot = time() - (int) $boot;} if (!$_) return; list($ts, $log) = explode(" ", $_, 2);
$ts = str_replace(array("[","]"), array("",""), $ts); $_ = date("H:i:s", $boot + $ts);
if ($prev && ($diff = round($boot + $ts - $prev, 2))) $_ .= " (+". $diff .")"; 
$_ .= " ".$log; $prev = $boot + $ts;' | tail -n 5

(line breaks added for easier reading)... which returns:

17:07:44 sdc: assuming drive cache: write through
17:07:44  sdc: sdc1
17:07:44 (+0.02) sd 7:0:0:0: Attached scsi disk sdc
17:07:44 sd 7:0:0:0: Attached scsi generic sg1 type 0
17:07:45 (+1.44) FAT: utf8 is not a recommended IO charset for FAT filesystems, filesystem will be case sensitive!

That's the sort of thing I wouldn't be comfortable doing in perl, but I hacked up on the command line in PHP.

You can, but you shouldn't

"Here's the mascot," he said, leaning over one of my two half-walls, handing me a file of papers, "the production guys will get you the artwork. The jokes are at the back. Call me when it's ready."

It's early 2000. I'm toiling away in my pseudo-cube in my hometown. I'm a script monkey. My job consists of writing minimal CFML (oh yeah, ColdFusion) wrappers around boring products, like fish hooks.

Denis, the cube-leaning account manager, had tasked me with a project that was mostly impossible (at the time), but moreover, it was a project that simply shouldn't have been done.

The pitch was delivered earlier in the day. "It'll be great! The user will be browsing the website, and the dog (the dalmatian mascot) will walk onto the screen and tell a joke!"

Now, remember, this is 2000--at the end of the first browser war; the peak of browser non-compliance; a time when developers were using IE as their primary browser and cringed when forced to test code in Netscape (v4) (as opposed to today, when many developers are using a Mozilla-based browser (Netscape's evolved grandson) and are disgusted by the thought of testing on IE).

Technically speaking, we probably could have rigged up a solution that might have worked on most IE installs, but this gave us a convenient excuse to overthrow the marketing fools: the idea was horrible. Our answer was that there was no technology that would allow us to implement the absurd joke-telling-mascot idea.

Sometimes you can, but you shouldn't. This proverb leads nicely into one of my latest web annoyances.

Like any good technically-minded person, I hold more than the average share of pet peeves. Many are related to software, many more to computing in general. A few are directly related to my area of expertise: the Web.

So, I ask you, my fellow Web developers: WHY do you find it necessary to create your own widgets, when there's a good (albeit limited) toolkit available? Stop it. It drives me crazy.

Want examples? Here you go:

Note from future Sean: these were embedded html objects that died years ago. I'm sure you can imagine horrific scroll bars invented by people who don't really understand how scroll bars work.

Admittedly, the stock HTML widgets might not be as pretty as these custom ones, but they WORK, they're consistent from site-to-site, and you don't have to worry about javascript bugs.

For example, on the select boxes, above (digg.com and saq.com), I can't click the box and press the first letter of my suggestion (like I can with real HTML select boxes). The radio buttons don't honour keyboard input, either -- I can't use the arrow keys to advance. Generally, these hacked-together widgets don't respect the tab key, either.

And if you're using a text based browser (for whatever reason), or perhaps screen reading software, you're pretty much out of luck.

You wouldn't draw each pixel in a line of text would you? Of course not! (unless you're this guy).

In short: your pretty site is no nicer than the joke-telling dalmatian. Cut it out.

Security and... Driving? (and Hiring)

There's been a blip on the PHP blogosphere (think what you will of that word, it's accurate) regarding PHP's "inherent security flaws."

I guess it's time to toss in my 2c (even though I was one of the first to reply to Chris' post on this). Since I like similes, I propose the following: coding is like driving.

What? It's pretty simple, if you think about it.

If you drive, you'll follow. If you don't, but have tried, you'll also follow. If you've never tried it, you should. (-:

Coding is like driving. When you start driving, you're really bad at it. Everyone is horrible, even if they aren't aware.

As time passes, and you gain more experience behind the wheel, you're subjected to different driving conditions and new hazardous situations. These eventually make most of us better drivers.

Take me, for example. I grew up in a relatively small city in New Brunswick. I learned to drive there. At the time, there was very little street parking, and as a result, very little parallel parking. I was really bad at parallel parking for a long time. I first started driving when I was 16. It wasn't until I was 20 that some friends and I took my car to the first (and only?) Geek Pride Festival. Closing in on Boston, the roads got wider and wider. Suddenly, I found myself driving on a road that was 4 lanes in each direction. You laugh, but this is daunting for a guy who'd never driven on anything wider than 2 lanes (in each direction), before. I knew to cruise on the right, and pass on the left, but... how do I use those other two lanes? I now live in Montreal, and feel confined when there are only two lanes. (-:

Another parallel is when I learned to drive stick (manual transmission). My first few weeks were quite jumpy... then, my clutch foot smoothed out, and my passengers were relieved.

More food for thought lies in the insurance industry. Now, I'll keep my feelings towards these racketeering slimeballs (mostly) to myself for the purposes of this entry, but they DO do something right: reward experienced drivers (often at the cost of young males, but I digress).

I have a motorcycle license. I had to pass both written and driven tests to be able to ride. Even then, I only qualified for the lower class of bike (< 550cc).

Alright, so what's my point? Simple: new coders are bad at their jobs. I thought I was good at the time, but I was horrible. I'm better now, but in 2 years, I know I'll look back at this and think about how bad I was 2 years ago. New drivers are also bad.

So, the people who control the roads have put a few safeguards into effect to keep these people from hurting others. First, there's graduated licensing in many parts of the world. When I was 16, I had a 12 month waiting period before I could drive by myself, and even then, I had to maintain a 0.00% blood alcohol level whenever driving.

Insurance companies penalize (or, if you're fluent in marketing, "don't reward") new drivers. My insurance payments are now an order of magnitude lower than when I first started driving.

Trucking companies are likely to hire new-grad drivers, but this is because their workforce is scarce. They put their better, more experienced drivers on the most complicated routes. And most taxi drivers I see are well over 30.

Getting offtopic again: New coders are bad. They learn. Some quickly, some not so much. They make mistakes.

So, how do you get around this? Two ways. If you run a small shop, you should ONLY have experienced developers on staff. If your shop is a little bigger, then you can afford (ironically) to pay less to inexperienced devs who can do some grunt work and get a bit of experience under their belts. Make sure that your good devs are reviewing their work, though.

You're effectively enforcing "graduated licensing" on your devs. If they have little experience, give them little power.

That said, I firmly believe (and agree with Marco) that it's not PHP's job to enforce this. Just as I would not expect Plymouth to limit my ability to drive my old Reliant K car. There are rules in place at a higher level, and that's GOOD in my opinion.

PHP is easy, or at least it starts out that way, and then, after a certain threshold, gets more and more complicated, but that's OK. Everything works this way. "Windows" is easy... but when your registry pukes, it takes guru skills to clean it up (or novice skills to find your XP CD to reinstall). Driving is "easy"... just don't put new drivers in a situation they haven't seen before (whiteout/blizzard, collision, black ice, blinding sun, etc.).

The money you save by hiring new grads (without proper mentors/filtering/etc) is often trumped by your exposure to security flaws, bad design, and failure.

A little aside: development shops and otherwise-hiring companies seem to be catching on to this. In the past 3 months, I've had 4 colleagues (former) come to me asking if I know any advanced PHP devs in Montreal who are looking for work... I've made a few suggestions, but most of the GOOD locals I know are already happily employed. If you live here (or are planning on moving here), and you've got LOTS of PHP experience (more than 3 years), have diverse experience, and are genuinely a good coder, let me know, and I'll try to hook you up.

($var == TRUE) or (TRUE == $var)?

Interesting little trick I picked up a while back, been meaning to blog about it.

Prior to enlightenment, I used to write conditionals something like this:

if ($var == SOME_CONSTANT_CONDITION) {
  // do something
}

… more specifically:

if ($var == TRUE) {
  // do the true thing
}

That's how I'd "say" it, so that's how I wrote it. But is it the best way? I now don't think so. When reviewing other people's code (often from C programmers), I've seen "backwards" conditionals... something like:

if (TRUE == $var) {
  // ...
}

Which just sounds weird. Why would you compare a constant to a variable? You'd normally compare a variable to a constant.

So, what’s the big deal?

Well, a few months back, I stumbled on an old article about a backdoor almost sneaking into Linux.

Here’s the almost-break:


if ((options == (__WCLONE|__WALL)) && (current->uid = 0))
  retval = -EINVAL;

Ignore the constants, I don’t know what they mean either. The interesting part is current->uid = 0

See, unless you had your eyes peeled, here, it might look like you’re trying to ensure that current->uid is equal to 0 (uid 0 = root on Linux). So, if options blah blah, AND the user is root, then do something.

But wait. There’s only a single equals sign. The comparison is “==”. “=” is for assignment!

Fortunately, someone with good eyes noticed, and Linux is safe (if this had made it into a release, it would’ve been trivial to escalate your privileges to the root level).. but how many times have you had this happen to you? I’m guilty of accidentally using “=” when I mean “==”. And it’s hard to track down this bug.. it doesn’t LOOK wrong, and the syntax is right, so…
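The failure mode is easy to reproduce in PHP (a tiny sketch of my own; the variable names are invented):

```php
<?php
// The classic typo: "=" where "==" was intended. An assignment is an
// expression whose value is the assigned value, so this condition is
// always true, and $logged_in is silently clobbered as a side effect.
$logged_in = false;

if ($logged_in = true) {   // oops: assignment, not comparison
    $access = 'granted';
}

echo $access; // prints "granted" for every visitor
```

No warning, no notice; the syntax is perfectly legal, which is exactly why it's so hard to spot.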

This is nothing new. Everyone knows the = vs == problem. Everyone is over it (most of the time). But how can we reduce this problem?

A simple coding style adjustment can help enormously here.

Consider changing “$var == TRUE” to “TRUE == $var”.

Why? Simple:

sean@iconoclast:~$ php -r '$a = 0; if (FALSE = $a) $b = TRUE;'
Parse error: parse error in Command line code on line 1

Of course, you can’t ASSIGN $a to the constant FALSE. The same style applied above would’ve caused a similar error in the Linux kernel C code:

if ((options == (__WCLONE|__WALL)) && (0 = current->uid))

Obviously, “0” is a constant value–you cannot assign a value to it. The missing “=” would’ve popped up right away.

Cool. It seems a little awkward at first, but in practice, it makes sense.

HTH.

mail() replacement -- a better hack

This morning, I read Davey's post about how to compile PHP in a way that allows you to specify your own mail() function. This is kind of a cool hack, but I've been using a different approach for a while, now, that allows much better control. Read on if you're interested.

Davey's hack, if you didn't read his post, yet, centers around defining your OWN mail function, after you have instructed PHP not to build the default one.

My hack doesn't require editing of the PHP source, or even a recompile. It doesn't require an auto-prepend, either, but it does require a small change to php.ini.

So, where's the magic? It lies in the sendmail_path directive.

When it comes to mail() (as well as many other things), PHP prefers to delegate the heavy lifting to another piece of software: sendmail (or a sendmail compatible command-line mail transport agent). By default, PHP will call your sendmail binary, and pass it the entire message, after composing it from the headers and body supplied by the developer.

One of the side-benefits to this system is the ability to override PHP's default, and seamlessly hook in your own sendmailesque binary or script.
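The contract is simple: PHP composes the full message and pipes it to the configured command's STDIN, and the command's exit status signals success or failure. A minimal stand-in for that contract (my own sketch; the function name and paths are invented, not from any of the scripts below) might be:

```php
<?php
// Minimal sendmail stand-in logic: read the composed message from a
// stream and append it to a log file instead of delivering it.
// In the real script, $in would be STDIN.
function trap_mail($in, string $logFile): int
{
    $message = stream_get_contents($in);
    file_put_contents($logFile, $message . "\n", FILE_APPEND);
    return 0; // sendmail-style "success" exit status
}
```

Everything that follows (logmail, trapmail, and the SMTP proxy) is just increasingly elaborate versions of this same read-STDIN-and-decide pattern.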

Here's an example from one of my development environments:

sendmail_path=/usr/local/bin/logmail
sean@sarcosm:~$ cat /usr/local/bin/logmail
cat >> /tmp/logmail.log

This little bit of config & code is extremely useful in a non-production environment. How many of us have accidentally sent emails to actual customers from the development server? This trick avoids that: instead of sending the email (as PHP normally would), mail is logged to the /tmp/logmail.log file. Disaster avoided.

But, that file gets pretty big over time... it becomes unmanageable very quickly. So, in a different environment, I have an alternative:

sendmail_path=/usr/local/bin/trapmail
sean@sarcosm:~$ cat /usr/local/bin/trapmail
formail -R cc X-original-cc \
  -R to X-original-to \
  -R bcc X-original-bcc \
  -f -A"To: devteam@example.com" \
| /usr/sbin/sendmail -t -i

And what does this do? It traps all mail that would normally go OUT (say, to a customer), and instead, delivers it to devteam@example.com (with the original fields renamed for debugging purposes).

So, how does all of this solve Davey's problem?

This is something I whipped up after work, today, so it's pretty new code that likely has a few bugs lurking in it, but it's a good start. First, point php.ini at the script:

sendmail_path=/usr/local/bin/mail_proxy.php

#!/usr/bin/php
<?php

//---CONFIG
$config = array(
  'host' => 'localhost',
  'port' => 25,
  'auth' => FALSE,
);
$logDir      = '/www/logs/mail';
$logFile     = 'mail_proxy.log';
$failPrefix  = 'fail_';
$EOL         = "\n"; // change to \r\n if you send broken mail
$defaultFrom = '"example.net Webserver" <www@example.net>';
//---END CONFIG

if (!$log = fopen("{$logDir}/{$logFile}", 'a')) {
  die("ERROR: cannot open log file!\n");
}

require('Mail.php'); // PEAR::Mail
if (PEAR::isError($Mailer = Mail::factory('SMTP', $config))) {
  fwrite($log, ts() . "Failed to create PEAR::Mail object\n");
  fclose($log);
  die();
}

// get headers/body
$stdin = fopen('php://stdin', 'r');
$in = '';
while (!feof($stdin)) {
  $in .= fread($stdin, 1024); // read 1kB at a time
}

list ($headers, $body) = explode("$EOL$EOL", $in, 2);

$recipients = array();
$headers = explode($EOL, $headers);
$mailHdrs = array();
$lastHdr = false;
$recipFields = array('to','cc','bcc');
foreach ($headers AS $h) {
  if (!preg_match('/^[a-z]/i', $h)) {
    // folded (continuation) line: append it to the previous header
    if ($lastHdr && isset($mailHdrs[$lastHdr])) {
      if (is_array($mailHdrs[$lastHdr])) {
        $mailHdrs[$lastHdr][count($mailHdrs[$lastHdr]) - 1] .= "\n$h";
      } else {
        $mailHdrs[$lastHdr] .= "\n$h";
      }
    }
    continue;
  }
  list($field, $val) = explode(': ', $h, 2);
  $lastHdr = $field;
  if (isset($mailHdrs[$field])) {
    $mailHdrs[$field] = (array) $mailHdrs[$field];
    $mailHdrs[$field][] = $val;
  } else {
    $mailHdrs[$field] = $val;
  }
  if (in_array(strtolower($field), $recipFields)) {
    if (preg_match_all('/[^ ;,]+@[^ ;,]+/', $val, $m)) {
      $recipients = array_merge($recipients, $m[0]);
    }
  }
}
if (!isset($mailHdrs['From'])) {
  $mailHdrs['From'] = $defaultFrom;
}

$recipients = array_unique($recipients); // remove dupes

// send
if (PEAR::isError($send = $Mailer->send($recipients, $mailHdrs, $body))) {
  $fn = uniqid($failPrefix);
  file_put_contents("{$logDir}/{$fn}", $in);
  fwrite($log, ts() ."Error sending mail: $fn (". $send->getMessage() .")\n");
  $ret = 1; // fail
} else {
  fwrite($log, ts() ."Mail sent ". count($recipients) ." recipients.\n");
  $ret = 0; // success
}
fclose($log);
exit($ret); // report sendmail-style success/failure to the caller

//////////////////////////////

function ts()
{
  return '['. date('y.m.d H:i:s') .'] ';
}

?>

Voila. SMTP mail from a unix box that may or may not have a MTA (like sendmail) installed.

Don't forget to change the CONFIG block.

XSS Woes

A predominant PHP developer (whose name I didn't get permission to drop, so I won't, but many of you know who I mean) has been doing a bunch of research related to Cross Site Scripting (XSS), lately. It's really opened my eyes to how much I take user input for granted.

Don't get me wrong. I write by the "never trust users" mantra. The issue, in this case, is something abusable that completely slipped under my radar.

Most developers worth their paycheque, I'm sure, know the common rules of "never trust the user", such as "escape all user-supplied data on output," "always validate user input," and "don't rely on something not in your control to do so (ie. Javascript cannot be trusted)." "Don't output unescaped input" goes without saying, in most cases. Only a fool would "echo $_GET['param'];" (and we're all foolish sometimes, aren't we?).

The problem that was demonstrated to me exploited something I considered to be safe: the filename portion of the request URI. Now I know just how wrong I was.

Consider this: you build a simple script; let's call it simple.php but that doesn't really matter. simple.php looks something like this:

<html>
 <body>
  <?php
  if (isset($_REQUEST['submitted']) && $_REQUEST['submitted'] == '1') {
    echo "Form submitted!";
  }
  ?>
  <form action="<?php echo $_SERVER['PHP_SELF']; ?>">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

Alright. Let's put this script at: http://example.com/tests/simple.php. On a properly-configured web server, you would expect the script to always render to this, on request:

<html>
 <body>
  <form action="/tests/simple.php">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

Right? No.

What I forgot about, as I suspect some of you have, too (or maybe I'm the only loser who didn't think of this (-; ), is that $_SERVER['PHP_SELF'] can be manipulated by the user.

How's that? If I put a script at /simple/test.php, $_SERVER['PHP_SELF'] should always be "/simple/test.php", right?

Wrong, again.

See, there's a feature of Apache (I think it's Apache, anyway) that you may have used for things like short URLs, or to optimize your query-string-heavy website to make it search-engine friendly: $_SERVER['PATH_INFO']-based URLs.

Quickly, this is when scripts are able to receive extra data in the URL, after the script name but before the question mark that separates the file from the query string. In a URL like http://www.example.com/download.php/path/to/file, download.php would be executed, and /path/to/file would (usually, depending on config) be available to the script via $_SERVER['PATH_INFO'].

The quirk is that $_SERVER['PHP_SELF'] contains this extra data, opening up the door to potential attack. Even something as simple as the code above is vulnerable to such exploits.

Let's look at our simple.php script, again, but requested in a slightly different manner: http://example.com/tests/simple.php/extra_data_here

It would still "work"--the output, in this case, would be:

<html>
 <body>
  <form action="/tests/simple.php/extra_data_here">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

I hope that the problem is now obvious. Consider: http://example.com/tests/simple.php/%22%3E%3Cscript%3Ealert('xss')%3C/script%3E%3Cfoo

The output suddenly becomes very alarming:

<html>
 <body>
  <form action="/tests/simple.php/"><script>alert('xss')</script><foo">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

If you ignore the obviously-incorrect <foo"> tag, you'll see what's happening. The would-be attacker has successfully exploited a critical (if you consider XSS critical) flaw in your logic, and, by getting a user to click the link (even through a redirect script), he has executed the Javascript of his choice on your user's client (obviously, this requires the user to have Javascript enabled). My alert() example is non-malicious, but it's trivial to write similarly-invoked Javascript that changes the action of a form, or usurps cookies (and submits them in a hidden iframe, or through an image tag's URL, to a server that records this personal data).

The solution should also be obvious. Convert the user-supplied data to entities. The code becomes:

<html>
 <body>
  <?php
  if (isset($_REQUEST['submitted']) && $_REQUEST['submitted'] == '1') {
    echo "Form submitted!";
  }
  ?>
  <form action="<?php echo htmlentities($_SERVER['PHP_SELF']); ?>">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

And an attack, as above, would be rendered:

<html>
 <body>
  <form action="/tests/simple.php/&quot;&gt;&lt;script&gt;alert('xss')&lt;/script&gt;&lt;foo">
   <input type="hidden" name="submitted" value="1" />
   <input type="submit" value="Submit!" />
  </form>
 </body>
</html>

This still violates the assumption that the script name and path are the only data in $_SERVER['PHP_SELF'], but the payload has been neutralized.
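You can see the neutralization offline with a quick sketch (the payload string is copied from the example URL above; no web server required):

```php
<?php
// Simulate the PATH_INFO-tainted PHP_SELF value and neutralize it
// before echoing it into the form's action attribute.
$phpSelf = '/tests/simple.php/"><script>alert(\'xss\')</script><foo';

$safe = htmlentities($phpSelf);
echo '<form action="' . $safe . '">';
// The quotes and angle brackets are now inert entities, so the
// attacker can no longer break out of the attribute.
```

(htmlspecialchars() would do equally well here; you only need the HTML-significant characters converted, not every entity.)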

Needless to say, I felt silly for not thinking of such a simple exploit, earlier. As the aforementioned PHP developer said, at the time (to paraphrase): if guys who consider themselves experts in PHP development don't notice these things, there's little hope for the unwashed masses who have just written their first 'echo "hello world!\n";'. He's working on a generic user-input filtering mechanism that can be applied globally to all user input. Hopefully we'll see it in PECL, soon. Don't forget about the other data in $_SERVER, either..

... ...

Upon experimenting with this exploit on my own server (and watching the raw data in my _SUPERGLOBALS, conveniently, via phpinfo()), I noticed something very interesting that reminded me that even though trusting this data was a stupid mistake on my part, I'm not the only one to do so. A fun (and by fun, I mean nauseating) little game to play: create a file called "info.php" (or whatever name you like). In it, place only "<?php phpinfo(); ?>". Now request it like this: http://your-server/path/to/info.php/%22%3E%3Cimg%20src=http://www.perl.com/images/75-logo.jpg%3E%3Cblah

Nice huh? A little less nauseating: it's fixed in CVS.

Fun with the tokenizer...

I was reminded, this past week, of how cool the tokenizer is.

One of the guys who works in the same office as I do had what seemed to be a simple problem: he had a php file that contained ~50 functions, and wanted to summarize the API without parsing through the file, manually, and cutting out the function declarations.

We introduced him to inline phpdoc blocks (he works in the same office as a Jr.-level PHP developer, but for a different company, so he doesn't have to follow our coding standards, but I digress...), but the 50-function library in question didn't have docblocks.

Sure, he could (and did) pull up a list of function NAMES with get_defined_functions (I assume by using array_diff against a before-and-after capture), but this didn't give him the argument names, or even the number of arguments for a given function, so I broke out some old tokenizer code I'd written.

In case you aren't familiar with the tokenizer, the PHP manual defines it as:

“[an interface to let you write] your own PHP source analyzing or modification tools without having to deal with the language specification at the lexical level.”

The extension (which has been part of the PHP core distribution since 4.3.0) consists of only two functions, token_get_all and token_name, plus a boatload of T_* constants.
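To get a feel for what token_get_all() returns before diving into the big script, here's a tiny sketch (the input string is my own example; the exact token list varies slightly by PHP version):

```php
<?php
// Each element of the returned array is either a bare string (for
// one-character tokens like ';') or an array of
// [token ID, source text, line number].
$tokens = token_get_all('<?php echo $greeting; ?>');

foreach ($tokens as $t) {
    if (is_array($t)) {
        echo token_name($t[0]) . ': ' . rtrim($t[1]) . "\n";
    } else {
        echo "literal: $t\n";
    }
}
```

The parsing code below is just a state machine walking that array, watching for T_CLASS and T_FUNCTION tokens.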

Enough babble, though, let's get to the meat. I pulled out this code I'd written for PEARClops (on EFNet #PEAR) that parses PHP source files and figures out what classes, functions/methods and associated parameters are included.

<?php

function get_protos($in)
{
  if (is_file(realpath($in)))
  {
    $in = file_get_contents($in);
  }
  $tokens = token_get_all($in);
  $funcs = array();
  $currClass = '';
  $classDepth = 0;

  for ($i=0; $i<count($tokens); $i++)
  {
    if (is_array($tokens[$i]) && $tokens[$i][0] == T_CLASS)
    {
      ++$i; // whitespace;
      $currClass = $tokens[++$i][1];
      while ($tokens[++$i] != '{') {}
      ++$i;
      $classDepth = 1;
      continue;
    }
    elseif (is_array($tokens[$i]) && $tokens[$i][0] == T_FUNCTION)
    {
      $nextByRef = FALSE;
      $thisFunc = array();
      
      while ($tokens[++$i] != ')')
      {
        if (is_array($tokens[$i]) && $tokens[$i][0] != T_WHITESPACE)
        {
          if (!$thisFunc)
          {
            $thisFunc = array(
              'name'  => $tokens[$i][1],
              'class' => $currClass,
            );
          }
          else
          {
            $thisFunc['params'][] = array(
              'byRef'   => $nextByRef,
              'name'    => $tokens[$i][1],
            );
            $nextByRef = FALSE;
          }
        }
        elseif ($tokens[$i] == '&')
        {
          $nextByRef = TRUE;
        }
        elseif ($tokens[$i] == '=')
        {
          while (!in_array($tokens[++$i], array(')',',')))
          {
            if ($tokens[$i][0] != T_WHITESPACE)
            {
              break;
            }
          }
          $thisFunc['params'][count($thisFunc['params']) - 1]['default'] = $tokens[$i][1];
        }
      }
      $funcs[] = $thisFunc;
    }
    elseif ($tokens[$i] == '{')
    {
      ++$classDepth;
    }
    elseif ($tokens[$i] == '}')
    {
      --$classDepth;
    }

    if ($classDepth == 0)
    {
      $currClass = '';
    }
  }

  return $funcs;
}

function parse_protos($funcs)
{  
  $protos = array();
  foreach ($funcs AS $funcData)
  {
    $proto = '';
    if ($funcData['class'])
    {
      $proto .= $funcData['class'];
      $proto .= '::';
    }
    $proto .= $funcData['name'];
    $proto .= '(';
    if ($funcData['params'])
    {
      $isFirst = TRUE;
      foreach ($funcData['params'] AS $param)
      {
        if ($isFirst)
        {
          $isFirst = FALSE;
        }
        else
        {
          $proto .= ', ';
        }

        if ($param['byRef'])
        {
          $proto .= '&';
        }
        $proto .= $param['name'];
      }
    }
    $proto .= ")";
    $protos[] = $proto;
  }
  return $protos;
}

echo "Functions in {$_SERVER['argv'][1]}:\n";
foreach (parse_protos(get_protos($_SERVER['argv'][1])) AS $proto)
{
  echo "  $proto\n";
}
?>

Save it as "token_funcs_cli.php" (or whatever you like) and call it like so: php token_funcs_cli.php /path/to/php_file

For instance:

sean@iconoclast:~/php/scripts$ php token_funcs_cli.php ~/php/cvs/Mail_Mime/mime.php
Functions in /home/sean/php/cvs/Mail_Mime/mime.php:
  Mail_mime::Mail_mime($crlf)
  Mail_mime::__wakeup()
  Mail_mime::setTXTBody($data, $isfile, $append)
  Mail_mime::setHTMLBody($data, $isfile)
  Mail_mime::addHTMLImage($file, $c_type, $name, $isfilename)
  Mail_mime::addAttachment($file, $c_type, $name, $isfilename, $encoding)
  Mail_mime::_file2str(&$file_name)
  Mail_mime::_addTextPart(&$obj, $text)
  Mail_mime::_addHtmlPart(&$obj)
  Mail_mime::_addMixedPart()
  Mail_mime::_addAlternativePart(&$obj)
  Mail_mime::_addRelatedPart(&$obj)
  Mail_mime::_addHtmlImagePart(&$obj, $value)
  Mail_mime::_addAttachmentPart(&$obj, $value)
  Mail_mime::get(&$build_params)
  Mail_mime::headers(&$xtra_headers)
  Mail_mime::txtHeaders($xtra_headers)
  Mail_mime::setSubject($subject)
  Mail_mime::setFrom($email)
  Mail_mime::addCc($email)
  Mail_mime::addBcc($email)
  Mail_mime::_encodeHeaders($input)
  Mail_mime::_setEOL($eol)

Not bad, huh?

There are some not-so-obvious bugs (inheritance, mostly), but for a relatively short script, it does a pretty good job.

This post doesn’t exactly fit the normal theme of my blog, but over the past few weeks, several people have asked me about this, so I thought it was worth jotting down a few notes.

In January 2021, after eyeballing the specs and possibilities for the past few months, I splurged and ordered the Anova Precision Oven. I’ve owned it for over a year now, and I use it a lot. But I wish it were quite a bit better.

There were a few main features of the APO that had me interested.

First, we have a really nice Wolf stove that came with our house. The range hood is wonderful, and the burners are great. The oven is also good when we actually need it (and we do still need it, sometimes; see below), but it’s propane, so there are a few drawbacks. It takes a while to heat up because of a smart safety feature: a glow plug won’t let gas flow until it’s built up enough heat to ignite the gas, preventing a situation where the oven has an ideal gas-air mix and is ready to explode. It’s also big. And while I love propane for the burners, it’s (mostly) unnecessary for the oven: not only is it relatively expensive to run (we have a good price on electricity in Quebec because of past investments in giant hydro-electric projects), it measurably reduces the air quality in the house if the hood fan isn’t running (and running the fan in the dead of winter or summer cools/heats the house in opposition to our preference).

The second feature that had me really interested in the APO is the steam. I’ve tried, and mostly failed, many times to get my big oven (this gas one and my previous electric oven) to act like a steam oven. Despite trying the tricks like a pan of water to act as a hydration reservoir, and spraying the walls with a mist of water, it never really steamed like I’d hoped—especially when making baguette.

I’m happy to say that the APO meets both of these needs very well: it’s pretty quick to heat up—mostly because it’s smaller; I do think it’s under-powered (see below)—and the steam works great.

There are, however, a bunch of things wrong with the APO.

The first thing I noticed, after unpacking it and setting it up the first time, is that it doesn’t fit a half sheet pan. It almost fits. I’m sure there was a design or logistics restriction (like maybe these things fit significantly more on a pallet or container when shipping), but sheet pans come in standard sizes, and it’s a real bummer that I not only can’t use the pans (and silicone mats) I already owned, but finding the right sized pan for the APO is also difficult (I bought some quarter and eighth sheet pans, but they don’t fill up the space very well).

Speaking of the pan: the oven comes with one. That one, however, was unusable. It’s made in such a way that it warps when it gets hot. Not just a little bit—a LOT. So much that if there happens to be liquid on the pan, it will launch that liquid off of the pan and onto the walls of the oven when the pan deforms abruptly. Even solids are problematic on the stock pan. I noticed other people complaining online about this and that they had Anova Support send them a new pan. I tried this. Support was great, but the pan they sent is unusable in a different way: they “solved” the warping problem by adding rigidity to the flat bottom part of the pan by pressing ribs into it. This makes the pan impossible to use for baking anything flat like bread or cookies.

I had to contact Support again a few months later, this time about the water tank (the oven uses this for steam, but also, even when steam mode is 0%, to improve the temperature reading by feeding some of the water to the thermometer, in order to read the “wet bulb” temperature). The tank didn’t leak, but the clear plastic cracked in quite a large pattern, threatening to dump several litres of water all over my kitchen at any moment. Support sent me a new tank without asking many questions. Hopefully the new one holds up; it hasn’t cracked yet, after ~3 months.

Let’s talk about the steam for a moment: it’s great. I can get a wonderful texture on my breads by cranking it up, and it’s perfect for reheating foods that are prone to drying out, such as mac & cheese—it’s even ideal to run a small amount of steam for reheating pizza that might be a day or two too old. I rarely use our microwave oven for anything non-liquid (melting butter, reheating soups), and the APO is a great alternative way to reheat leftovers (slower than the microwave, sure, but it doesn’t turn foods into rubber, so it’s worth trading time for texture).

So it’s good for breads? Well, sort of. The steam is great for the crust, definitely. However, it has a couple problems. I mentioned above that it’s under-powered, and what I mean by that is two-fold: it has a maximum temperature of 250°C (482°F), and takes quite a long time to recover from the door opening—like, 10 minutes long. Both of these are detrimental to making an ideal bread. I’d normally bake bread at a much higher temperature—I do 550°F in the big oven, and pizza even hotter (especially in the outdoor pizza oven which easily gets up to >800°F). 482°F is—at least in my casual reasoning—pretty bad for “oven spring”. My baguettes look (and taste) great, but they’re always a bit too flat. The crust forms, but the steam bubbles don’t expand quite fast enough to get the loaf to inflate how I’d like. The recovery time certainly doesn’t help with this, either. I’ve managed to mitigate the slow-reheat problem by stacking a bunch of my cast iron pans in the oven to act as a sort of thermal ballast, and help the oven recover more quickly.

Also on the subject of bread: the oven is great for proofing/rising yeast doughs. Well, mostly great. It does a good job of holding the oven a bit warmer than my sometimes-cold-in-winter kitchen, and even without turning on the steam, it seems to avoid drying out the rising dough. I say “mostly” because one of the oven’s fans turns on whenever the oven is “on”, even at low temperatures. The oven has a pretty strong convection fan which is great, but this one seems to be the fan that cools the electronics. I realize this is necessary when running the oven itself, but it’s pretty annoying for the kitchen to have a fairly-loud fan running for 24-48+ hours while baguette dough is rising at near-ambient temperatures.

The oven has several “modes” where you can turn on different heating elements inside the oven. The main element is the “rear” one, which requires convection, but there’s a lower-power bottom element that’s best for proofing, and a top burner that works acceptably (it’s much less powerful than my big gas oven, for example) for broiling. One huge drawback to the default rear+convection mode, though, is that the oven blows a LOT of bubbling liquid all over the place when it’s operating. This means that it gets really dirty, really quickly (see the back wall in the photo with the warped pan, above). Much faster than my big oven (even when running the convection fan over there). This isn’t the end of the world, but it can be annoying.

The oven has controls on the door, as well as an app that works over WiFi (locally, and even when remote). I normally don’t want my appliances to be on the Internet (see Internet-Optional Things), but the door controls are pretty rough. The speed-up/slow-down algorithm they use when you hold the buttons to change the temperature is painful: it always overshoots or goes way too slow. They’ve improved this slightly, with a firmware update, but it’s still rough.

The app is a tiny bit better, but it has all of the problems you might expect from a platform-agnostic mobile app that’s clearly built on a questionable web framework. The UI is rough. It always defaults to the wrong mode for me (I rarely use the sous-vide mode), and doesn’t seem to allow things like realtime temperature changes without adding a “stage” and then telling the oven to go to that stage. It’s also dangerous: you can tell the app to turn the oven on, without any sort of “did one of the kids leave something that’s going to catch fire inside the oven” interlock. I’d much prefer (even as optional configuration) a mode where I’m required to open and close the door within 90 seconds of turning the oven on, or it will turn off, or something like that.

Speaking of firmware… one night last summer, while I was sitting outside doing some work, my partner sent me a message “did you just do something to the oven? it keeps making the sound like it’s just turned on.” I checked the app and sure enough, it just did a firmware update. I told her “it’s probably just restarted after the firmware update.” When I went inside a little while later, I could hear it making the “ready” chime over and over. Every 10-15 seconds or so. I didn’t realize this is what she’d meant. I tried everything to get it to stop, but it was in a reboot loop. We had to unplug it to save our sanity. Again, I looked online to see if others were having this issue, and sure enough, there were thousands of complaints about how everyone’s ovens were doing this same thing. Some people were about to cook dinner, others had been rising bread for cooking that night, but we all had unusable ovens. They’d just reboot over and over, thanks to a botched (and automatic!) firmware update. Anova fixed this by the next morning, but it was a good reminder that software is terrible, and maybe our appliances shouldn’t be on the Internet. (I’ve since put it back online because of the aforementioned door controls and the convenience of the—even substandard—app. I wish we could just use the door better, though.)

So, should you buy it? Well, I don’t know. Truthfully, I’m happy we have this in our house. It’s definitely become our main oven, and it fits well in our kitchen (it’s kind of big, but we had a part of the counter top that turned out perfect for this). It needs its own circuit, really, and is still underpowered at 120V (~1800W). However, I very very often feel like I paid a lot of money to beta test a product for Anova (it was around the same price as I paid for my whole slide-in stove (oven + burners, “range”), at the previous house), and that’s a bummer.

If they announce a Version 2 that fixes the problems, I’d definitely suggest getting that, or even V1 if you need it sooner, and are willing to deal with the drawbacks—I just wish you didn’t have to.

In the previous post, we talked about Python serverless architectures on Amazon Web Services with Zappa.

In addition to the previously-mentioned benefits of being able to concentrate directly on the code of apps we’re building, instead of spending effort on running and maintaining servers, we get a few other new tricks. One good example of this is that we can allow our developers to deploy (Zappa calls this update) to shared dev and QA environments directly, without having to involve anyone from ops (more on this in another post), and without even requiring a build/CI system to push out these types of builds.

That said, we do use a CI system for this project, but it differs from our traditional setup. In the past, we used Jenkins, but found it a bit too heavy. Our current non-Lambda setup uses Buildbot to do full integration testing (it not only runs our apps’ test suites, but it also spins up EC2 nodes, provisions them with Salt, and makes sure they pass the same health checks that our load balancers use to ensure the nodes should receive user requests).

On this new architecture, we still have a test suite, of course, but there are no nodes to spin up (Lambda handles this for us), no systems to provision (the “nodes” are containers that hold only our app, Amazon’s defaults, and Zappa’s bootstrap), and not even any load balancers to keep healthy (this is API Gateway’s job).

In short, our tests and builds are simpler now, so we went looking for a simpler system. Plus, we didn’t want to have to run one or more servers for CI if we’re not even running any (permanent) servers for production.

So, we found LambCI. It’s not a platform we would normally have chosen—we do quite a bit of JavaScript internally, but we don’t currently run any other Node.js apps. It turns out that the platform doesn’t really matter for this, though.

LambCI (as you might have guessed from the name) also runs on Lambda. It requires no permanent infrastructure, and it was actually a breeze to set up, thanks to its CloudFormation template. It ties into GitHub (via AWS SNS), and handles core duties like checking out the code, running the suite only when configured to do so, and storing the build’s output in S3. It’s a little bit magical—the good kind of magic.

It’s also very generic. It comes with some basic bootstrapping infrastructure, but otherwise relies primarily on configuration that you store in your Git repository. We store our build script there, too, so it’s easy to maintain. Here’s what our build script (do_ci_build) looks like (I’ve edited it a bit for this post):

#!/bin/bash

# more on this in a future post
export PYTHONDONTWRITEBYTECODE=1

# run our test suite with tox and capture its return value
pip install --user tox && tox
tox_ret=$?

# if tox fails, we're done
if [ $tox_ret -ne 0 ]; then
    echo "Tox didn't exit cleanly."
    exit $tox_ret
fi

echo "Tox exited cleanly."

set -x

# use LAMBCI_BRANCH unless LAMBCI_CHECKOUT_BRANCH is set
# this is because lambci considers a PR against master to be the PR branch
BRANCH=$LAMBCI_BRANCH
if [[ ! -z "$LAMBCI_CHECKOUT_BRANCH" ]]; then
    BRANCH=$LAMBCI_CHECKOUT_BRANCH
fi

# only do the `zappa update` for these branches
case $BRANCH in
    master)
        STAGE=dev
        ;;
    qa)
        STAGE=qa
        ;;
    staging)
        STAGE=staging
        ;;
    production)
        STAGE=production
        ;;
    *)
        echo "Not doing zappa update. (branch is $BRANCH)"
        exit $tox_ret
        ;;
esac

echo "Attempting zappa update. Stage: $STAGE"

# we remove these so they don't end up in the deployment zip
rm -r .tox/ .coverage

# virtualenv is needed for Zappa
pip install --user --upgrade virtualenv

# now build the venv
virtualenv /tmp/venv
. /tmp/venv/bin/activate

# set up our virtual environment from our requirements.txt
/tmp/venv/bin/pip install --upgrade -r requirements.txt --ignore-installed

# we use the IAM profile on this lambda container, but the default region is
# not part of that, so set it explicitly here:
export AWS_DEFAULT_REGION='us-east-1'

# do the zappa update; STAGE is set above and zappa is in the active virtualenv
zappa update $STAGE

# capture this value (and in this version we immediately return it)
zappa_ret=$?
exit $zappa_ret

This script, combined with our .lambci.json configuration file (also stored in the repository, as mentioned, and read by LambCI on checkout), is pretty much all we need:

{
    "cmd": "./do_ci_build",
    "branches": {
        "master": true,
        "qa": true,
        "staging": true,
        "production": true
    },
    "notifications": {
        "sns": {
            "topicArn": "arn:aws:sns:us-east-1:ACCOUNTNUMBER:TOPICNAME"
        }
    }
}

With this setup, our test suite runs automatically on the selected branches (and on pull request branches in GitHub), and if that’s successful, it conditionally does a zappa update (which builds and deploys the code to existing stages).
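The branch-to-stage selection is easy to exercise on its own. Here's a minimal sketch of that mapping pulled out into a standalone shell function (the names mirror the case statement in do_ci_build):

```shell
#!/bin/sh
# Minimal sketch: the branch -> stage mapping from do_ci_build, extracted
# into a function so it can be tested without running the whole build.
branch_to_stage() {
  case "$1" in
    master)     echo dev ;;
    qa)         echo qa ;;
    staging)    echo staging ;;
    production) echo production ;;
    *)          return 1 ;;  # any other branch: tests still run, but no deploy
  esac
}

branch_to_stage master                       # prints "dev"
branch_to_stage production                   # prints "production"
branch_to_stage feature/foo || echo "skip"   # prints "skip"
```

Keeping the mapping this dumb is deliberate: the deploy target is decided entirely by the branch name, so there's no per-developer configuration to drift out of sync.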

Oh, and one of the best parts: we only pay for builds when they run. We’re not paying hourly for a CI server to sit around doing nothing on the weekend, overnight, or when it’s otherwise idle.

There are a few limitations (such as a time limit on Lambda functions, which means that the test suite + build must run within that time limit), but frankly, those haven’t been a problem yet.
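One cheap way to surface that constraint before it bites in CI is to run the suite under a similar budget locally. A sketch, with an assumed budget (the 270-second figure is just headroom under a five-minute Lambda limit, not something LambCI prescribes):

```shell
# Hypothetical local guard: run the suite under a time budget comparable to
# the Lambda limit, so a runaway test fails fast here instead of timing out in CI.
BUDGET=270  # seconds; an assumed value -- adjust to your Lambda configuration
timeout "$BUDGET" sh -c 'echo "suite finished"' || echo "suite exceeded budget"
```

(The `echo` stands in for your real test command, e.g. `tox`.)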

If you need simple builds/CI, it might be exactly what you need.