UTF: WTF?

2008-Nov-23

Note: This article first ran in php|architect in March 2008, while I still worked at MTA. Marco (the publisher, and my former colleague) has graciously agreed to allow me to republish this in a more public forum. I've wanted to link a few people to it in the past few months and until now that was only possible if they were php|architect subscribers. That said, if you're into PHP, you really should subscribe to php|a.

As you might know, one of my roles at php|architect is to organize and manage speakers (and their talks) for our PHP conferences.

A while back, PHP 6's main proponent, Andrei Zmievski, submitted a talk that we accepted, entitled "I ♥ Unicode, You ♥ Unicode." When we selected the talk and invited Andrei to attend the conference, he accepted and humorously suggested that we pay special attention to the talk's heart characters when publishing details on the conference website and in other promotional materials. I took his suggestion as wise advice, and double checked the site before releasing it to the public—it worked perfectly.

Within a few hours of publication, Andrei dropped me a note indicating that I hadn't heeded his warning, and that the ♥s weren't showing up properly. The problem turned out to be a bug in a specific version of Firefox, and I believe we resolved it by employing the &hearts; entity. This ordeal, while minor, was my first taste of how bad things would become.

If I had to guess, I would estimate that I've spent somewhere in the range of 40 hours wrangling UTF-8 in the past 3 months, which is not only expensive for my employer, but also disheartening as a developer who's got real work to do. Admittedly, this number is inflated, due to the heavy development cycle we completed with the launch of our new site. As time goes on, though, I don't see this situation improving in the short term (though, if we were to glimpse much further into the future, I'm sure we'll eventually consider this a solved problem).

The main problem with using Unicode, today, is that it's partially supported by some parts of any given tool chain. Sometimes it works great, and other times—due to a given piece of software's lack of implementation (or worse, a partial implementation), human error, or full-on bugs—the chain's weakest link shatters in a non-spectacular way.

As any experienced developer knows, having the weak point of a process collapse is a normal part of building complex systems. We're used to it, and we usually manage this by making the systems less complex, by eliminating the parts that are prone to collapse, or by fixing the broken parts. When implementing a system that may contain Unicode data, today, we're challenged with many potential points of failure that are often difficult to identify, and nearly impossible to replace.

To illustrate, consider an overly simplified web development work—and content delivery—flow: developer creates a file, developer edits file, developer uploads the files to the web server, httpd receives a request from a browser, httpd passes the request to PHP, PHP delivers content back to httpd, httpd delivers content to the visitor's browser. If a single part of this flow fails to handle Unicode properly, a snowball effect causes the rest of the chain to fail.

A more typical flow for me (and our code) goes something like this: create file, edit file, commit file to svn, other developers edit file, others commit to svn, release is rolled from svn, visitor browser requests page, httpd parses request, httpd delivers request to PHP, PHP processes request, PHP (client) calls service to fulfill back-end portions of request (encodes the request in an envelope—we use JSON most of the time), PHP (service) receives request, service retrieves and/or stores data in database, service returns data to PHP client, PHP client processes returned data and in turn delivers it to httpd, httpd returns data to browser.

If you'll bear with me for one last list in this article, that means that any (one or more!) of the following could fail when handling unicode: developers' editors, developers' transport (either upload or version control), user's browser, user's http proxy, client-side httpd, client-side PHP, client-side encoder (JSON), service-side httpd (especially HTTP headers), service-side decoder, service-side PHP, service-side database client, database protocol character set imbalance, database table charset, database server, service-side encoder, client-side decoder, client-side PHP (again), client-side httpd (including HTTP headers, again), user's proxy (again), and user's browser (again). I've probably even left some out.

As you can see, there are so many points of failure here, that determining the source of an invalid UTF-8 character is torturous, at best.

Recently, I had to wrestle UTF-8 monsters. In my case, it was a combination of user (me) error and an actual bug in PHP, but it was so non-obvious that it caused most of my day to melt away, trying to resolve the issue. In my case, I had decided to split a file that contained UTF-8 characters into two files. By default, my editor of choice creates new files using my system character encoding—which happened to be Mac-Roman because I hadn't changed it from Leopard's default. The original file was UTF-8, and the characters displayed normally in the new Mac-Roman file. However, when the data was passed to PHP's json_encode function, the string was arbitrartily truncated, due to a PHP bug .

Because the script that triggered the bug pulled the data from a database, and the data was inserted by another script—the one with the broken encoding/characters—it took me entirely too long to trace it back to the change I'd made to that now-split file. For a while, I even thought that MySQL was storing the data poorly because we'd had problems with that before, and also because the database client I was using that day was reporting the characters improperly, due to its own encoding issues. I believe my blood pressure skyrocketed to dangerous levels, that afternoon.

Universal Unicode support is going to be a long uphill battle. I'm not sure I'm ready for it, but I hope it's worth it, nonetheless.

☹