A very long time ago (three and a half years ago), I wrote a little utility to help us with the 2008 edition of PHP Advent. The utility is called Lexentity, and my recent blogging uptick made me realize that I’ve never actually written about it on here, so here it is (mostly borrowed from the README).
Let's face it--this sentence is much "uglier" than the one below it.
Let’s face it–this sentence is much “prettier” than the one above it.
Lexentity is a simple piece of software that takes HTML as input and outputs a context-aware, medium-neutral representation of that HTML, with apostrophes, quotes, em dashes, ellipses, accents, etc., replaced with their respective numeric XML/Unicode entities.
Context is important. It is especially important when considering a piece of HTML like this:
<p>…and here's the example code:</p> <pre><code>echo "watermelon!\n";</code></pre>
Contextually, you’d want here's to become here’s (note the apostrophe), but you certainly don’t want the code to read echo “watermelon!\n”; instead. A curly (smart) apostrophe is appropriate in the prose, but curly quotes in the code are likely to cause a parse error.
Lexentity understands its context and acts appropriately by means of lexical analysis: it breaks the input into tokens and turns those tokens back into text, rather than relying on a mostly naive and overly complicated regular expression.
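To make the idea concrete, here is a rough sketch of context-aware replacement. It is not Lexentity's actual code (Lexentity has its own lexer); it leans on Python's standard html.parser as the tokenizer, only handles apostrophes, and drops comments and doctypes for brevity. The names ApostropheCurler, VERBATIM_TAGS, and curl_apostrophes are made up for this illustration, as is the choice of which tags count as "code".

```python
from html.parser import HTMLParser

# Assumption: tags whose contents should be left untouched.
VERBATIM_TAGS = {"pre", "code", "script", "style"}


class ApostropheCurler(HTMLParser):
    """Re-emit HTML token by token, curling apostrophes only in prose."""

    def __init__(self):
        # Keep existing entities as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.verbatim_depth = 0  # > 0 while inside a verbatim tag

    def handle_starttag(self, tag, attrs):
        if tag in VERBATIM_TAGS:
            self.verbatim_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> carry no text content.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VERBATIM_TAGS and self.verbatim_depth > 0:
            self.verbatim_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.verbatim_depth == 0:
            # U+2019 RIGHT SINGLE QUOTATION MARK as a numeric entity.
            data = data.replace("'", "&#8217;")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def curl_apostrophes(html):
    parser = ApostropheCurler()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feeding it a paragraph followed by a pre/code block changes the apostrophe in the paragraph and leaves the code alone. The point is that the decision is made per token, with knowledge of which tags are currently open, which is exactly what a single search-and-replace pattern can't see.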
Regarding context, my friend and former colleague Jon Gibbins said it best in this piece on his blog: In modern systems, you can’t count on your HTML to always be represented as HTML. It’s often (poorly) embedded in RSS or other HTML-like media, as XML.
Therefore, it is important to avoid HTML-specific entities like &hellip;, and instead use their Unicode code points to form numeric entities such as &#8230;. This ensures proper display on any (for small values of “any”) terminal that can properly render Unicode XML, and avoids missing entity errors.
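As a small illustration of the named-to-numeric idea (again, a sketch and not how Lexentity itself does it), Python's standard html.entities.name2codepoint table maps HTML entity names to code points. The function name to_numeric_entities and the XML_SAFE set are assumptions for this example, and unlike the real tool this version pays no attention to whether it is inside code.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines; these are safe to leave alone.
XML_SAFE = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_entities(text):
    """Replace HTML-only named entities with numeric ones."""

    def replace(match):
        name = match.group(1)
        # Leave XML's own entities and unknown names untouched.
        if name in XML_SAFE or name not in name2codepoint:
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)
```

With this, &hellip; comes out as &#8230; and &eacute; as &#233;, while &amp; passes through unchanged, so the result no longer depends on an HTML DTD being around to define the names.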
We still use it for PHP Advent, and I ran this post through it. (-: