1. Use `env`

    We use quite a few technologies to build our products, but Gimme Bar is still primarily a PHP app.

    To support these apps, we have a number of command-line scripts that handle maintenance tasks, cron jobs, data migration jobs, data processing workers, etc.

    These scripts often run PHP in Gimme Bar land, and we make extensive use of the shebang syntax, following the common Unix practice of putting #!/path/to/interpreter at the beginning of our command-line code. Clearly, this is nothing special—lots of people do exactly this same thing with PHP scripts.

    One thing I have noticed, though, is that many developers of PHP scripts are not aware of the common Unix(y) environment helper, env.

    I put this on Twitter a while ago, and it seemed to resonate with a lot of people:

    The beauty of using /usr/bin/env php instead of just /usr/local/bin/php or /usr/bin/php is that env will use your path to find the php you have set up for your user.

    We've mostly standardized our production and development nodes, but there's no guarantee that PHP will be in the same place on each box where we run it. env, however, is always located in /usr/bin—at least on all of the boxes we control, and on my Mac workstation.

    Maybe we're testing a new version of PHP that happens to be in /opt/php/bin/php, or maybe we have to support an old install on a different distribution than our standard, and PHP is located in /bin/php instead of /usr/bin/php. The practice of using env for this helps us push environmental configurations out of our code and into the actual environment.

    If you distribute a PHP application that has command-line scripts and shebang lines, I encourage you to adopt the practice of making your shebang line #!/usr/bin/env php.
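
    In practice, a script like this (the file name and task here are made up; the shebang line is the point) will run under whichever php is first in the invoking user's PATH:

    #!/usr/bin/env php
    <?php
    // cleanup.php -- a hypothetical maintenance script.
    // Thanks to env, this runs with whichever php appears first in the
    // invoking user's PATH, wherever that binary actually lives.
    $options = getopt('', array('dry-run'));
    $dryRun  = isset($options['dry-run']);

    fwrite(STDOUT, 'Running cleanup' . ($dryRun ? ' (dry run)' : '') . PHP_EOL);
    // ...actual maintenance work would go here...
    exit(0);

    Make it executable with chmod +x and it behaves like any other command-line tool, no matter where PHP happens to be installed on that particular box.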

    Note that this doesn't just apply to PHP of course, but I've seen a definite lack of env in the PHP world.

  2. PHP as a templating language

    There’s been quite a bit of talk, recently, in PHP-land about templates and the ramifications of enforcing “pure” PHP scripts by preventing scripts from entering HTML mode. I’m not quite sure how I feel about this RFC, but it got me thinking about the whole idea of using PHP for templating in modern web apps.

    For many years, I was a supporter of using PHP as a templating language to render HTML. However, I really don’t buy into the idea of adding an additional abstraction layer on top of PHP, such as Smarty (and many others). In the past year or so, I’ve come to the realization that even PHP itself is no longer ideally suited to function as the templating engine of current web applications — at least not as the primary templating engine for such apps.

    The reason for this evolution is simple: modern web apps are no longer fully server-driven.

    PHP, as you know, is a server technology. Rendering HTML on the server side was fine for many years, but times have changed. Apps are becoming more and more API-driven, and JSON is quickly becoming the de facto standard for API envelopes.

    We can no longer assume that our data will be rendered in a browser, nor that it will be rendered exclusively in HTML. With Gimme Bar, we render HTML server-side (to reduce page load latency), in JavaScript (when adding or changing elements on an already-rendered page), in our API (upcoming in a future enhancement), in our iPhone app, and certainly in other places that I’m forgetting.

    Asset rendering in Gimme Bar can be complicated — especially for embed assets. We definitely don’t want to maintain the render logic in more than one place (at least not for the main app). We regularly need to render elements in both HTML and JavaScript.

    This is precisely why we don’t directly use PHP to render page elements anymore. We use Mustache (and Mustache-compatible Handlebars). This choice allows us to easily maintain one (partial) template for elements, and we can render those elements on the platform of our liking (which has been diversifying more and more lately, but is still primarily PHP and JavaScript).
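
    On the PHP side, that looks roughly like this (a minimal sketch using the Mustache.php library; the partial and data are invented, and the same template string can be handed, verbatim, to Handlebars in the browser):

    <?php
    require 'vendor/autoload.php'; // assumes Mustache.php is installed via Composer

    // One partial, shared between the server (Mustache.php) and the
    // browser (Handlebars). Template and data here are hypothetical.
    $template = '<li class="asset"><a href="{{url}}">{{title}}</a></li>';

    $mustache = new Mustache_Engine();

    echo $mustache->render($template, array(
        'title' => 'An example asset',
        'url'   => 'http://example.com/assets/123',
    ));
    // <li class="asset"><a href="http://example.com/assets/123">An example asset</a></li>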

    Rendering elements to HTML on the server side, even if transferred through a more dynamic method such as via XHR, really limits what can be done on the display side (where “display side” can mean many things these days — not just browsers).

    We try hard to keep the layers of our web applications separated through patterns such as Model/View/Controller, but for as long as we’ve been doing so, we’ve often put the view bits in the wrong place. This was appropriate for many years, but now it is time to leave the rendering duties up to the layer of your application that is actually performing the view. This is often your browser.

    For me, this has become the right way to do things: avoid rendering HTML exclusively on the server side, and use a technology that can push data rendering to your user’s client.

  3. Natural Load Testing

    My friend Paul Reinheimer has put together an excellent product/service that is probably of use to many of you.

    The product is called Natural Load Testing, and it harnesses some of the machinery that powers the also-excellent wonderproxy and its extremely useful VPN service.

    The gist is that once you've been granted an account (they're in private beta right now, but tell them I sent you, and if you're not a horrible person such as a spammer, scammer, or promoter of online timesuck virtual farming, you'll probably get in—just kidding about that farming clause… sort of), you can record real, practical test suites within the simple confines of your browser, and then you can use those recorded actions to generate huge amounts of test traffic to your application.

    In principle, this idea sounds like nothing new—you might already be familiar with Apache Bench, Siege, http_load, or other similar tools—but NLT is fundamentally different from these in several ways.

    First, as I already mentioned, NLT allows you to easily record user actions for later playback. This is cool but on its own is not much more than merely convenient. What isn't immediately obvious is that in addition to the requests you're making (HTTP verbs and URLs), NLT is recording other extremely important information about your actions, too: I find HTTP headers and timing particularly interesting.

    Next, NLT allows you to use the test recordings in a variable manner. That is, you can replace things like usernames and email addresses (and many other bits of variable content) with system-generated semi-random replacements. This allows you to test things like a full signup process, or semi-anonymous comment posting, all under load.

    NLT also keeps track of secondary content that your browser loads when you're recording the test cases. Things like CSS, JavaScript, images, and XHR/Ajax requests are easy to overlook when using less-intelligent tools. NLT records these requests and (optionally) inserts them into test suites alongside primary requests.

    Tools like Siege and the others I've mentioned are useful when you want to know how many concurrent requests your infrastructure can sustain. This is valuable data, but it is often not really practical. Handling a Slashdotting (or whatever the modern day equivalent of such things is called) is only part of the problem. Wouldn't you really prefer to know how many users can concurrently sign up for your app, or how many page-1-to-page-2 transitions you can handle, without bringing your servers to their knees (or alternatively: before scaling up and provisioning new machines in your cluster)?

    Here's a practical example. Since before the first edition of the conference, the Brooklyn Beta site had been running on my personal (read: toy) server. Before launching this year's edition of the site, which included the announcement for Summer Camp, I got a bit nervous about the load. I wasn't so much worried about the rest of my server suffering at the traffic of Brooklyn Beta, but more about the Brooklyn Beta site becoming unavailable due to overloading. This seemed like a good opportunity to give NLT a whirl.

    I recorded a really simple test case by firing up NLT's proxy recorder and visiting each page, in the order and timeframe I expected real users to browse from page to page. Then we unleashed the NLT worker hounds on the pre-release version of the site (same hardware, just not on the main URL), and discovered that it wasn't doing very well under load. I then set up Varnish and put it into the request chain (we were testing mostly dynamically-generated static content after all—why not cache it?). The results were clear and definitive: Varnish made a huge difference, and NLT showed us exactly how. (We've since moved the Brooklyn Beta site to EC2, along with most of the rest of our infrastructure.)

    This chart shows several response times over 20 seconds with only 100 concurrent requests without Varnish, and, with Varnish in place, most response times less than 20 milliseconds with 500 concurrent requests. Conclusion: we got over a thousand times better performance with five times as many concurrent workers when Varnish was in play.

    (Aside: I hope to blog in more detail about Varnish one day, but in the meantime, if you've got content you can cache, you should cache it. Look up how to do so with Varnish.)

    If NLT sounds interesting, I encourage you to go watch the demo video and sign up. Then send Paul all kinds of bug reports and feature requests so that he can make it more awesome before he accepts the few dollars you'll be begging him to take in exchange for your use of the service.

  4. Ideas of March

    A year ago, I posted about Ideas of March, which Chris got rolling.

    In it, I pledged to blog more.

    Today, I am not so proud to say that I have mostly failed to do so. If I had to come up with a reason, I'd have to say that, personally, 2011 turned out a whole lot different than I was expecting, back then—and not in a good way.

    Over the last year, however, I did post a few things that I think were interesting, and worthy of a re-read (at risk of making this post into a clip show):

    PHP Community Conference
    …a post about why I was excited about going to the PHP Community Conference in Nashville, last May. It turned out to be even better than I expected, and I'm really excited that plans are coming together for a 2012 edition.
    Gimme Bar no longer on CouchDB and Gimme Bar on MongoDB
    …a pair of posts describing some problems we had with CouchDB, and our smooth transition to MongoDB. We're still on MongoDB, and for the most part, I still really like it. I'd hinted about these posts in last year's Ideas of March post.
    Webshell
    …on Webshell, which I still use almost daily, but which has most certainly fallen out of a reasonable upkeep schedule. I really need to find some time to clean out the cobwebs. If you use HTTP and know JavaScript, you should check it out.
    Aficionado's Curse/Pessimistic Optimism
    …a post that I'm particularly proud of, mostly because I've finally managed to document (and, I hope, coin a term for) why things seem so bad, but aren't actually so bad.
    HTTP/1.0 and the Connection header
    …finally, over Christmas, I managed to post about HTTP things (-:

    I was really hoping to do more. Last year, I suggested that I might turn my talk on Fifty tips, tricks and tools into a series of small blog posts, and I'd still like to do this. Hopefully in 2012. I also have a list of other things that I'm really interested in writing about. It's just a matter of making time to do so. I plan to do that, this year. Starting with this post.

    I'd also like to get around to writing a thing or two about beer, this year…

    Much of what I said last year is still on my mind. I still miss the blogs we kept, 5+ years ago. Let's fix that.

    </navelgazing>

  5. HTTP 1.0 and the Connection header

    I have a long backlog of things to write about. One of those things is Varnish (more on that in a future post). So, over these Christmas holidays, while intentionally taking a break from real work, I decided to finally do some of the research required before I can really write about how Varnish is going to make your web apps much faster.

    To get some actual numbers, I broke out the Apache Benchmarking utility (ab), and decided to let it loose on my site (100 requests, 10 requests concurrently):

    ab -n 100 -c 10 http://seancoates.com/codes

    To my surprise, this didn't finish almost immediately. The command ran for what seemed like forever. Finally, I was presented with its output (excerpted for your reading pleasure):

    Concurrency Level:      10
    Time taken for tests:   152.476 seconds
    Complete requests:      100
    Failed requests:        0
    Write errors:           0
    Total transferred:      592500 bytes
    HTML transferred:       566900 bytes
    Requests per second:    0.66 [#/sec] (mean)
    Time per request:       15247.644 [ms] (mean)
    Time per request:       1524.764 [ms] (mean, across all concurrent requests)
    Transfer rate:          3.79 [Kbytes/sec] received

    Less than one request per second? That surely doesn't seem right. I chose /codes because the content does not depend on any sort of external service or expensive server-side processing (as described in an earlier post). Manually browsing to this same URL also feels much faster than one request per second. There's something fishy going on here.

    I thought that there might be something off with my server configuration, so in order to rule out a concurrency issue, I decided to benchmark a single request:

    ab -n 1 -c 1 http://seancoates.com/codes

    I expected this page to load in less than 200ms. That seems reasonable for a light page that has no external dependencies, and doesn't even hit a database. Instead, I got this:

    Concurrency Level:      1
    Time taken for tests:   15.090 seconds
    Complete requests:      1
    Failed requests:        0
    Write errors:           0
    Total transferred:      5925 bytes
    HTML transferred:       5669 bytes
    Requests per second:    0.07 [#/sec] (mean)
    Time per request:       15089.559 [ms] (mean)
    Time per request:       15089.559 [ms] (mean, across all concurrent requests)
    Transfer rate:          0.38 [Kbytes/sec] received

    Over 15 seconds to render a single page‽ Clearly, this isn't what's actually happening on my site. I can confirm this with a browser, or very objectively with time and curl:

    $ time curl -s http://seancoates.com/codes > /dev/null
    
    real  0m0.122s
    user  0m0.000s
    sys   0m0.010s

    The next step is to figure out what ab is actually doing that's taking an extra ~15 seconds. Let's crank up the verbosity (might as well go all the way to 11).

    $ ab -v 11 -n 1 -c 1 http://seancoates.com/codes
    (snip)
    Benchmarking seancoates.com (be patient)...INFO: POST header == 
    ---
    GET /codes HTTP/1.0
    Host: seancoates.com
    User-Agent: ApacheBench/2.3
    Accept: */*
    
    
    ---
    LOG: header received:
    HTTP/1.1 200 OK
    Date: Mon, 26 Dec 2011 16:27:32 GMT
    Server: Apache/2.2.17 (Ubuntu) DAV/2 SVN/1.6.12 mod_fcgid/2.3.6 mod_ssl/2.2.17 OpenSSL/0.9.8o PHP/5.3.2
    X-Powered-By: PHP/5.3.2
    Vary: Accept-Encoding
    Content-Length: 5669
    Content-Type: text/html
    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    (HTML snipped from here)
    
    LOG: Response code = 200
    ..done
    
    (snip)

    This all looked just fine. The really strange thing is that the output stalled right after LOG: Response code = 200 and right before ..done. So, something was causing ab to stall after the request was answered (we got a 200, and it's a small number of bytes).

    This is the part where I remembered that I've seen a similar behaviour before. I've lost countless hours of my life (and now one more) to this problem: some clients (such as PHP's streams) don't handle Keep-Alives in the way that one might expect.

    HTTP is hard. Really hard. Way harder than you think. Actually, it's not that hard if you remember that what you think is probably wrong if you're not absolutely sure that you're right.

    ab or httpd does the wrong thing. I'm not sure which one, and I'm not even 100% sure it's wrong (because the behaviour is not defined in the spec as far as I can tell), but since it's Apache Bench and Apache httpd we're talking about here, we'd think they could work together. We'd be wrong, though.

    Here's what's happening: ab is sending an HTTP/1.0 request with no Connection header, and httpd is assuming that the client wants to keep the connection open, despite this. So, httpd hangs on to the socket for an additional—you guessed it—15 seconds, after the request is answered.

    There are two easy ways to solve this. First, we can tell ab to actually use keep-alives properly with the -k argument. This allows ab to drop the connection on the client side after the request is complete. It doesn't have to wait for the server to close the connection because it expects the server to keep the socket open, awaiting further requests on the same socket; in the previous scenario, the server behaved the same way, but the client waited for the server to close the connection.

    A more reliable way to ensure that the server closes the connection (and to avoid strange keep-alive related benchmarking artifacts) is to explicitly tell the server to close the connection instead of assuming that it should be kept open. This can be easily accomplished by sending a Connection: close header along with the request:

    $ ab -H "Connection: close" -n1 -c1 http://seancoates.com/codes
    (snip)
    Concurrency Level:      1
    Time taken for tests:   0.118 seconds
    Complete requests:      1
    Failed requests:        0
    Write errors:           0
    Total transferred:      5944 bytes
    HTML transferred:       5669 bytes
    Requests per second:    8.48 [#/sec] (mean)
    Time per request:       117.955 [ms] (mean)
    Time per request:       117.955 [ms] (mean, across all concurrent requests)
    Transfer rate:          49.21 [Kbytes/sec] received
    (snip)

    118ms? That's more like it! A longer, more aggressive (and concurrent) benchmark gives me a result of 88.25 requests per second. That's in the ballpark of what I was expecting for this hardware and URL.

    The moral of the story: state the persistent connection behaviour explicitly whenever making HTTP requests.
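
    In PHP, with the streams client I mentioned above, being explicit looks something like this sketch (only the Connection header and protocol version matter here):

    <?php
    // Tell the server explicitly that we won't reuse the connection.
    // Without this, some client/server combinations leave the socket
    // open until a keep-alive timeout expires, which looks like a stall.
    $context = stream_context_create(array(
        'http' => array(
            'method'           => 'GET',
            'protocol_version' => 1.1,
            'header'           => "Connection: close\r\n",
        ),
    ));

    $body = file_get_contents('http://seancoates.com/codes', false, $context);
    echo strlen($body) . " bytes\n";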

  6. Aficionado's Curse / Pessimistic Optimism

    As I've mentioned in previous posts, I like beer. I mean, I really like it. I've tasted many unique, special, rare, and extremely old beers. I even have Beer Judging credentials. I would go as far as to say that I'm a "beer aficionado." I find the idea of a cheap, poorly-made beer (especially when there are superior alternatives on hand) to be almost repulsive.

    I know aficionados in other fields: wine, vodka, scotch, cheese, movies, music, specific genres of music, woodworking, home electronics, office design, and even one guy I might consider a dog food aficionado.

    These people, like myself when it comes to beer, often suffer from what I call "Aficionado's Curse." This syndrome affects experts in many fields, and prevents them from truly enjoying the full gamut of their medium. They are able to truly appreciate the utmost quality, but are turned off by the bottom shelf. Others (non-aficionados) are perfectly happy consuming the most readily available items and occasionally treating themselves to something fancy from the middle.

    Consider someone who is completely immersed in the world of hand-made Swiss timepieces. Now, consider that person wearing a Happy-Meal-derived digital watch with a made-in-China plastic band. Unless they're trying to perfect their ironic hipster look, the cheap watch likely wouldn't fly.

    In high school, long before I considered that this thing might have a name, I had a media teacher who was a film aficionado. He once told our class "When you start making movies—no matter the length—your world will change. You will stop simply being entertained by what is on the screen before your eyes, and instead will wonder how they did that, or you might marvel at the complexity of a specific shot. You'll still enjoy film, but in a different way." This nicely sums up Aficionado's Curse.

    Maybe ignorance is bliss.

    In a phenomenon similar to Aficionado's Curse, I've noticed a trend that cloaks optimism in pessimism. Unfortunately, I am a victim of this, and I try to keep it reined in, but I often fail.

    I have very high expectations for things… nearly everything, in fact. I think that high expectations are generally a good thing, as so many things are—shall we say—of a quality less than the highest.

    I expect machines to work properly, traffic to flow, computers to perform at reasonable levels, food to taste good, service to be quick and friendly, events to respect their schedules, and other similar things. More often than not, though, I am let down by this unmaintainable level of optimism. The bad part is that in my letdown, I often find myself complaining (or if I've managed to keep it under control, not complaining) about such things. Not because the thing I've just witnessed is completely broken, but because it's sub-optimal in some way. These complaints are (reasonably, I admit) perceived as pessimism. My optimism has precipitated as pessimism.

    I think this happens to smart people quite a bit. I've worked with people who are extremely unpleasant, but also extremely kind and forgiving. This may have been due to the scenario described above.

    I saw this on Fred Wilson's blog a while back, and I think it's relevant:

    "sometimes we make money with brilliant people who are easy to get along with, most often we make money with brilliant people who are hard to get along with, but we rarely make money with normal people who are easy to get along with."

    One of the greatest things I learned from Chris when working at OmniTI was something that he didn't intentionally teach me (at least not as part of my job): it's OK to be let down, but complaining about it doesn't often breed positive change. I've tried to apply this to my public persona in the past few years, and at risk of sounding like a complaint, I think we'd all do well to follow Chris's lead, and strive to be brilliant people who are also easy to get along with.

  7. Webshell

    Webshell is a console-based, JavaScripty web client utility that is great for consuming, debugging and interacting with APIs.

    I use Firefox as my primary browser. The main reason I've been faithful to Mozilla is my set of add-ons. I use Firebug regularly, and I'm not sure what I'd do without JSONovich.

    Last year, as I built Gimme Bar's internal API, I found myself using Curl, extensively, and occasionally Poster, to test and debug my code.

    These two tools have allowed me to interact with HTTP, but not in the most optimal way. Poster's UI is clunky and isn't scriptable (without diving into Firefox extension internals), and Curl requires a lot of Unixy glue to process the results into something more usable than visual inspection.

    I wanted something that would not only make requests, but would let me interact with the result of these requests.

    When working with Evan to debug a problem one day, I mentioned my problem, and said "I really should build something that fixes this." Evan suggested that such a thing would be really useful to him, too, and that he'd be interested in working on it.

    I'd planned on building my version of the tool in PHP. Evan is… not a PHP guy. He's a [whisper]Ruby[/whisper] guy.

    If you've seen me speak at a conference, lately, you've probably seen this graphic:

    Venn Diagram

    It shows that we have diverse roles in Gimme Bar, but everyone who touches our code can speak JavaScript. (This is another, much longer post that I maybe should write, but in the meantime, see this past PHP Advent entry.)

    Thus, Evan suggested that we write Webshell in JavaScript, with node.js as our "framework." Despite the aforementioned affinity for Ruby (cheap shots are fun! (-: ), Evan is a pretty smart guy. It turns out that this was not only convenient, but working with HTTP traffic (especially JSON results (of course)) is way better with JavaScript than it would have been with PHP.

    So, Webshell was born. If you want to see exactly what it does, you should take a look at the readme, which outlines almost all of its functionality.

    If you use curl, or any sort of other ad-hoc queries to inspect, consume, debug or otherwise touch HTTP, I hope you'll take a look at Webshell. It saves me several hours every week, and most of our Gimme Bar administration is done with it. Also, it's on GitHub so please fork and patch. I'd love to see pull requests.

  8. Gimme Bar on MongoDB

    I'm happy to report that Gimme Bar has been running very well on MongoDB since early February of this year. I previously posted on some of the reasons we decided to move off of CouchDB. If you haven't read that, please consider it a prerequisite for the consumption of this post.

    Late last year, I knew that we had no choice but to get off of CouchDB. I was dreading the port. The dread was two-fold. I dreaded learning a new database software, its client interface, administration techniques, and general domain-knowledge, but I also dreaded taking time away from progress on Gimme Bar to do something that I knew would help us in the long term, but was hard to justify from a "product" standpoint.

    I did a lot of reading on MongoDB, and I consulted with Andrei, who'd been using MongoDB with Mapalong since they launched. In the quiet void left by the holiday, on New Year's day this year, I seized the opportunity of absent co-workers, branched our git repository, put fingers-to-keyboard—which I suppose is the coding version of pen-to-paper—and started porting Gimme Bar to Mongo.

    I expected the road to MongoDB to be long, twisty, and paved with uncertainty. Instead, what I found was remarkable—incredible, even.

    Kristina Chodorow has done a near-perfect job of creating the wonderful tandem that makes up PHP's MongoDB extension and its most-excellent documentation. If it weren't for Kristina (and her employer, 10gen, for dedicating her time to this), the porting might have been as expected: difficult and lengthy. Instead, the experience was pleasant and straightforward. We're not really used to this type of luxury in the PHP world. (-:

    From the start, I knew that our choice of technologies carried a certain amount of risk. I'm kind of a risk-averse person, so I like to weigh the benefits (some of which I outlined in the aforementioned post), and mitigate this risk whenever possible. My mitigation technique involved making my models as dumb as possible about what happens in the code between the models and the database. I wasn't 100% successful in keeping things entirely separate, but the abstraction really paid off. I had to write a lot of code, still, but I didn't have to worry too much about how deep this code had to reach. Other than a few cases, I swapped my CouchDB client code out for an extremely thin wrapper/helper class and re-wrote my queries. The whole process took only around two weeks (of most of my time). Testing, syncing everyone, rebuilding production images and development virtual machine images, and deployment took at least as long.

    That was the story part. Here comes the opinion part (and remember, this is just my opinion; I could very well be wrong).

    After using both extensively (for a very specific application, admittedly), I firmly believe that MongoDB is a superior NoSQL datastore solution for PHP-based, non-distributed (think Dropbox), non-mobile web applications.

    This opinion stems almost fully from Mongo's rich query API. In the current version of Gimme Bar, we have a single map/reduce job (for tags). Everything else has been replaced by a straightforward and familiar query. The map/reduce is actually practical, and things like sorting and counting are a breeze with Mongo's cursors. I did have to cheat in a few places that I don't expect to scale very well (I used $in when I should denormalize), but the beauty of this is that I can do these things now, whereas with Couch, my only option was to denormalize and map. Yes, I know this carries a scaling/sharding and performance penalty, but you know what? I don't care yet. ("Yet" is very important.)
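
    To show what I mean by "straightforward and familiar," here's a sketch against the Mongo extension we were using at the time (the collection and field names are invented):

    <?php
    // Hypothetical collection and fields; the point is the query API.
    $mongo  = new Mongo(); // the (then-current) PHP Mongo extension
    $assets = $mongo->gimmebar->assets;

    $followedUserIds = array('bob', 'chris', 'dale'); // hypothetical IDs

    // Assets captured by any followed user, newest first -- one query,
    // with sorting and limiting handled by the cursor.
    $cursor = $assets
        ->find(array('user_id' => array('$in' => $followedUserIds)))
        ->sort(array('captured_at' => -1))
        ->limit(20);

    echo $cursor->count() . " matching assets\n";
    foreach ($cursor as $asset) {
        echo $asset['title'], "\n";
    }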

    MongoDB also provides a few other things to developers that were absent in CouchDB. For example, PHP speaks to Mongo through the extension and a native driver. CouchDB uses HTTP for transport. HTTP carries a lot of overhead when you need to do a lot of single-document requests (for example, when topping up a pagination set that's had records de-duplicated). My favourite difference, though, is in the atomic operations, such as findAndModify, which make a huge difference both logic- and performance-wise, at least for Gimme Bar.
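
    Here's a rough sketch of the findAndModify pattern, issued as a database command (the collection and fields are invented; a job queue is just a convenient illustration, not how we actually use it):

    <?php
    // Atomically claim the next pending job: no separate find-then-update
    // race between competing workers.
    $mongo = new Mongo();
    $db    = $mongo->gimmebar;

    $result = $db->command(array(
        'findAndModify' => 'jobs',
        'query'         => array('status' => 'pending'),
        'sort'          => array('created_at' => 1),
        'update'        => array('$set' => array('status' => 'processing')),
        'new'           => true, // return the document as modified
    ));

    $job = isset($result['value']) ? $result['value'] : null;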

    Of course, there are two sides to every coin. There are CouchDB features that I miss. Namely: replication, change notification, CouchDB-Lucene (we're using ElasticSearch and manual indexing now), and Futon.

    Do I think MongoDB is superior to CouchDB? It depends what you're using it for. If you need truly excellent eventual-consistency replication, CouchDB might be a better choice. If you want to have your JavaScript applications talk directly to the datastore, CouchDB is definitely the way to go. Do I have a problem with CouchDB, their developers or their community? Not at all. It's just not a good fit for the kind of app we're building.

    The bottom line is that I'm extremely happy with our port to MongoDB, and I don't have any regrets about switching other than not doing it sooner.

  9. Gimme Bar no longer on CouchDB

    As mentioned in a previous post, we started building Gimme Bar a little over a year ago. We did a lot of things right, but we also did some things wrong.

    Since—in those early days—I was the only developer, and since most of my professional development experience is in PHP, the choice of language was obvious. I also started building the API before the front-end. I chose a really simple routing model for the back-end, and got to work, sans framework. Our back-end code is still really lean (for the most part), and I'm still (mostly (-: ) proud of it.

    When it came time to select a datastore, I chose something a bit more risky, with Cameron's blessing.

    Having just spent the best part of a year and a half working with PostgreSQL at OmniTI, I felt it was time to try something new. We knew this carried risks, but the timing was good, and—quite frankly—I was simply bored of hacking on stored procedures in PL/pgSQL. We wanted something that could be expected to scale (eventually, when we need it), without deep in-house expertise, but also something that I'd find fun to work on. I love learning new things, so we thought we'd give one of the NoSQL solutions a whirl.

    In those days (January 2010), the main NoSQL contenders for building a web application were—at least in our minds—CouchDB and MongoDB. Also in those days, MongoDB didn't quite feel like it was ready for us. I could, of course, be wrong, but I figured that since I didn't know either of these systems very well, the best way to find out was to just pick one and go with it. The thing that ultimately pushed us to the CouchDB camp was a mild professional relationship with some of the CouchDB guys. So, we built the first versions of Gimme Bar on top of Linux, Apache, PHP 5.3, Lithium (on the front-end), jQuery and CouchDB.

    By summer 2010, we began work on adding social features (which have since been hidden) to Gimme Bar, and CouchDB started giving us trouble. This wasn't CouchDB's fault, really. It was more of an architectural problem. We were trying to solve a relational problem with a database that by-design knew nothing about how to handle relationships.

    Now might be a good time to explain document-independence and map/reduce, but I fear that would take up more attention than you've kindly offered to this article, and it's going to be long even without a detailed tutorial. Here's the short version: CouchDB stores structured objects as (JSON) documents. These documents don't know anything about their peers. To "query" (for lack of a better term) Couch, you need to write a map function (in JavaScript or Erlang, by default) that is passed all documents in the database and emits keys and values to an index that matches your map's criteria. These keys can be (roughly) sorted, and to "query" your documents, you jump to a specific part of this sorted index and grab one or more documents in the sequence. From what I understand of map/reduce (and my only practical experience so far is with CouchDB), this is how other systems such as Hadoop work, too.
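
    Here's a minimal sketch of that model (the design document, view, and field names are invented; the map function is a JavaScript string stored inside the design document, and both the PUT and the view query are plain HTTP):

    <?php
    // A design document containing one view. The map function is JavaScript,
    // stored as a string; CouchDB runs it over every document.
    $designDoc = array(
        '_id'   => '_design/assets',
        'views' => array(
            'by_user' => array(
                'map' => 'function (doc) {
                    if (doc.type === "asset") {
                        emit(doc.user_id, {title: doc.title, captured_at: doc.captured_at});
                    }
                }',
            ),
        ),
    );

    $couch = 'http://localhost:5984/gimmebar';

    // Create (PUT) the design document...
    $put = stream_context_create(array('http' => array(
        'method'  => 'PUT',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode($designDoc),
    )));
    file_get_contents($couch . '/_design/assets', false, $put);

    // ...then "query" by jumping to one key in the view's sorted index.
    $rows = json_decode(
        file_get_contents($couch . '/_design/assets/_view/by_user?key=' . urlencode('"aaron"')),
        true
    );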

    There is tremendous value to a system like this. Once the index is generated, it can be incrementally updated, and querying a huge dataset is fast and efficient. The reduce side of map/reduce (we had barely a handful of reduce functions) is also incredibly powerful for calculating aggregates of the map data, but it's also intentionally limited to small subsets of the mapped data. These intentional limits allow map/reduce functions to be highly parallelizable. To run a map on 100 servers, the dataset can be split into 100 pieces, and each server can process its individual chunk safely and in parallel.

    This power and flexibility has an architectural cost. Over a decade of professional development with various relational databases taught me that in order to keep one's schema descriptive and robust, one must always (for small values of "always") keep data normalized until a performance problem forces denormalization. With a document-oriented datastore like CouchDB or MongoDB, denormalization is part of the design.

    A while ago, I made an extremely stripped-down example of how something like user relationships are handled in Gimme Bar with CouchDB. This document is for the user named "aaron" (_id: c988a29740241c7d20fc7974be05ec54). Aaron is following bob (_id: c988a29740241c7d20fc7974be05f67d), chris (_id: c988a29740241c7d20fc7974be05ff71), and dale (_id: c988a29740241c7d20fc7974be060bb4). You can see the references to the "following" users in aaron's document. I also published example maps of how someone might go about querying this (small) set.

    The specific problem that we ran into with CouchDB is that our "timeline" page showed the collected assets of users that the currently-logged-in user is following. So, aaron would see assets that belong to bob, chris and dale. This, in itself, isn't terribly difficult; we just needed to query once for each of aaron's follows. The problem was further complicated when a requirement was raised to not only see the above, but also to collapse duplicates into one displayed asset (if bob and chris collected the same asset, aaron would only see it once). Oh, and also, these assets needed to be sorted by their capture time. These requirements made the chain of documents extremely complicated to query. In a relational system, a few (admittedly expensive) joins would have taken care of it in short order.
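
    Spelled out in application code, the requirement amounts to something like this sketch (fetchAssetsForUser() is a hypothetical helper that reads one user's view, and the field names are invented):

    <?php
    // Merge the assets of everyone the user follows, collapse duplicates
    // by media hash, and sort by capture time.
    function buildTimeline(array $followedUserIds)
    {
        $byMediaHash = array();

        foreach ($followedUserIds as $userId) {
            foreach (fetchAssetsForUser($userId) as $asset) {
                $hash = $asset['media_hash'];
                if (!isset($byMediaHash[$hash])) {
                    $byMediaHash[$hash] = $asset; // keep one copy per asset
                }
            }
        }

        // Newest captures first.
        usort($byMediaHash, function ($a, $b) {
            return $b['captured_at'] - $a['captured_at'];
        });

        return $byMediaHash;
    }

    Getting CouchDB itself to produce that merged, de-duplicated, time-sorted set incrementally is where things got complicated.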

    I spent a lot of time fighting with CouchDB to solve this problem. I asked in the #couchdb channel on Freenode, posted to the mailing list and even resorted to StackOverflow (a couple times) before coming up with a "solution." I put the word "solution" in quotes there because what I was told to do only partially solved our problem.

    The general consensus was that we should denormalize our follow/following + asset records in an extreme way (as you can see in the StackOverflow posts, above). I ended up creating an interim index of all of a user's followers/following links, plus an index of all of the media hashes (what we use to uniquely identify assets, even when captured by different users). Those documents got pretty big pretty quickly (even though we had less than 100 users at the time). Here's an example: Cameron's FollowersIndex document.

    As you might guess, even a system designed to handle large documents like this (such as CouchDB) would have a hard time with the sheer size. Every time an asset was captured, it would get injected into the FollowersIndex documents, which caused a reindex… which used up a lot of RAM, and caused bottlenecks. Severe bottlenecks. Our 8GB of RAM was easily exhausted by our JavaScript map function. Think about that. 8GB… for <100 users. This was not going to survive. Turns out we were exhausting Erlang's memory allocator and actually crashing CouchDB. From userspace. I asked around, and the proposed solution to this problem-within-a-problem was to re-write the JavaScript map as Erlang to avoid the JSON conversion overhead. At this point, I was desperate. I had Evan (who is a valuable member of the team, and is a far superior computer scientist to me) translate the JS to Erlang. What he came up with made my head hurt, but it worked. And by "worked," I mean that it didn't crash CouchDB and send it into a recovery spiral (crash, respawn, reindex, crash, repeat)… but it did work. Enough to get us by for a few weeks, and that's what we did: get by. The index regeneration for the friends feed was so slow that I had to use delayed indexes and reindex in cron every minute. CouchDB was using most of our computing resources, and we knew we couldn't sustain any growth with this system.

    At this point, we decided to cut our losses, and I went to investigate other options, including MySQL and MongoDB. My next blog post will be on why I think MongoDB is a superior solution for building web applications, despite CouchDB being better in certain areas.

  10. PHP Community Conference

    I was once told that "the only reason you're successful is that you were at the right place at the right time." Other than the word "only" in that declaration, the accuser was mostly right. The reason I'm [moderately] successful is that I was at the right place at the right time. The subtlety in the second statement is in the reason I was at the right place at the magical time.

    I firmly believe that my technical skills are only part of my value, career-wise. Looking back on my career so far, I can definitely see opportunities that arose because of being at the right place. What wasn't considered in the flippant statement was why I was there, when I was.

    To me, it's clear: I've taken measures to put myself in the right place, when it was beneficial to do so. I've been doing this for years, and it's paid off.

    Want to know how I became the Editor-in-Chief of php|architect magazine, a Web Architect at OmniTI, and was put into contact with my co-founder for Gimme Bar? Sure, my abilities to build web stuff played into all of those roles, but the way I found myself in all of those positions was by asking. Yes, asking.

    Was I in the right place at the right time when I noticed Marco commenting about having to edit the current issue of php|architect, and I chimed in "hey, I kind of actually like that sort of thing," half a decade ago? Definitely, but it's more complicated than "luck."

    Similarly, when I approached Chris Shiflett about working with OmniTI, his immediate reaction was "Of course there's room on my team for you; we'll just need to work out the details." Am I that good, when it comes to coding, architecting large deployments, and managing a team? Definitely not—even less so back then.

    The real question is why was I hanging out on IRC when Marco was venting, or how was it so easy for me to have Chris's ear? The answer is simple: I'd established myself as part of the PHP community, and had a standing with those guys, even without having ever worked with them, directly (I had written for php|architect before, but it wasn't under Marco's direct supervision).

    I assume that many of you readers are already members of the community in some way. That could be as simple as participating on mailing lists or forums, helping reproduce bugs, or fixing grammatical errors in the manual. One of the best ways I've found to connect with the community, though, is in person.

    Nearly everyone I know and have had a long-term relationship with, in the PHP community, I met at a conference. Sure, I'd often "known" someone from their online persona, but it's hard to really "know" someone until you've spent some face time with them, preferably with a beer or two between you.

    This is one of the main reasons that I think that the PHP Community Conference in Nashville, in just about a month, is important, and why I think you should go. I have no personal stake in this (in fact, since it's run by the community, the only stake to be had is a potential loss by the organizers; there is no profit to be had), I just think it's going to be a great event, and a wonderful opportunity for attendees—and not just from a career perspective, but I expect everyone who attends will become more valuable to their current employers, too, based simply on knowledge gained and connections made. (There's a huge amount of value in being able to fire off a friendly email to the author of (e.g.) the memcached extension, when you get stuck, and to already be on a first-name basis.)

    I'm also speaking, there, on Gimme Bar. It won't be a pitch. It will be more of a show-and-tell session on which technologies we use, how we've built what we have so far, what I think we've done right, and a frank discussion on the mistakes we've made (so far (-: ).

    If you can, you should make it to the PHP Community Conference, and be in the right place at the right time, whether it's Nashville on April 21 and 22, or sometime in your future.