1. A Case of Mistaken Iterator

    Earlier this week, I spent most of a day tracing through code in search of the source of a bug that was causing part of our application to fail in strange ways.

    In the back end, we have models that connect to CouchDB. These models implement the Iterator pattern to allow easy traversal of a record’s keys.

    When I wrote the code to implement Iterator several months ago, I dutifully checked the PHP Manual and adapted the reference example that I found there:

    <?php
    class Record implements Iterator
    {
        // (partial class, showing the iterator implementation only)
     
    	public $_data = array();
     
    	public function rewind()
    	{
    		reset($this->_data);
    	}
     
    	public function current()
    	{
    		return current($this->_data);
    	}
     
    	public function key()
    	{
    		return key($this->_data);
    	}
     
    	public function next()
    	{
    		return next($this->_data);
    	}
     
    	public function valid()
    	{
    		return (current($this->_data) !== false);
    	}
     
    }

    Little did I realize that this implementation is very broken. I’ll explain why, below.

    Over the past few years, I’ve implemented many iterators in this way, using PHP’s implicit array manipulation functions (reset(), current(), key(), next()). These functions are very convenient because PHP arrays are so powerful — arrays in PHP work like ordered hash tables in other languages.

    PHP’s implicit management of an array’s iteration index (the value that is incremented by next() and referenced by key()) is indeed convenient, but the convenience can sometimes be offset by its very implicitness — the value is hidden from you, the PHP programmer.

    In PHP, generic array iteration (without the implicit iterator) isn’t actually as simple as it sounds. Remember that arrays aren’t arrays in the traditional sense, but ordered hash tables. Consider this:

    $data = array('zero','one','two','three');
    for ($i=0; $i<count($data); $i++) {
        // yeah, don't calculate count() on every iteration
        echo "{$data[$i]}\n";
    }

    Output:

    zero
    one
    two
    three

    This first example is easy to iterate — the array contains sequential, numeric, zero-based keys. It gets more complicated when using non-sequential, and non-numeric keys:

    $data = array(
        'apple',
        'cow' => 'moo',
        'pig' => 'oink',
        'orange'
    );
    for ($i=0; $i<count($data); $i++) {
        echo "{$data[$i]}\n";
    }

    Output:

    apple
    orange
    Notice: Undefined offset: 2 in - on line 10
    Notice: Undefined offset: 3 in - on line 10

    I could use foreach, but because a numeric loop illustrates the point more clearly, here’s how I might implement the above code so that it works:

    $data = array(
        'apple',
        'cow' => 'moo',
        'pig' => 'oink',
        'orange'
    );
    $k = array_keys($data);
    for ($i=0; $i<count($data); $i++) {
        echo "{$data[$k[$i]]}\n";
    }

    Output:

    apple
    moo
    oink
    orange

    This brings us back to the Iterator implementation. Why isn’t the code above correct? Take a closer look at this:

    public function valid()
    {
        return (current($this->_data) !== false);
    }

    A value of false in the array is indistinguishable from a false value returned by current(). Using the above implementation with the following array would cause it to bail after orange (and subsequently might cause you to waste a day tracking down the cause):

    array(
        'apple',
        'orange',
        false,
        'banana',
    );

    On Tuesday night, I updated the manual to use an improved Iterator implementation. It’s probably a bit slower (so you can use the internal-indexing implementation if you’re sure your arrays will never contain false), but my implementation is more robust.

    <?php
    /**
     * A mixed-key iterator implementation
     *
     * Note: these array_keys() calls are slow. The array keys could be cached
     * as long as the cache value is invalidated when $_data is changed.
     */
    class It implements Iterator
    {
    	public $_data = array();
    	protected $_index = 0;
     
    	public function rewind()
    	{
    		$this->_index = 0;
    	}
     
    	public function current()
    	{
    		$k = array_keys($this->_data);
    		return $this->_data[$k[$this->_index]];
    	}
     
    	public function key()
    	{
    		$k = array_keys($this->_data);
    		return $k[$this->_index];
    	}
     
    	public function next()
    	{
    		$k = array_keys($this->_data);
    		if (isset($k[++$this->_index])) {
    			return $this->_data[$k[$this->_index]];
    		} else {
    			return false;
    		}
    	}
     
    	public function valid()
    	{
    		$k = array_keys($this->_data);
    		return isset($k[$this->_index]);
    	}
     
    }

    To use it:

    $it = new It;
    $it->_data = array(
        'one' => 1,
        'two' => 2,
        false,
        'four' => 4
    );
    foreach ($it as $k => $v) {
        echo "{$k}: {$v}\n";
    }

    Output:

    one: 1
    two: 2
    0: 
    four: 4

    13 Responses

    Feed for this Entry
    • I have always just used:

      public function valid () {
      $key = $this->key();
      return $key !== false ν
      }

      **FYI: key() is incorrect in the second example.

    • Can you elaborate, Sam? Which example is incorrect?

      S

    • Very nice workaround to a problem I'm sure a lot of people have run in to. I, personally, have had to write numerous lines of code to check values in my array's and it's nice to see other's are coming up with ways to achieve similar results with better scalability.

    • I assume you could do $this->keys = array_keys($array); in __construct to avoid doing it on every call to every method, as it's currently demonstrated in the manual.

      Anyway what Sam said seems better.

    • Sean,
      My mistake, what I meant was, key in the second example should perform an isset() check in order to avoid an E_NOTICE error. It is not technically incorrect.

      And in my example:

      return $key !== false ν

      should be

      return $key !== false and $key !== null

      but that is on the condition that key is written:

      public function key () {
      return key($this->_data);
      }

    • Diogo: the caching you mention is what I hinted at in the class's docblock. Your method will work just fine as long as you're sure the data will never change.

      The implementation I posted doesn't have to worry about invalidating the cache if $_data changes (this could easily be solved by making $_data protected, but I wanted to keep the example as clear as possible).

      S

    • The Manual says: "If the internal pointer points beyond the end of the elements list or the array is empty, key() returns NULL.

      So the check should be:

      public function valid()
      {
      return key($this->var) !== null;
      }

      With this change the old example works fine. No need for an extra object variable or array_keys().

    • if all your iterator is doing is referencing an internal array property, then instead extend IteratorAggregate, which only requires implementing this:

      /**  * @return ArrayIterator */ public function getIterator()//IteratorAggregate interface {    return new ArrayIterator($this->properties); }

      The manual's example code makes some assumptions that are not explained. But Iterators are *not* broken in PHP

    • I think Tom above is correct:

      The real issue is that key() returns null, while current() returns false, when faced with the same situation (as he quoted).

      How someone is supposed to distinguish between an element existing and not (set to false), using current() is prone to error, as you've encountered.

      IteratorAggregate helps, but can only be reliable if using PHP 5.3+.

    • Tom,
      It is possible for key() to return FALSE in 5.2, if the argument is not an object or an array (usually caused by a failure to initialize when dynamically assigning the value of the hash being iterated), which is why I always include a check for FALSE, just to be on the safe side. But I do agree if valid was rewritten, the first example would work fine.

    • Sam,

      Do you have an example? I am not able to reproduce that. I would like to see is I could get a problem.

    • Indeed iterators are trickier than people think and i believe they should always be covered with rigid unit tests. I had problems with existing code like this on a few occasions .... and each every time it took a lot of scratching my head to find that it was the iterator that was broken ..... as i usually assume my latest change broke stuff ;- )

    • Tom,

      using PHP 5.2.5:

      print gettype( key(null) );

      it returns boolean, but you may have to suppress an E_WARNING error.