PHP decoding of Javascript encodeURIComponent values

Recently, I was having some issues with a site that was attempting to use UTF-8 in order to support multiple languages. Basically, you could enter UTF-8 characters — for instance, characters with umlauts — but they weren't going through to the web services or database correctly. After more debugging, I discovered that when I turned off javascript on the site, and used the degradable interface to submit the form via plain old HTTP, everything worked fine — which meant the issue was with how we were sending the data via XHR.

We were using Prototype, and in particular, POSTing data back to our site — which meant that the UI designer was using Form.serialize() to encode the data for transmission. This in turn uses the javascript function encodeURIComponent() to do its dirty work.

I tried a ton of things in PHP to decode this to UTF-8, before stumbling on a solution written in Perl. Basically, the solution uses a regular expression to grab urlencoded hex values out of a string, and then does a double conversion on the value, first to decimal and then to a character. The PHP version looks like this:

$value = preg_replace('/%([0-9a-f]{2})/ie', \"chr(hexdec('\1'))\", $value);

We have a method in our code to detect if the incoming request is via XHR. In that logic, once XHR is detected, I then pass $_POST through the following function:

function utf8Urldecode($value)
{
    if (is_array($value)) {
        foreach ($key => $val) {
            $value[$key] = utf8Urldecode($val);
        }
    } else {
        $value = preg_replace('/%([0-9a-f]{2})/ie', 'chr(hexdec($1))', (string) $value);
    }

    return $value;
}

This casts all UTF-8 urlencoded values in the $_POST array back to UTF-8, and from there we can continue processing as normal.

Man, but I can't wait until PHP 6 comes out and fixes these unicode issues…