Phly, boy, phly
the weblog and site of Matthew Weier O'Phinney

Wednesday, January 31. 2007

PHP decoding of Javascript encodeURIComponent values

Recently, I was having some issues with a site that was attempting to use UTF-8 to support multiple languages. Basically, you could enter UTF-8 characters -- for instance, characters with umlauts -- but they weren't going through to the web services or database correctly. After more debugging, I discovered that when I turned off JavaScript on the site and used the degradable interface to submit the form via plain old HTTP, everything worked fine -- which meant the issue was with how we were sending the data via XHR.

We were using Prototype, and in particular, POSTing data back to our site -- which meant that the UI designer was using Form.serialize() to encode the data for transmission. This in turn uses the JavaScript function encodeURIComponent() to do its dirty work.
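
To make the encoding concrete, here is a small sketch of what such a value looks like on the wire. It uses PHP's rawurlencode() as a stand-in for encodeURIComponent(); that substitution is an assumption made for illustration. The two agree for non-ASCII characters like the one below, but differ on a few ASCII punctuation characters such as ! ' ( ) *.

```php
<?php
// "ü" (U+00FC) is two bytes in UTF-8: 0xC3 0xBC.
$umlaut = "\xC3\xBC";

// rawurlencode() stands in for JavaScript's encodeURIComponent();
// each byte of the UTF-8 sequence becomes one %XX escape.
$encoded = rawurlencode("M" . $umlaut . "ller");
echo $encoded, "\n"; // M%C3%BCller

// rawurldecode() reverses the escapes and yields the original bytes.
$decoded = rawurldecode($encoded);
echo bin2hex($decoded), "\n"; // 4dc3bc6c6c6572
```

The single character ü therefore travels as two escapes, %C3 and %BC, and any decoder has to put both bytes back to recover it.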

I tried a ton of things in PHP to decode this to UTF-8, before stumbling on a solution written in Perl. Basically, the solution uses a regular expression to grab urlencoded hex values out of a string, and then does a double conversion on the value, first to decimal and then to a character. The PHP version looks like this:


$value = preg_replace('/%([0-9a-f]{2})/ie', "chr(hexdec('\\1'))", $value);
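
To see the double conversion in isolation, the following sketch walks through it by hand for the two escapes %C3 and %BC, without the regular expression. The variable names are made up for illustration.

```php
<?php
// The hex digits captured from the escapes %C3 and %BC.
$hexPairs = array('C3', 'BC');

$bytes = '';
foreach ($hexPairs as $hex) {
    $decimal = hexdec($hex);  // first conversion: 'C3' -> 195, 'BC' -> 188
    $bytes  .= chr($decimal); // second conversion: 195 -> "\xC3", 188 -> "\xBC"
}

// Together the two bytes form the UTF-8 encoding of "ü".
echo bin2hex($bytes), "\n"; // c3bc
```

Note that the /e modifier in the one-liner above was deprecated in PHP 5.5 and removed in PHP 7; preg_replace_callback() is the usual replacement on those versions.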
 

We have a method in our code to detect if the incoming request is via XHR. In that logic, once XHR is detected, I then pass $_POST through the following function:


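function utf8Urldecode($value)
{
    if (is_array($value)) {
        // Recurse into nested arrays (e.g. array-style form fields).
        foreach ($value as $key => $val) {
            $value[$key] = utf8Urldecode($val);
        }
    } else {
        // Turn each %XX escape back into its byte: hex -> decimal -> character.
        // preg_replace_callback() is used because the /e modifier was
        // removed in PHP 7.
        $value = preg_replace_callback(
            '/%([0-9a-f]{2})/i',
            function ($m) { return chr(hexdec($m[1])); },
            (string) $value
        );
    }

    return $value;
}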
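
As a rough sketch of how the pieces might fit together: Prototype marks its Ajax requests with an X-Requested-With: XMLHttpRequest header, which PHP exposes as $_SERVER['HTTP_X_REQUESTED_WITH']. The isXhr() helper below is hypothetical (the site's actual detection method is not shown in this post), the decoder is renamed utf8UrldecodeSketch() to mark it as an illustration, and the request data is simulated so the example can run on its own.

```php
<?php
// Hypothetical stand-in for the site's XHR detection method.
function isXhr(array $server)
{
    return isset($server['HTTP_X_REQUESTED_WITH'])
        && strtolower($server['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';
}

// Recursive decoder with the same intent as utf8Urldecode() above.
function utf8UrldecodeSketch($value)
{
    if (is_array($value)) {
        foreach ($value as $key => $val) {
            $value[$key] = utf8UrldecodeSketch($val);
        }
        return $value;
    }

    // Replace each %XX escape with the byte it encodes.
    return preg_replace_callback(
        '/%([0-9a-f]{2})/i',
        function (array $m) {
            return chr(hexdec($m[1]));
        },
        (string) $value
    );
}

// Simulated request data; in a real request these would be $_SERVER and $_POST.
$server = array('HTTP_X_REQUESTED_WITH' => 'XMLHttpRequest');
$post   = array(
    'name' => 'M%C3%BCller',
    'tags' => array('caf%C3%A9', 'plain'),
);

if (isXhr($server)) {
    $post = utf8UrldecodeSketch($post);
}

echo bin2hex($post['name']), "\n";    // 4dc3bc6c6c6572
echo bin2hex($post['tags'][0]), "\n"; // 636166c3a9
```

Values without any %XX escapes, such as 'plain' here, pass through unchanged.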
 

This casts all UTF-8 urlencoded values in the $_POST array back to UTF-8, and from there we can continue processing as normal.

Man, but I can't wait until PHP 6 comes out and fixes these unicode issues...

Posted by Matthew Weier O'Phinney in PHP at 12:36 | Comments (11) | Trackbacks (0)

Comments

Matt, would you like to provide a little more information about the problem you are trying to solve in your example? Some hex values of your input and output would help in investigating it.

What do you mean by "casts all UTF-8 urlencoded values in the $_POST array back to UTF-8"?

~Thanks, Andrew
#1 Andrew Bidochko on 2007-01-31 16:31
Hi,
like Andrew, I'd like more explanation on the problem you solved.

I ask because I have a lot of XHR and I use UTF-8, so I'd like to reproduce the problem to see if I've got the same problem.

Thanks,

chris
#2 chris on 2007-02-01 03:44
I'm slightly confused here. To my knowledge, encodeURIComponent() encodes Unicode characters in UTF-8, then takes the resulting bytes and %XX-encodes them. If you grab $_POST['foo'], then you should have a valid UTF-8 string (if nobody's been messing with the POSTDATA), as PHP would decode the %XX bytes back to binary. Then whatever interacts with it just has to know it's UTF-8.

Are you using mbstring? Is it configured to mangle input/output, or have some weird internal encoding? Is it overriding functions?

For our site, we just built MySQL and compiled in UTF-8 as the default charset. We had a legacy forum for which we have to set latin1 in its config / db connection stuff, but other than that, everything Just Works(TM).
#3 sapphirecat on 2007-02-01 12:44
Exactly. encodeURIComponent() returns UTF-8 encoded characters, and the query string created by Prototype's Form.serialize() can easily be parsed by PHP's parse_str() function.
#4 Andrew Bidochko on 2007-02-01 13:33
foreach ($key => $val)

should be

foreach ($value as $key => $val)

right?

Just to be a little helpful ;-)
#5 willem on 2007-03-13 16:34
As far as I know, the existing PHP function urldecode() does this as well (at least it works with Hungarian letters).
#6 Zedas on 2007-03-17 12:12
Very interesting, if UTF-8 is the default charset. Thanks.
#7 Max on 2007-06-10 15:33
This line
foreach ($key => $val) {

should really be

foreach ($value as $key => $val) {
#8 Farhan Khan on 2007-06-13 19:15
I have used your function and passed it the relevant POST parameter. However, I am unable to decode the data sent from JS, especially when it contains control chars. Say I am embedding a copyright symbol in my string: I am using JSON.stringify() and then passing the result to encodeURIComponent(). On the PHP side, the string simply doesn't decode with json_decode(); an extra character is added before the copyright symbol.

My input string =
Thank you for registering your interest in the Mentoring for Growth
#9 Mehernosh on 2007-07-19 21:54
Also, when using multibyte chars, check out the mb_* functions in PHP :-)
#10 Lyubomir Petrov on 2007-08-23 06:55
Hello, I use IIS and ASP, but I have a similar problem.
I'm having some problems with decoding URLs on IIS6.
When I click on a UTF-8 encoded URL, the resulting query string has every non-English character replaced with a question mark. Why? I use UTF-8 encoding in my pages! Thank you
#11 rob on 2008-01-27 09:44

© 2004 - present, Matthew Weier O'Phinney
matthew-web <at> weierophinney.net