Phly, boy, phly
the weblog and site of Matthew Weier O'Phinney

Tuesday, May 16, 2006

mbstring comes to the rescue

I've been working with SimpleXML a fair amount lately, and have run into an issue a number of times with character encodings. Basically, if a string has a mixture of UTF-8 and non-UTF-8 characters, SimpleXML barfs, claiming the "String could not be parsed as XML."

I tried a number of solutions, hoping actually to automate it via mbstring INI settings; these schemes all failed. iconv didn't work properly. The only thing that did work was to convert the encoding to latin1 -- but this wreaked havoc with actual UTF-8 characters.

Then, through a series of trial-and-error, all-or-nothing shots, I stumbled on a simple solution. Basically, I needed to take two steps:

  • Detect the current encoding of the string
  • Convert that encoding to UTF-8

which is accomplished with:


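// Detect the string's current encoding.
$enc = mb_detect_encoding($xml);
// Re-encode the string as UTF-8, even when UTF-8 was detected.
$xml = mb_convert_encoding($xml, 'UTF-8', $enc);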
 

The conversion is performed even if the detected encoding is UTF-8; the conversion ensures that all characters in the string are properly encoded when done.
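
As a minimal sketch of how the two steps might be wrapped up before handing the string to SimpleXML: the helper name `to_utf8`, its candidate-encoding list, and the sample input are illustrative assumptions, not part of the original code. Unlike the two-liner above, the sketch passes an explicit candidate list with strict detection and bails out if detection fails.

```php
<?php
// Hypothetical helper (an assumption for illustration): normalize a string
// to UTF-8 using the detect-then-convert steps described in the post.
function to_utf8(string $xml, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode rejects a candidate encoding when the bytes are not
    // valid in it, rather than settling for the closest match.
    $enc = mb_detect_encoding($xml, $candidates, true);
    if ($enc === false) {
        throw new InvalidArgumentException('Could not detect the input encoding');
    }

    // Convert from the detected encoding, even if it is already UTF-8.
    return mb_convert_encoding($xml, 'UTF-8', $enc);
}

// "\xE9" is "é" in ISO-8859-1; on its own it is not a valid UTF-8 sequence,
// so SimpleXML would refuse to parse this string as-is.
$latin1 = "<name>Caf\xE9</name>";

$doc = simplexml_load_string(to_utf8($latin1));
echo $doc, "\n";
```

The candidate list is the part most likely to need adjusting for real input. Note also that this only helps when the whole string is in one detectable encoding; a string that genuinely mixes encodings cannot be repaired by a single conversion.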

It's a non-intuitive solution, but it works! QED.

Posted by Matthew Weier O'Phinney in PHP at 18:25


Comments

I like mbstring - it's an easy way to push PHP < 6 into the Unicode world. Overloading on, with the database encoding, the connection encoding (MySQL 4.1 in my case), and the HTML charset all set to UTF-8 - that's all, and it works nicely.
#1 soenke on 2006-05-17 03:23
You can replace

$enc = mb_detect_encoding($xml);
$xml = mb_convert_encoding($xml, 'UTF-8', $enc);

with

$xml = mb_convert_encoding($xml, 'UTF-8', 'auto');

cya
#2 Pedro Faria on 2006-05-17 07:47
The problem with using 'auto' is that it's not a full list of encodings -- as the PHP manual says, it consists of "ASCII, JIS, UTF-8, EUC-JP, SJIS". If the string's encoding is outside of that list, using 'auto' could lead to character mangling.
#2.1 Matthew Weier O'Phinney on 2006-05-17 08:44
[quote]I've been working with SimpleXML a fair amount lately, and have run into an issue a number of times with character encodings. Basically, if a string has a mixture of UTF-8 and non-UTF-8 characters, SimpleXML barfs, claiming the "String could not be parsed as XML."[/quote]

I am wondering why you have a string (an XML document?) that contains both UTF-8 and non-UTF-8 characters. This is simply not a valid string and should be rejected as input altogether.
#3 Derick Rethans on 2006-05-18 12:25
I'm actually not sure how it's happening. The XML is coming in via an XML-RPC client, and the only thing I can think of is that the base encoding on the client end is something other than UTF-8, and when it forms the XML request, it's inserting UTF-8 characters into a non-UTF-8 XML base (i.e., the XML tags are non-UTF-8, but the data they contain is UTF-8).
#3.1 Matthew Weier O'Phinney on 2006-05-18 12:31
xml-rpc is broken as designed, and does not support utf-8.

http://www.decafbad.com/blog/2002/11/26/oooccb
#3.1.1 Gregor J. Rothfuss on 2006-05-18 13:20
Thanks for the link -- I was unaware of that issue. Certainly explains a lot!
#3.1.1.1 Matthew Weier O'Phinney on 2006-05-18 13:25
I had the same problem, but my solution was to use urlencode.
#4 edude souza on 2008-01-02 09:04
