I've been working with SimpleXML a fair amount lately, and have run into an
issue a number of times with character encodings. Basically, if a string
has a mixture of UTF-8 and non-UTF-8 characters, SimpleXML barfs, claiming
the "String could not be parsed as XML."
I tried a number of solutions, hoping to automate the fix via mbstring
INI settings; these schemes all failed, and iconv didn't work properly
either. The only thing that did work was converting the encoding to
latin1, but this wreaked havoc with actual UTF-8 characters.
Then, through a series of trial-and-error, all-or-nothing shots, I stumbled
on a simple solution. I needed to take two steps:
- Detect the current encoding of the string
- Convert that encoding to UTF-8
which is accomplished with:
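A minimal sketch of these two steps, assuming the mbstring extension is
loaded. The helper name `to_utf8`, the candidate encoding list, and the
latin1 fallback are illustrative assumptions, not something the steps
above prescribe:

```php
<?php
// Hypothetical helper: detect the encoding of a string, then convert it to UTF-8.
function to_utf8(string $xml): string
{
    // Step 1: detect the current encoding. The candidate list is an
    // assumption; strict mode rejects byte sequences that are not valid
    // in a candidate encoding.
    $enc = mb_detect_encoding($xml, ['UTF-8', 'ISO-8859-1'], true);
    if ($enc === false) {
        // Assumed fallback when nothing matches.
        $enc = 'ISO-8859-1';
    }

    // Step 2: convert to UTF-8, even when the detected encoding is
    // already UTF-8.
    return mb_convert_encoding($xml, 'UTF-8', $enc);
}

// Usage: a document containing a latin1 e-acute byte (0xE9).
$raw = "<?xml version=\"1.0\"?><note>caf\xE9 au lait</note>";
$doc = new SimpleXMLElement(to_utf8($raw));
echo $doc, "\n";
```

Passing an explicit candidate list keeps detection from depending on the
`mbstring.detect_order` INI setting, which may differ between servers.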
The conversion is performed even if the detected encoding is UTF-8; the
conversion ensures that all characters in the string are properly
encoded when done.
It's a non-intuitive solution, but it works! QED.