Recipe 16.12 Reading or Writing Unicode Characters
16.12.1 Problem
You want to read
Unicode-encoded characters from a file, database, or form, or you
want to write them.
16.12.2 Solution
Use
utf8_encode( ) to convert single-byte ISO-8859-1 encoded
characters to UTF-8:
print utf8_encode("Kurt G\xf6del is swell."); // \xf6 is ö in ISO-8859-1
Use utf8_decode( ) to
convert UTF-8 encoded characters to single-byte ISO-8859-1 encoded
characters:
print utf8_decode("Kurt G\xc3\xb6del is swell.");
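Both functions handle only ISO-8859-1, and they are deprecated as of PHP 8.2. As a minimal sketch of the same round trip on newer versions, assuming the mbstring extension is available, mb_convert_encoding( ) takes the target and source encodings explicitly. The byte escapes keep the result independent of how the script file itself is saved:

```php
<?php
// Assumes the mbstring extension is loaded; mb_convert_encoding() is an
// alternative to the functions above, not part of the original recipe.

// "Kurt Gödel is swell." as ISO-8859-1 bytes: ö is the single byte 0xF6.
$latin1 = "Kurt G\xf6del is swell.";

// ISO-8859-1 -> UTF-8: ö becomes the two bytes 0xC3 0xB6.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// UTF-8 -> ISO-8859-1: the round trip restores the original bytes.
$back = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Show the two bytes that now encode ö (offset 6, right after "Kurt G").
print bin2hex(substr($utf8, 6, 2)) . "\n";
print ($back === $latin1 ? "round trip ok" : "round trip failed") . "\n";
```

Characters that have no ISO-8859-1 equivalent cannot survive the conversion back from UTF-8, so the round trip is only lossless for text that started out as ISO-8859-1.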
16.12.3 Discussion
A single byte can hold 256 different
character codes, but ASCII defines only the first 128 of them.
The characters between codes 0 and 127 are standardized: control
characters, letters and numbers, and punctuation. There are different
rules, however, for the characters that codes 128-255 map to. One
encoding is called ISO-8859-1, which includes characters necessary
for writing most European languages, such as the ö
in Gödel or the ñ in pestaña. Many
languages, though, require more than 256 characters, and a character
set that can express more than one language requires even more
characters. This is where Unicode saves the day; its
UTF-8
encoding can represent more than a million characters.
This increased functionality comes at the cost of space. ASCII
characters are stored in just one byte; UTF-8 encoded characters need
up to four bytes. Table 16-2 shows the
byte representations of UTF-8 encoded
characters.
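The space cost is easy to observe with strlen( ), which counts bytes rather than characters. The sketch below writes the same word in both encodings using byte escapes, so the result does not depend on the encoding of the script file:

```php
<?php
// "pestaña" in two encodings, spelled out byte by byte.
$latin1 = "pesta\xf1a";       // ñ is one byte (0xF1) in ISO-8859-1
$utf8   = "pesta\xc3\xb1a";   // ñ is two bytes (0xC3 0xB1) in UTF-8

// strlen() counts bytes, not characters.
print strlen($latin1) . "\n"; // 7
print strlen($utf8) . "\n";   // 8

// Text made only of ASCII characters costs nothing extra in UTF-8:
// each character still occupies a single byte.
$ascii = "swell";
print strlen($ascii) . "\n";  // 5
```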
Table 16-2. UTF-8 byte representation
| Character code range | Bytes used | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| 0x00000000 - 0x0000007F | 1 | 0xxxxxxx | | | |
| 0x00000080 - 0x000007FF | 2 | 110xxxxx | 10xxxxxx | | |
| 0x00000800 - 0x0000FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 0x00010000 - 0x001FFFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
In Table 16-2, the x positions
represent bits used for actual character data. The
least
significant bit is the rightmost bit in the rightmost byte. In
multibyte characters, the number of leading 1 bits in the leftmost
byte is the same as the number of bytes in the character.
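The bit layout in Table 16-2 can be sketched directly with shifts and masks. The two helper functions below, codepoint_to_utf8( ) and utf8_sequence_length( ), are hypothetical names used for illustration and are not part of PHP. The encoder assumes its argument lies within the ranges in the table and performs no validation:

```php
<?php
// Hypothetical helpers for illustration; neither function is built into PHP.

// Encode one code point by filling the x positions shown in Table 16-2.
// Assumes 0 <= $cp <= 0x1FFFFF; no validation is performed.
function codepoint_to_utf8(int $cp): string {
    if ($cp < 0x80) {                 // 1 byte:  0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Read the sequence length from the leading bits of the first byte.
// Returns 0 for a byte that cannot start a sequence (such as 10xxxxxx).
function utf8_sequence_length(string $leadByte): int {
    $b = ord($leadByte);
    if (($b & 0x80) === 0x00) return 1;  // 0xxxxxxx
    if (($b & 0xE0) === 0xC0) return 2;  // 110xxxxx
    if (($b & 0xF0) === 0xE0) return 3;  // 1110xxxx
    if (($b & 0xF8) === 0xF0) return 4;  // 11110xxx
    return 0;
}

print bin2hex(codepoint_to_utf8(0xF6)) . "\n";    // ö: c3b6
print bin2hex(codepoint_to_utf8(0x20AC)) . "\n";  // euro sign: e282ac
print utf8_sequence_length("\xe2") . "\n";        // 3
```

The result for 0xF6 matches the \xc3\xb6 bytes used for the ö in the Solution.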
16.12.4 See Also
Documentation on utf8_encode( ) at
http://www.php.net/utf8-encode and
utf8_decode( ) at
http://www.php.net/utf8-decode; more information
on Unicode is available at the Unicode Consortium's
home page, http://www.unicode.org; the UTF-8 and
Unicode FAQ at
http://www.cl.cam.ac.uk/~mgk25/unicode.html is
also helpful.