142 lines
13 KiB
HTML
142 lines
13 KiB
HTML
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
|
||
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
|
||
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Be Book - System Overview - The Interface Kit</title><link rel="stylesheet" href="be_book.css" type="text/css" media="all" /><link rel="shortcut icon" type="image/vnd.microsoft.icon" href="./images/favicon.ico" /><!--[if IE]>
|
||
<link rel="stylesheet" type="text/css" href="be_book_ie.css" />
|
||
<![endif]--><meta name="generator" content="DocBook XSL Stylesheets V1.73.2" /><meta name="keywords" content="Access, BeOS, BeBook, API" /><link rel="start" href="index.html" title="The Be Book" /><link rel="up" href="TheInterfaceKit_Overview.html" title="The Interface Kit" /><link rel="prev" href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space" /><link rel="next" href="DragAndDrop.html" title="Drag And Drop" /></head><body><div id="header"><div id="headerT"><div id="headerTL"><a accesskey="p" href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space"><img src="./images/navigation/prev.png" alt="Prev" /></a> <a accesskey="u" href="TheInterfaceKit_Overview.html" title="The Interface Kit"><img src="./images/navigation/up.png" alt="Up" /></a> <a accesskey="n" href="DragAndDrop.html" title="Drag And Drop"><img src="./images/navigation/next.png" alt="Next" /></a></div><div id="headerTR"><div id="navigpeople"><a href="http://www.haiku-os.org"><img src="./images/People_24.png" alt="haiku-os.org" title="Visit The Haiku Website" /></a></div><div class="navighome" title="Home"><a accesskey="h" href="index.html"><img src="./images/navigation/home.png" alt="Home" /></a></div><div class="navigboxed" id="navigindex"><a accesskey="i" href="ClassIndex.html" title="Index">I</a></div><div class="navigboxed" id="naviglang" title="English">en</div></div><div id="headerTC">The Be Book - System Overview - The Interface Kit</div></div><div id="headerB">Prev: <a href="TheInterfaceKit_The_Coordinate_Space.html">The Coordinate Space</a> Up: <a href="TheInterfaceKit_Overview.html">The Interface Kit</a> Next: <a href="DragAndDrop.html">Drag And Drop</a></div><hr /></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><div xmlns:d="http://docbook.org/ns/docbook"><h2 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding"></a>Character Encoding</h2></div></div></div><p>
|
||
The BeOS encodes characters using the UTF-8 transformation of Unicode
|
||
character values. Unicode is a standard encoding scheme for all the major
|
||
scripts of the world—including, among others, extended Latin,
|
||
Cyrillic, Greek, Devanagiri, Telugu, Hebrew, Arabic, Tibetan, and the
|
||
various character sets used by Chinese, Japanese, and Korean. It assigns
|
||
a unique and unambiguous 16-bit value to each character, making it
|
||
possible for characters from various languages to co-exist in the same
|
||
document. Unicode makes it simpler to write language-aware software
|
||
(though it doesn't solve all the problems). It also makes a wide variety
|
||
of symbols available to an application, even if it's not concerned with
|
||
covering more than one language.
|
||
</p><p>
|
||
Unicode's one disadvantage is that all characters have a width of 16
|
||
bits. Although 16 bits are necessary for a universal encoding system and
|
||
a fixed width for all characters is important for the standard, there are
|
||
many contexts in which byte-sized characters would be easier to work with
|
||
and take up less memory (besides being more familiar and backwards
|
||
compatible with existing code). UTF-8 is designed to address this problem.
|
||
</p><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_UTF8"></a>UTF-8</h3></div></div></div><p>
|
||
UTF-8 stands for "UCS Transformation Format, 8-bit form" (and UCS stands
|
||
for "Universal Multiple-Octet Character Set," another name for Unicode).
|
||
UTF-8 transforms 16-bit Unicode values into a variable number of 8-bit
|
||
units. It takes advantage of the fact that for values equal to or less
|
||
than 0x007f, the Unicode character set matches the 7-bit ASCII character
|
||
set—in other words, Unicode adopts the ASCII standard, but encodes
|
||
each character in 16 bits. UTF-8 strips ASCII values back to 8 bits and
|
||
uses two or three bytes to encode Unicode values over 0x007f.
|
||
</p><p>
|
||
The high bit of each UTF-8 byte indicates the role it plays in the
|
||
encoding:
|
||
</p><ul class="itemizedlist"><li><p>
|
||
If the high bit is 0, the byte stands alone and encodes an ASCII
|
||
value.
|
||
</p></li><li><p>
|
||
If the high bit is 1, the byte is part of a multiple-byte character
|
||
representation.
|
||
</p></li></ul><p>
|
||
In addition, the first byte of a multibyte character indicates how many
|
||
bytes are in the encoding: The number of high bits that are set to 1
|
||
(before a bit is 0) is the number of bytes it takes to represent the
|
||
character. Therefore, the first byte of a multibyte character will always
|
||
have at least two high bits set. The other bytes in a multibyte encoding
|
||
have just one high bit set.
|
||
</p><p>
|
||
To illustrate, a character encoded in one UTF-8 byte will look like this
|
||
(where a '1' or a '0' indicates a control bit specified by the standard
|
||
and an 'x' is a bit that contributes to the character value):
|
||
</p><pre class="screen">
|
||
0xxxxxxx
|
||
</pre><p>
|
||
A character encoded in two bytes has the following arrangement of bits:
|
||
</p><pre class="screen">
|
||
110xxxxx 10xxxxxx
|
||
</pre><p>
|
||
And a character encoded in three bytes is laid out as follows:
|
||
</p><pre class="screen">
|
||
1110xxxx 10xxxxxx 10xxxxxx
|
||
</pre><p>
|
||
Note that any 16-bit value can be encoded in three UTF-8 bytes. However,
|
||
UTF-8 discards leading zeroes and always uses the fewest possible number
|
||
of bytes—so it can encode Unicode values less than 0x0080 in a
|
||
single byte and values less than 0x0800 in two bytes.
|
||
</p><p>
|
||
In addition to the codings illustrated above, UTF-8 takes four bytes to
|
||
translate a Unicode surrogate pair—two conjoined 16-bit values that
|
||
together encode a character that's not part of the standard. Surrogates
|
||
are extremely rare.
|
||
</p></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_ASCII_Compatibility"></a>ASCII Compatibility</h3></div></div></div><p>
|
||
The UTF-8 encoding scheme has several advantages:
|
||
</p><ul class="itemizedlist"><li><p>
|
||
The single byte that encodes an ASCII value can't be confused with a
|
||
byte that's part of a multiple-byte encoding. You can test a UTF-8 byte
|
||
for an ASCII value without considering surrounding bytes; if there's a
|
||
match, you can be sure the byte is the ASCII character. UTF-8 is fully
|
||
compatible with ASCII.
|
||
</p></li><li><p>
|
||
The first (or only) byte of a character can't be confused with a byte
|
||
inside a multibyte sequence. It's simple to find where a character
|
||
begins. For example, this macro will do it:
|
||
</p><pre class="programlisting example cpp">#define <code class="function">BEGINS_CHAR</code>(<code class="parameter">byte</code>) ((<code class="parameter">byte</code> & 0xc0) != 0x80)</pre></li><li><p>
|
||
The string functions in the standard C library—for example,
|
||
strcat() and strlen()—can operate on a UTF-8 string.
|
||
</p></li><li><p>
|
||
However, it's important to remember that strlen() measures the string
|
||
in bytes, not characters. Some Interface Kit functions, like
|
||
GetEscapements() in the BFont class, ask for a character count;
|
||
strlen() can't provide the answer. Instead, you need to do something
|
||
like this to count the characters in a string:
|
||
</p><pre class="programlisting example cpp"><span class="type">int32</span> <code class="varname">count</code> = 0;
|
||
while ( *<code class="varname">p</code> != '0' ) {
|
||
if ( <code class="function">BEGINS_CHAR</code>(*<code class="varname">p</code>) )
|
||
<code class="varname">count</code>++;
|
||
<code class="varname">p</code>++;
|
||
}</pre></li><li><p>
|
||
UTF-8 preserves the numerical ordering of Unicode character values.
|
||
String comparison functions—such as <code class="function">strcasecmp()</code>—will put
|
||
UTF-8 strings in the correct order.
|
||
</p></li><li><p>
|
||
However, you should be careful when using the string comparison
|
||
functions to order a set of UTF-8 strings. Unicode tries for a
|
||
universal encoding and orders characters in a way that's generically
|
||
correct, but it may not be correct for specific characters in specific
|
||
languages. (Because it follows ASCII, UTF-8 is correct for English.)
|
||
</p></li><li><p>
|
||
For European languages, UTF-8 generally yields more compact data
|
||
representations than would Unicode. Most of the characters in a string
|
||
can be encoded in a single byte. In many other cases, UTF-8 is no less
|
||
compact than Unicode.
|
||
</p></li></ul></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_UTF8_And_The_BeOS"></a>UTF-8 and the BeOS</h3></div></div></div><p>
|
||
The BeOS assumes UTF-8 encoding in most cases. For example, a <code class="constant">B_KEY_DOWN</code>
|
||
message reports the character that's mapped to the key the user pressed
|
||
as a UTF-8 value. That value is then passed as a string to
|
||
<a class="link" href="BView.html#BView_KeyDown" title="KeyDown()"><code class="methodname">KeyDown()</code></a>
|
||
along with the byte count:
|
||
</p><pre class="programlisting example cpp">virtual <span class="type">void</span> <code class="methodname">KeyDown</code>(<span class="type">const char*</span> <code class="parameter">bytes</code>, <span class="type">int32</span> <code class="parameter">numBytes</code>);</pre><p>
|
||
You can expect the bytes string to always contain at least one byte. And,
|
||
of course, you can test it for an ASCII value without caring whether it's
|
||
UTF-8:
|
||
</p><pre class="programlisting example cpp">if ( <code class="varname">bytes</code>[0] == <code class="constant">B_TAB</code> )
|
||
. . .</pre><p>
|
||
Similarly, objects that display text in the user interface—such as
|
||
window titles and button labels—expect to be passed UTF-8 encoded
|
||
strings, and hand you a UTF-8 string if you ask for the title or label.
|
||
These objects display text using a system font—either the system
|
||
plain font (<code class="varname">be_plain_font</code>) or the bold font (<code class="varname">be_bold_font</code>). The BFont
|
||
class allows other character encodings, which you may need to use in
|
||
limited circumstances from time to time, but the system fonts are
|
||
constrained to UTF-8 (<code class="constant">B_UNICODE_UTF8</code> encoding). The FontPanel preferences
|
||
application doesn't permit users to change the encoding of a system font.
|
||
</p><p>
|
||
Unicode and UTF-8 are documented in The Unicode Standard, Version 2.0,
|
||
published by Addison-Wesley. See that book for complete information on
|
||
Unicode and for a description of how UTF-8 encodes surrogate pairs.
|
||
</p></div></div><div id="footer"><hr /><div id="footerT">Prev: <a href="TheInterfaceKit_The_Coordinate_Space.html">The Coordinate Space</a> Up: <a href="TheInterfaceKit_Overview.html">The Interface Kit</a> Next: <a href="DragAndDrop.html">Drag And Drop</a> </div><div id="footerB"><div id="footerBL"><a href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space"><img src="./images/navigation/prev.png" alt="Prev" /></a> <a href="TheInterfaceKit_Overview.html" title="The Interface Kit"><img src="./images/navigation/up.png" alt="Up" /></a> <a href="DragAndDrop.html" title="Drag And Drop"><img src="./images/navigation/next.png" alt="Next" /></a></div><div id="footerBR"><div><a href="http://www.haiku-os.org"><img src="./images/People_24.png" alt="haiku-os.org" title="Visit The Haiku Website" /></a></div><div class="navighome" title="Home"><a accesskey="h" href="index.html"><img src="./images/navigation/home.png" alt="Home" /></a></div></div><div id="footerBC"><a href="http://www.access-company.com/home.html" title="ACCESS Co."><img alt="Access Company" src="./images/access_logo.png" /></a></div></div></div><div id="licenseFooter"><div id="licenseFooterBL"><a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" title="Creative Commons License"><img alt="Creative Commons License" style="border-width:0" src="https://licensebuttons.net/l/by-nc-nd/3.0/88x31.png" /></a></div><div id="licenseFooterBR"><a href="./LegalNotice.html">Legal Notice</a></div><div id="licenseFooterBC"><span id="licenseText">This work is licensed under a
|
||
<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/">Creative
|
||
Commons Attribution-Non commercial-No Derivative Works 3.0 License</a>.</span></div></div></body></html>
|