<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>The Be Book - System Overview - The Interface Kit</title><link rel="stylesheet" href="be_book.css" type="text/css" media="all" /><link rel="shortcut icon" type="image/vnd.microsoft.icon" href="./images/favicon.ico" /><!--[if IE]>
<link rel="stylesheet" type="text/css" href="be_book_ie.css" />
<![endif]--><meta name="generator" content="DocBook XSL Stylesheets V1.73.2" /><meta name="keywords" content="Access, BeOS, BeBook, API" /><link rel="start" href="index.html" title="The Be Book" /><link rel="up" href="TheInterfaceKit_Overview.html" title="The Interface Kit" /><link rel="prev" href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space" /><link rel="next" href="DragAndDrop.html" title="Drag And Drop" /></head><body><div id="header"><div id="headerT"><div id="headerTL"><a accesskey="p" href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space"><img src="./images/navigation/prev.png" alt="Prev" /></a> <a accesskey="u" href="TheInterfaceKit_Overview.html" title="The Interface Kit"><img src="./images/navigation/up.png" alt="Up" /></a> <a accesskey="n" href="DragAndDrop.html" title="Drag And Drop"><img src="./images/navigation/next.png" alt="Next" /></a></div><div id="headerTR"><div id="navigpeople"><a href="http://www.haiku-os.org"><img src="./images/People_24.png" alt="haiku-os.org" title="Visit The Haiku Website" /></a></div><div class="navighome" title="Home"><a accesskey="h" href="index.html"><img src="./images/navigation/home.png" alt="Home" /></a></div><div class="navigboxed" id="navigindex"><a accesskey="i" href="ClassIndex.html" title="Index">I</a></div><div class="navigboxed" id="naviglang" title="English">en</div></div><div id="headerTC">The Be Book - System Overview - The Interface Kit</div></div><div id="headerB">Prev: <a href="TheInterfaceKit_The_Coordinate_Space.html">The Coordinate Space</a>  Up: <a href="TheInterfaceKit_Overview.html">The Interface Kit</a>  Next: <a href="DragAndDrop.html">Drag And Drop</a></div><hr /></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><div xmlns:d="http://docbook.org/ns/docbook"><h2 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding"></a>Character 
Encoding</h2></div></div></div><p>
The BeOS encodes characters using the UTF-8 transformation of Unicode
character values. Unicode is a standard encoding scheme for all the major
scripts of the world—including, among others, extended Latin,
Cyrillic, Greek, Devanagari, Telugu, Hebrew, Arabic, Tibetan, and the
various character sets used by Chinese, Japanese, and Korean. It assigns
a unique and unambiguous 16-bit value to each character, making it
possible for characters from various languages to co-exist in the same
document. Unicode makes it simpler to write language-aware software
(though it doesn't solve all the problems). It also makes a wide variety
of symbols available to an application, even if it's not concerned with
covering more than one language.
</p><p>
Unicode's one disadvantage is that all characters have a width of 16
bits. Although 16 bits are necessary for a universal encoding system and
a fixed width for all characters is important for the standard, there are
many contexts in which byte-sized characters would be easier to work with
and take up less memory (besides being more familiar and backwards
compatible with existing code). UTF-8 is designed to address this problem.
</p><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_UTF8"></a>UTF-8</h3></div></div></div><p>
UTF-8 stands for "UCS Transformation Format, 8-bit form" (and UCS stands
for "Universal Multiple-Octet Coded Character Set," another name for Unicode).
UTF-8 transforms 16-bit Unicode values into a variable number of 8-bit
units. It takes advantage of the fact that for values equal to or less
than 0x007f, the Unicode character set matches the 7-bit ASCII character
set—in other words, Unicode adopts the ASCII standard, but encodes
each character in 16 bits. UTF-8 strips ASCII values back to 8 bits and
uses two or three bytes to encode Unicode values over 0x007f.
</p><p>
The high bit of each UTF-8 byte indicates the role it plays in the
encoding:
</p><ul class="itemizedlist"><li><p>
If the high bit is 0, the byte stands alone and encodes an ASCII
value.
</p></li><li><p>
If the high bit is 1, the byte is part of a multiple-byte character
representation.
</p></li></ul><p>
In addition, the first byte of a multibyte character indicates how many
bytes are in the encoding: The number of high bits that are set to 1
(before a bit is 0) is the number of bytes it takes to represent the
character. Therefore, the first byte of a multibyte character will always
have at least two high bits set. The other bytes in a multibyte encoding
have just one high bit set.
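</p><p>
The rule above can be sketched as a small function. This is only an
illustration: <code class="function">Utf8SequenceLength()</code> is a hypothetical helper, not part of
the Interface Kit, and it makes no attempt to reject bytes that can never
appear in valid UTF-8.
</p><pre class="programlisting example cpp">#include &lt;cstdint&gt;

// Hypothetical helper (not a Kit function): returns the number of bytes in
// the UTF-8 sequence that starts with `lead`, or 0 if `lead` is a
// continuation byte (10xxxxxx) and so can't begin a character.
int Utf8SequenceLength(uint8_t lead)
{
    if ((lead &amp; 0x80) == 0x00)
        return 1;    // 0xxxxxxx: stands alone and encodes an ASCII value
    if ((lead &amp; 0xc0) == 0x80)
        return 0;    // 10xxxxxx: a byte inside a multibyte sequence

    // Count the high bits that are set to 1 before the first 0 bit.
    int length = 0;
    for (uint8_t mask = 0x80; mask != 0 &amp;&amp; (lead &amp; mask) != 0; mask &gt;&gt;= 1)
        length++;
    return length;
}</pre><p>
With this sketch, <code class="function">Utf8SequenceLength</code>(0x41) returns 1,
<code class="function">Utf8SequenceLength</code>(0xc3) returns 2, and
<code class="function">Utf8SequenceLength</code>(0xa9) returns 0.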
</p><p>
To illustrate, a character encoded in one UTF-8 byte will look like this
(where a '1' or a '0' indicates a control bit specified by the standard
and an 'x' is a bit that contributes to the character value):
</p><pre class="screen">
0xxxxxxx
</pre><p>
A character encoded in two bytes has the following arrangement of bits:
</p><pre class="screen">
110xxxxx 10xxxxxx
</pre><p>
And a character encoded in three bytes is laid out as follows:
</p><pre class="screen">
1110xxxx 10xxxxxx 10xxxxxx
</pre><p>
Note that any 16-bit value can be encoded in three UTF-8 bytes. However,
UTF-8 discards leading zeroes and always uses the fewest possible number
of bytes—so it can encode Unicode values less than 0x0080 in a
single byte and values less than 0x0800 in two bytes.
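</p><p>
The three layouts and the "fewest bytes" rule can be combined into a short
encoder. The following is a minimal sketch, not system code:
<code class="function">EncodeUtf8()</code> is a hypothetical name, the caller is assumed to supply
a buffer of at least three bytes, and surrogate values get no special
treatment.
</p><pre class="programlisting example cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Hypothetical helper: encodes one 16-bit Unicode value as UTF-8.
// Writes 1 to 3 bytes into `out` (assumed to hold at least 3 bytes)
// and returns the number of bytes written.
size_t EncodeUtf8(uint16_t value, unsigned char* out)
{
    if (value &lt; 0x0080) {
        // 0xxxxxxx
        out[0] = static_cast&lt;unsigned char&gt;(value);
        return 1;
    }
    if (value &lt; 0x0800) {
        // 110xxxxx 10xxxxxx
        out[0] = static_cast&lt;unsigned char&gt;(0xc0 | (value &gt;&gt; 6));
        out[1] = static_cast&lt;unsigned char&gt;(0x80 | (value &amp; 0x3f));
        return 2;
    }
    // 1110xxxx 10xxxxxx 10xxxxxx
    out[0] = static_cast&lt;unsigned char&gt;(0xe0 | (value &gt;&gt; 12));
    out[1] = static_cast&lt;unsigned char&gt;(0x80 | ((value &gt;&gt; 6) &amp; 0x3f));
    out[2] = static_cast&lt;unsigned char&gt;(0x80 | (value &amp; 0x3f));
    return 3;
}</pre><p>
For example, the value 0x00e9 (é) comes out as the two bytes 0xc3 0xa9, and
0x20ac (€) as the three bytes 0xe2 0x82 0xac.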
</p><p>
In addition to the codings illustrated above, UTF-8 takes four bytes to
translate a Unicode surrogate pair: two conjoined 16-bit values that
together encode a character whose value doesn't fit in 16 bits. Surrogates
are extremely rare.
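</p><p>
As a rough sketch of the four-byte case: the layout used below,
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, comes from the UTF-8 definition rather
than from anything shown in this section, and
<code class="function">EncodeSurrogatePair()</code> is a hypothetical helper that assumes a
four-byte output buffer.
</p><pre class="programlisting example cpp">#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Hypothetical helper: converts a surrogate pair into four UTF-8 bytes.
// Returns 4 on success, or 0 if the two values aren't a valid
// high/low surrogate pair. `out` is assumed to hold at least 4 bytes.
size_t EncodeSurrogatePair(uint16_t high, uint16_t low, unsigned char* out)
{
    if (high &lt; 0xd800 || high &gt; 0xdbff || low &lt; 0xdc00 || low &gt; 0xdfff)
        return 0;

    // Each surrogate carries 10 bits; the combined value starts at 0x10000.
    uint32_t value = 0x10000
        + (static_cast&lt;uint32_t&gt;(high - 0xd800) &lt;&lt; 10)
        + static_cast&lt;uint32_t&gt;(low - 0xdc00);

    out[0] = static_cast&lt;unsigned char&gt;(0xf0 | (value &gt;&gt; 18));
    out[1] = static_cast&lt;unsigned char&gt;(0x80 | ((value &gt;&gt; 12) &amp; 0x3f));
    out[2] = static_cast&lt;unsigned char&gt;(0x80 | ((value &gt;&gt; 6) &amp; 0x3f));
    out[3] = static_cast&lt;unsigned char&gt;(0x80 | (value &amp; 0x3f));
    return 4;
}</pre><p>
For example, the pair 0xd83d 0xde00 becomes the bytes 0xf0 0x9f 0x98 0x80.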
</p></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_ASCII_Compatibility"></a>ASCII Compatibility</h3></div></div></div><p>
The UTF-8 encoding scheme has several advantages:
</p><ul class="itemizedlist"><li><p>
The single byte that encodes an ASCII value can't be confused with a
byte that's part of a multiple-byte encoding. You can test a UTF-8 byte
for an ASCII value without considering surrounding bytes; if there's a
match, you can be sure the byte is the ASCII character. UTF-8 is fully
compatible with ASCII.
</p></li><li><p>
The first (or only) byte of a character can't be confused with a byte
inside a multibyte sequence. It's simple to find where a character
begins. For example, this macro will do it:
</p><pre class="programlisting example cpp">#define <code class="function">BEGINS_CHAR</code>(<code class="parameter">byte</code>) ((<code class="parameter">byte</code> &amp; 0xc0) != 0x80)</pre></li><li><p>
The string functions in the standard C library—for example,
strcat() and strlen()—can operate on a UTF-8 string.
</p></li><li><p>
However, it's important to remember that strlen() measures the string
in bytes, not characters. Some Interface Kit functions, like
GetEscapements() in the BFont class, ask for a character count;
strlen() can't provide the answer. Instead, you need to do something
like this to count the characters in a string:
</p><pre class="programlisting example cpp">/* p points to the first byte of a null-terminated UTF-8 string */
<span class="type">int32</span> <code class="varname">count</code> = 0;
while ( *<code class="varname">p</code> != '\0' ) {
    if ( <code class="function">BEGINS_CHAR</code>(*<code class="varname">p</code>) )
        <code class="varname">count</code>++;
    <code class="varname">p</code>++;
}</pre></li><li><p>
UTF-8 preserves the numerical ordering of Unicode character values.
Byte-wise string comparison functions, such as <code class="function">strcmp()</code>, will put
UTF-8 strings in the same order as their Unicode values.
</p></li><li><p>
However, you should be careful when using the string comparison
functions to order a set of UTF-8 strings. Unicode tries for a
universal encoding and orders characters in a way that's generically
correct, but it may not be correct for specific characters in specific
languages. (Because it follows ASCII, UTF-8 is correct for English.)
</p></li><li><p>
For European languages, UTF-8 generally yields more compact data
representations than would Unicode. Most of the characters in a string
can be encoded in a single byte. In many other cases, UTF-8 is no less
compact than Unicode.
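</p><p>
One rough way to check this for a particular string is to compare its
length in bytes with two bytes per character. The sketch below assumes a
null-terminated UTF-8 string; <code class="function">BeginsChar()</code>,
<code class="function">CountChars()</code>, and <code class="function">SixteenBitSize()</code> are
hypothetical names, and <code class="function">BeginsChar()</code> applies the same test as the
<code class="function">BEGINS_CHAR</code> macro.
</p><pre class="programlisting example cpp">#include &lt;cstddef&gt;
#include &lt;cstring&gt;

// Same test as the BEGINS_CHAR macro: true for any byte that starts a
// character, false for a byte inside a multibyte sequence.
static bool BeginsChar(unsigned char byte)
{
    return (byte &amp; 0xc0) != 0x80;
}

// Hypothetical helper: counts the characters (not bytes) in a
// null-terminated UTF-8 string.
size_t CountChars(const char* string)
{
    size_t count = 0;
    for (const char* p = string; *p != '\0'; p++) {
        if (BeginsChar(static_cast&lt;unsigned char&gt;(*p)))
            count++;
    }
    return count;
}

// Hypothetical helper: the size in bytes the same text would need if
// every character were stored in 16 bits.
size_t SixteenBitSize(const char* string)
{
    return 2 * CountChars(string);
}</pre><p>
For the string "caf\xc3\xa9" (café), strlen() reports 5 bytes,
<code class="function">CountChars()</code> reports 4 characters, and
<code class="function">SixteenBitSize()</code> reports 8 bytes. A string made up entirely of
two-byte characters comes out the same size either way, and each
three-byte character costs one byte more in UTF-8.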
</p></li></ul></div><div class="section"><div xmlns="" xmlns:d="http://docbook.org/ns/docbook" class="titlepage"><div><hr /><div xmlns:d="http://docbook.org/ns/docbook"><h3 xmlns="http://www.w3.org/1999/xhtml" class="title"><a id="TheInterfaceKit_Character_Encoding_UTF8_And_The_BeOS"></a>UTF-8 and the BeOS</h3></div></div></div><p>
The BeOS assumes UTF-8 encoding in most cases. For example, a <code class="constant">B_KEY_DOWN</code>
message reports the character that's mapped to the key the user pressed
as a UTF-8 value. That value is then passed as a string to
<a class="link" href="BView.html#BView_KeyDown" title="KeyDown()"><code class="methodname">KeyDown()</code></a>
along with the byte count:
</p><pre class="programlisting example cpp">virtual <span class="type">void</span> <code class="methodname">KeyDown</code>(<span class="type">const char*</span> <code class="parameter">bytes</code>, <span class="type">int32</span> <code class="parameter">numBytes</code>);</pre><p>
You can expect the bytes string to always contain at least one byte. And,
of course, you can test it for an ASCII value without caring whether it's
UTF-8:
</p><pre class="programlisting example cpp">if ( <code class="varname">bytes</code>[0] == <code class="constant">B_TAB</code> )
. . .</pre><p>
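When a key maps to a non-ASCII character, you may want its Unicode value
rather than the raw bytes. The following is a minimal sketch under stated
assumptions: <code class="function">DecodeKey()</code> is a hypothetical helper, not an Interface Kit
function. It handles only the one- to three-byte forms described earlier
and uses the standard <span class="type">int32_t</span> type so that it compiles without the
Kit headers.
</p><pre class="programlisting example cpp">#include &lt;cstdint&gt;

// Hypothetical helper: decodes the single UTF-8 character in `bytes`
// (1 to 3 bytes long) into its Unicode value. Returns -1 if the byte
// count and the bit patterns don't match one of those three forms.
int32_t DecodeKey(const char* bytes, int32_t numBytes)
{
    if (bytes == nullptr || numBytes &lt; 1)
        return -1;

    const unsigned char* b = reinterpret_cast&lt;const unsigned char*&gt;(bytes);

    // 0xxxxxxx
    if (numBytes == 1 &amp;&amp; (b[0] &amp; 0x80) == 0x00)
        return b[0];

    // 110xxxxx 10xxxxxx
    if (numBytes == 2 &amp;&amp; (b[0] &amp; 0xe0) == 0xc0 &amp;&amp; (b[1] &amp; 0xc0) == 0x80)
        return ((b[0] &amp; 0x1f) &lt;&lt; 6) | (b[1] &amp; 0x3f);

    // 1110xxxx 10xxxxxx 10xxxxxx
    if (numBytes == 3 &amp;&amp; (b[0] &amp; 0xf0) == 0xe0
        &amp;&amp; (b[1] &amp; 0xc0) == 0x80 &amp;&amp; (b[2] &amp; 0xc0) == 0x80) {
        return ((b[0] &amp; 0x0f) &lt;&lt; 12) | ((b[1] &amp; 0x3f) &lt;&lt; 6) | (b[2] &amp; 0x3f);
    }

    return -1;
}</pre><p>
Inside a <code class="methodname">KeyDown()</code> implementation, a call such as
<code class="function">DecodeKey</code>(<code class="parameter">bytes</code>, <code class="parameter">numBytes</code>)
would return 0xe9 for the two bytes 0xc3 0xa9 and 0x20ac for the three
bytes 0xe2 0x82 0xac.
</p><p>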
Similarly, objects that display text in the user interface—such as
window titles and button labels—expect to be passed UTF-8 encoded
strings, and hand you a UTF-8 string if you ask for the title or label.
These objects display text using a system font—either the system
plain font (<code class="varname">be_plain_font</code>) or the bold font (<code class="varname">be_bold_font</code>). The BFont
class allows other character encodings, which you may need to use in
limited circumstances from time to time, but the system fonts are
constrained to UTF-8 (<code class="constant">B_UNICODE_UTF8</code> encoding). The FontPanel preferences
application doesn't permit users to change the encoding of a system font.
</p><p>
Unicode and UTF-8 are documented in The Unicode Standard, Version 2.0,
published by Addison-Wesley. See that book for complete information on
Unicode and for a description of how UTF-8 encodes surrogate pairs.
</p></div></div><div id="footer"><hr /><div id="footerT">Prev: <a href="TheInterfaceKit_The_Coordinate_Space.html">The Coordinate Space</a>  Up: <a href="TheInterfaceKit_Overview.html">The Interface Kit</a>  Next: <a href="DragAndDrop.html">Drag And Drop</a> </div><div id="footerB"><div id="footerBL"><a href="TheInterfaceKit_The_Coordinate_Space.html" title="The Coordinate Space"><img src="./images/navigation/prev.png" alt="Prev" /></a> <a href="TheInterfaceKit_Overview.html" title="The Interface Kit"><img src="./images/navigation/up.png" alt="Up" /></a> <a href="DragAndDrop.html" title="Drag And Drop"><img src="./images/navigation/next.png" alt="Next" /></a></div><div id="footerBR"><div><a href="http://www.haiku-os.org"><img src="./images/People_24.png" alt="haiku-os.org" title="Visit The Haiku Website" /></a></div><div class="navighome" title="Home"><a accesskey="h" href="index.html"><img src="./images/navigation/home.png" alt="Home" /></a></div></div><div id="footerBC"><a href="http://www.access-company.com/home.html" title="ACCESS Co."><img alt="Access Company" src="./images/access_logo.png" /></a></div></div></div><div id="licenseFooter"><div id="licenseFooterBL"><a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/" title="Creative Commons License"><img alt="Creative Commons License" style="border-width:0" src="https://licensebuttons.net/l/by-nc-nd/3.0/88x31.png" /></a></div><div id="licenseFooterBR"><a href="./LegalNotice.html">Legal Notice</a></div><div id="licenseFooterBC"><span id="licenseText">This work is licensed under a
<a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/3.0/">Creative
Commons Attribution-Non commercial-No Derivative Works 3.0 License</a>.</span></div></div></body></html>