Unicode Demystified : A Practical Programmer's Guide to the Encoding Standard

  • Binding: Paperback, 853 p.
  • Language: English
  • Product code: 9780201700527
  • DDC classification: 005.72

Full Description


Unicode is a critical enabling technology for developers who want to internationalize applications for global environments. Until now, however, developers have had to turn to the standards documents for crucial information on using Unicode. In Unicode Demystified, one of IBM's leading software internationalization experts covers every key aspect of Unicode development, offering practical examples and detailed guidance for integrating Unicode 3.0 into virtually any application or environment. Writing from a developer's point of view, Rich Gillam presents a systematic introduction to Unicode's goals, evolution, and key elements. He illuminates the Unicode standards documents with discussions of character properties, the Unicode Character Database, storage formats, character sequences, Unicode normalization, character encoding conversion, and more. He presents practical techniques for text processing, locating text boundaries, searching, sorting, rendering text, accepting user input, and other key development tasks. Along the way, he offers specific guidance on integrating Unicode with other technologies, including Java, JavaScript, XML, and the Web. The book is intended for every developer who is building internationalized applications, internationalizing existing applications, or interfacing with systems that already use Unicode.

Contents

Preface.

I. UNICODE IN ESSENCE: AN ARCHITECTURAL OVERVIEW OF THE UNICODE STANDARD.

1. Language, Computers, and Unicode. What Unicode Is. What Unicode Isn't. The Challenge of Representing Text in Computers. What This Book Does. How This Book Is Organized. Part I: Unicode in Essence. Part II: Unicode in Depth. Part III: Unicode in Action.

2. A Brief History of Character Encoding. Prehistory. The Telegraph and Morse Code. The Teletypewriter and Baudot Code. Other Teletype and Telegraphy Codes. FIELDATA and ASCII. Hollerith and EBCDIC. Single-Byte Encoding Systems. Eight-Bit Encoding Schemes and the ISO 2022 Model. ISO 8859. Other 8-Bit Encoding Schemes. Character Encoding Terminology. Multiple-Byte Encoding Systems. East Asian Coded Character Sets. Character Encoding Schemes for East Asian Coded Character Sets. Other East Asian Encoding Systems. ISO 10646 and Unicode. How the Unicode Standard Is Maintained.

3. Architecture: Not Just a Pile of Code Charts. The Unicode Character-Glyph Model. Character Positioning. The Principle of Unification. Alternate-Glyph Selection. Multiple Representations. Flavors of Unicode. Character Semantics. Unicode Versions and Unicode Technical Reports. Unicode Standard Annexes. Unicode Technical Standards. Unicode Technical Reports. Draft and Proposed Draft Technical Reports. Superseded Technical Reports. Unicode Versions. Unicode Stability Policies. Arrangement of the Encoding Space. Organization of the Planes. The Basic Multilingual Plane. The Supplementary Planes. Noncharacter Code Point Values. Conforming to the Standard. General. Producing Text as Output. Interpreting Text from the Outside World. Passing Text Through. Drawing Text on the Screen or Other Output Devices. Comparing Character Strings. Summary.

4. Combining Character Sequences and Unicode Normalization. How Unicode Non-spacing Marks Work. Dealing Properly with Combining Character Sequences. Canonical Decompositions. Canonical Accent Ordering. Double Diacritics. Compatibility Decompositions. Singleton Decompositions. Hangul. Unicode Normalization Forms. Grapheme Clusters.

5. Character Properties and the Unicode Character Database. Where to Get the Unicode Character Database. The UNIDATA Directory. UnicodeData.txt. PropList.txt. General Character Properties. Standard Character Names. Algorithmically Derived Names. Control-Character Names. ISO 10646 Comments. Block and Script. General Category. Letters. Marks. Numbers. Punctuation. Symbols. Separators. Miscellaneous. Other Categories. Properties of Letters. SpecialCasing.txt. CaseFolding.txt. Properties of Digits, Numerals, and Mathematical Symbols. Layout-Related Properties. Bidirectional Layout. Mirroring. Arabic Contextual Shaping. East Asian Width. Line-Breaking Property. Normalization-Related Properties. Decomposition. Decomposition Type. Combining Class. Composition Exclusion List. Normalization Test File. Derived Normalization Properties. Grapheme Cluster-Related Properties. Unihan.txt.

6. Unicode Storage and Serialization Formats. A Historical Note. UTF-32. UTF-16 and the Surrogate Mechanism. Endian-ness and the Byte Order Mark. UTF-8. CESU-8. UTF-EBCDIC. UTF-7. Standard Compression Scheme for Unicode. BOCU. Detecting Unicode Storage Formats.

II. UNICODE IN DEPTH: A GUIDED TOUR OF THE CHARACTER REPERTOIRE.

7. Scripts of Europe. The Western Alphabetic Scripts. The Latin Alphabet. The Latin-1 Characters. The Latin Extended A Block. The Latin Extended B Block. The Latin Extended Additional Block. The International Phonetic Alphabet. Diacritical Marks. Isolated Combining Marks. Spacing Modifier Letters. The Greek Alphabet. The Greek Block. The Greek Extended Block. The Coptic Alphabet. The Cyrillic Alphabet. The Cyrillic Block. The Cyrillic Supplementary Block. The Armenian Alphabet. The Georgian Alphabet.

8. Scripts of the Middle East. Bidirectional Text Layout. The Unicode Bidirectional Layout Algorithm. Inherent Directionality. Neutrals. Numbers. The Left-to-Right and Right-to-Left Marks. The Explicit Override Characters. The Explicit Embedding Characters. Mirroring Characters. Line and Paragraph Boundaries. Bidirectional Text in a Text-Editing Environment. The Hebrew Alphabet. The Hebrew Block. The Arabic Alphabet. The Arabic Block. Joiners and Non-joiners. The Arabic Presentation Forms B Block. The Arabic Presentation Forms A Block. The Syriac Alphabet. The Syriac Block. The Thaana Script. The Thaana Block.

9. Scripts of India and Southeast Asia. Devanagari. The Devanagari Block. Bengali. The Bengali Block. Gurmukhi. The Gurmukhi Block. Gujarati. The Gujarati Block. Oriya. The Oriya Block. Tamil. The Tamil Block. Telugu. The Telugu Block. Kannada. The Kannada Block. Malayalam. The Malayalam Block. Sinhala. The Sinhala Block. Thai. The Thai Block. Lao. The Lao Block. Khmer. The Khmer Block. Myanmar. The Myanmar Block. Tibetan. The Tibetan Block. The Philippine Scripts.

10. Scripts of East Asia. The Han Characters. Variant Forms of Han Characters. Han Characters in Unicode. The CJK Unified Ideographs Area. The CJK Unified Ideographs Extension A Area. The CJK Unified Ideographs Extension B Area. The CJK Compatibility Ideographs Block. The CJK Compatibility Ideographs Supplement Block. The Kangxi Radicals Block. The CJK Radicals Supplement Block. Ideographic Description Sequences. Bopomofo. The Bopomofo Block. The Bopomofo Extended Block. Japanese. The Hiragana Block. The Katakana Block. The Katakana Phonetic Extensions Block. The Kanbun Block. Korean. The Hangul Jamo Block. The Hangul Compatibility Jamo Block. The Hangul Syllables Area. Half-width and Full-width Characters. The Half-width and Full-width Forms Block. Vertical Text Layout. Ruby. The Interlinear Annotation Characters. Yi. The Yi Syllables Block. The Yi Radicals Block.

11. Scripts from Other Parts of the World. Mongolian. The Mongolian Block. Ethiopic. The Ethiopic Block. Cherokee. The Cherokee Block. Canadian Aboriginal Syllables. The Unified Canadian Aboriginal Syllabics Block. Historical Scripts. Runic. Ogham. Old Italic. Gothic. Deseret.

12. Numbers, Punctuation, Symbols, and Specials. Numbers. Western Positional Notation. Alphabetic Numerals. Roman Numerals. Han Characters as Numerals. Other Numeration Systems. Numeric Presentation Forms. National and Nominal Digit Shapes. Punctuation. Script-Specific Punctuation. The General Punctuation Block. The CJK Symbols and Punctuation Block. Spaces. Dashes and Hyphens. Quotation Marks, Apostrophes, and Similar-Looking Characters. Paired Punctuation. Dot Leaders. Bullets and Dots. Special Characters. Line and Paragraph Separators. Segment and Page Separators. Control Characters. Characters That Control Word Wrapping. Characters That Control Glyph Selection. The Grapheme Joiner. Bidirectional Formatting Characters. Deprecated Characters. Interlinear Annotation. The Object Replacement Character. The General Substitution Character. Tagging Characters. Noncharacters. Symbols Used with Numbers. Numeric Punctuation. Currency Symbols. Unit Markers. Math Symbols. Mathematical Alphanumeric Symbols. Other Symbols and Miscellaneous Characters. Musical Notation. Braille. Other Symbols. Presentation Forms. Miscellaneous Characters.

III. UNICODE IN ACTION: IMPLEMENTING AND USING THE UNICODE STANDARD.

13. Techniques and Data Structures for Handling Unicode Text. Useful Data Structures. Testing for Membership in a Class. The Inversion List. Performing Set Operations on Inversion Lists. Mapping Single Characters to Other Values. Inversion Maps. The Compact Array. Two-Level Compact Arrays. Mapping Single Characters to Multiple Values. Exception Tables. Mapping Multiple Characters to Other Values. Exception Tables and Key Closure. Tries as Exception Tables. Tries as the Main Lookup Table. Single Versus Multiple Tables.

14. Conversions and Transformations. Converting Between Unicode Encoding Forms. Converting Between UTF-16 and UTF-32. Converting Between UTF-8 and UTF-32. Converting Between UTF-8 and UTF-16. Implementing Unicode Compression. Unicode Normalization. Canonical Decomposition. Compatibility Decomposition. Canonical Composition. Optimizing Unicode Normalization. Testing Unicode Normalization. Converting Between Unicode and Other Standards. Getting Conversion Information. Converting Between Unicode and Single-Byte Encodings. Converting Between Unicode and Multibyte Encodings. Other Types of Conversions. Handling Exceptional Conditions. Dealing with Differences in Encoding Philosophy. Choosing a Converter. Line-Break Conversion. Case Mapping and Case Folding. Case Mapping on a Single Character. Case Mapping on a String. Case Folding. Transliteration.

15. Searching and Sorting. The Basics of Language-Sensitive String Comparison. Multilevel Comparisons. Ignorable Characters. French Accent Sorting. Contracting Character Sequences. Expanding Characters. Context-Sensitive Weighting. Putting It All Together. Other Processes and Equivalences. Language-Sensitive Comparison on Unicode Text. Unicode Normalization. Reordering. A General Implementation Strategy. The Unicode Collation Algorithm. The Default UCA Sort Order. Alternate Weighting. Optimizations and Enhancements. Language-Insensitive String Comparison. Sorting. Collation Strength and Secondary Keys. Exposing Sort Keys. Minimizing Sort Key Length. Searching. The Boyer-Moore Algorithm. Using the Boyer-Moore Algorithm with Unicode. "Whole Word" Searches. Using Unicode with Regular Expressions.

16. Rendering and Editing. Line Breaking. Line-Breaking Properties. Implementing Boundary Analysis with Pair Tables. Implementing Boundary Analysis with State Machines. Performing Boundary Analysis Using a Dictionary. A Few More Thoughts on Boundary Analysis. Performing Line Breaking. Line Layout. Glyph Selection and Positioning. Font Technologies. Poor Man's Glyph Selection. Glyph Selection and Placement in AAT. Glyph Selection and Placement in OpenType. Special-Purpose Rendering Technology. Compound and Virtual Fonts. Special Text-Editing Considerations. Optimizing for Editing Performance. Accepting Text Input. Handling Arrow Keys. Handling Discontiguous Selection. Handling Multiple-Click Selection.

17. Unicode and Other Technologies. Unicode and the Internet. The W3C Character Model. XML. HTML and HTTP. URLs and Domain Names. Mail and Usenet. Unicode and Programming Languages. The Unicode Identifier Guidelines. Java. C and C++. JavaScript and JScript. Visual Basic. Perl. ICU. Unicode and Operating Systems. Microsoft Windows. MacOS. Varieties of UNIX.

Conclusion. Glossary. Bibliography. Index.