Unicode man page on DigitalUNIX

Unicode(5)							    Unicode(5)

NAME
       Unicode,	 unicode, universal.utf8, UCS-2, UCS-4, UTF-8, UTF-16, UTF-32,
       iso10646 - Support for the Unicode and ISO/IEC 10646 standards

DESCRIPTION
       The operating system provides locales and codeset converters that
       support the following standards:

       The Unicode Standard, Version 3.0, Unicode, Inc., 2000

       The Unicode Standard, Version 3.1, Unicode, Inc., 2001

       Information Technology-Universal Multiple-Octet Coded Character Set,
       ISO/IEC 10646:2001

	      The Basic Multilingual Plane defined by this standard is identi‐
	      cal with the main body of Unicode character encoding.

       These standards define generalized character encoding rules that can be
       applied to characters in most  native  language	scripts.  The  Unicode
       standard	 specifies a universal character set (UCS). Version 3.0 of the
       Unicode standard contains definitions for 49,194	 characters  and  also
       includes	 a  Private  Use  Area for vendor- or user-defined characters.
       Version 3.1 of the Unicode standard adds 44,946 new  character  defini‐
       tions,  incorporates  UTF-32  (32-bit  encoding) into the standard, and
       adds three new planes beyond the 16-bit codespace  of  Plane  0	(Basic
       Multilingual  Plane).  Plane  1 (Supplementary Multilingual Plane) con‐
       tains code positions U+10000 to U+1FFFF; Plane 2	 (Supplementary	 Ideo‐
       graphic	Plane)	contains  code	positions U+20000 to U+2FFFF; Plane 14
       (Supplementary Special-Purpose Plane) contains code  positions  U+E0000
       to U+EFFFF.
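
       Because each plane spans 65,536 code positions, the plane number of
       a code position is simply its value with the low 16 bits removed.
       The following Python sketch illustrates this arithmetic; the function
       name plane_of and the PLANE_NAMES table are hypothetical and are not
       part of the operating system.

```python
def plane_of(code_point: int) -> int:
    """Return the number of the plane that contains code_point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"U+{code_point:04X} is outside U+0000..U+10FFFF")
    # Each plane holds 0x10000 code positions, so dropping the low
    # 16 bits leaves the plane number.
    return code_point >> 16


# Plane names as listed in this reference page (Unicode 3.1).
PLANE_NAMES = {
    0: "Basic Multilingual Plane",
    1: "Supplementary Multilingual Plane",
    2: "Supplementary Ideographic Plane",
    14: "Supplementary Special-Purpose Plane",
}

for cp in (0x0041, 0x10000, 0x2FFFF, 0xE0001):
    plane = plane_of(cp)
    name = PLANE_NAMES.get(plane, "not named in this page")
    print(f"U+{cp:04X} is in Plane {plane} ({name})")
```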

       See  the	 Unicode web site at http://www.unicode.org/ for more informa‐
       tion on the Unicode standard.   See  the	 Unicode  ReadMe  document  in
       /usr/share/unidata/,  which describes the Unicode standard version cur‐
       rently supported on the operating system.

       The following list summarizes the main features of the Unicode
       character set:

       Characters have properties, such as base, numeric, spacing,
       combination, and directionality. The Unicode standard provides rules
       for ordering characters with different properties so that parsing of
       character sequences is unambiguous.

       The relationship between Unicode characters and the glyphs in the
       native language script that users see, type, or print is not
       necessarily one-to-one. A glyph may be mapped to a single abstract
       character or to a composed character. Conversely, more than one glyph
       can be mapped to a character.

       Certain sequences of Unicode characters in a text stream are
       transformed into other characters, called composed characters.

       The ISO 8859-1 character set occupies the first 256 code positions
       (and the ASCII character set the first 128 positions) of the UCS.
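
       Two of these features can be observed with Python's standard library.
       This is a sketch only: it assumes that canonical composition (the NFC
       normalization form provided by the unicodedata module) is an
       acceptable stand-in for the composition behavior described above,
       which this page does not specify in detail.

```python
import unicodedata

# A base letter followed by a combining mark is a two-character
# sequence that composes into a single character.
decomposed = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), f"U+{ord(composed):04X}")

# Every ISO 8859-1 byte value maps to the UCS code position with the
# same number, and ASCII covers the first 128 of those positions.
latin1_code_points = [ord(ch) for ch in bytes(range(256)).decode("latin-1")]
ascii_code_points = [ord(ch) for ch in bytes(range(128)).decode("ascii")]
print(latin1_code_points == list(range(256)))
print(ascii_code_points == list(range(128)))
```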

       The Unicode and ISO/IEC 10646 standards specify a universal  repertoire
       of  characters  that  can be used by all major languages and that allow
       character units to be processed for all languages under the same set of
       rules.  Therefore,  system support for the universal character set does
       not need to include multiple algorithms (one or more per language)  for
       converting  between  file  code and internal process code. However, the
       two different character sizes (16-bit or	 32-bit)  that	the  standards
       support	require	 different  parsing schemes for data input and output.
       Universal character encoding that an implementation  parses  in	16-bit
       units  (2  octets) is known as UCS-2. Universal character encoding that
       an implementation parses in 32-bit units (4 octets) is known as	UCS-4.
       This  is the canonical ISO/IEC 10646 encoding that is in use on systems
       that can support the larger data unit size.

       Because UCS-2 is a subset of  UTF-16,  the  operating  system  supports
       UCS-2  with  UTF-16  codeset  converters. The operating system supports
       UCS-4 with both codeset converters and  locales.	 (Keep	in  mind  that
       UCS-2  cannot  be used to encode characters outside of the Basic Multi‐
       lingual Plane.)
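
       The difference between the two unit sizes, and the UCS-2 restriction
       noted above, can be sketched in Python as follows. The function names
       are hypothetical, and big-endian byte order is assumed purely for
       illustration; this is not the operating system's converter code.

```python
import struct


def encode_ucs2_be(text: str) -> bytes:
    """Pack each character as one big-endian 16-bit unit (2 octets)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 cannot represent code positions beyond Plane 0.
            raise ValueError(
                f"U+{cp:04X} is outside the Basic Multilingual Plane"
            )
        out += struct.pack(">H", cp)
    return bytes(out)


def encode_ucs4_be(text: str) -> bytes:
    """Pack each character as one big-endian 32-bit unit (4 octets)."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


print(encode_ucs2_be("A\u20ac").hex())
print(encode_ucs4_be("A\U00010000").hex())
```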

       In terms of locales, the operating system  supports  both  Unicode  and
       dense  code.  The  two  types of locales differ in their manner of wide
       character encoding support. See l10n_intro(5) for information comparing
       the  two	 locale types and for information on switching between Unicode
       and dense code locales.

       The Unicode and ISO/IEC 10646 standards define a number of
       transformation formats for the universal character set (UTF-8 and
       UTF-32 are the preferred transformation formats for the operating
       system):

       UTF-8, the standard method for transforming UCS-4 process encoding
       into a sequence of 8-bit bytes and ensuring interchange transparency
       for characters in C0 code positions (0 to 31), the SPACE (32)
       character, and the DEL (127) character

              The operating system supports UTF-8 with both codeset converters
              and locales.

       UTF-7, an obsolete interchange format for environments that strip the
       eighth bit from each byte

              The operating system does not support UTF-7.

       UTF-1, an obsolete interchange format that is similar to UTF-8 but
       also ensures interchange transparency of characters in C1 code
       positions (128 to 159)

              The operating system does not support UTF-1.

       UTF-16, which uses the surrogate character extension technique defined
       by Version 2.0 and later of the Unicode standard and represents
       characters in 16-bit units

	      UTF-16 is a superset of UCS-2. As	 with  UCS-2,  UTF-16  encodes
	      characters in the range U+0000 to U+FFFF as single 16-bit units.
	      For characters in the range U+10000 to U+10FFFF,	UTF-16	trans‐
	      forms them into a surrogate pair. The result of this transforma‐
	      tion is that the high surrogate (the first of the	 pair)	is  in
	      the  range U+D800 to U+DBFF, while the low surrogate (the second
	      part of the pair) is in the range U+DC00 to  U+DFFF.  These  two
	      16-bit values represent a single character.
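
              The surrogate transformation described above is plain integer
              arithmetic: subtract 0x10000 to obtain a 20-bit value, then
              place the upper 10 bits in the high surrogate and the lower 10
              bits in the low surrogate. The following Python sketch shows
              both directions; the function names are hypothetical.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits: U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits: U+DC00..U+DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate must be in U+D800..U+DBFF")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate must be in U+DC00..U+DFFF")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D11E)
print(f"U+1D11E -> {high:04X} {low:04X}")
```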

	      Although UTF-16 does not support representation of the entire
	      UCS-4 code space (including private-use ranges for character
	      values above U+10FFFF), it does support all characters that
	      have currently been defined for the languages covered by both
	      standards.

	      Byte  orientation	 in file code can differ and, depending on the
	      platform on which the file was generated, can  be	 little-endian
	      (LE)  or	big-endian (BE).  UTF-16 uses a byte order mark (BOM),
	      which is not part of the file text data, to indicate byte orien‐
	      tation.  The  code point of the BOM is U+FEFF. The Unicode stan‐
	      dard also defines UTF-16LE and UTF-16BE, which are  specific  to
	      the little-endian and big-endian orientations, respectively, and
	      do not include a byte order mark.

	      The operating system supports UTF-16, UTF-16LE, and UTF-16BE
	      through codeset converters. The codeset converter name UCS-2 is
	      recognized as an alias for UTF-16*, but with a restricted
	      repertoire of characters.

					    Note

	      By  default,  the	 operating  system  uses  UTF-16  rather  than
	      UTF-16LE or UTF-16BE.

	      In an input file, the software first looks for a BOM. If	a  BOM
	      is  not  found,  the converter assumes UTF-16BE. This means that
	      you must explicitly specify UTF-16LE to the  converter  (convert
	      files manually) when UTF-16LE applies to an input file.
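
              The input rule in this note (look for a BOM first, otherwise
              assume big-endian) can be sketched in Python as follows. The
              function name is hypothetical, and the code imitates the
              documented behavior rather than reproducing the converter's
              implementation. Python's own "utf-16" codec falls back to the
              native byte order of the host when no BOM is present, so the
              big-endian default is spelled out explicitly here.

```python
import codecs


def decode_utf16_input(data: bytes) -> str:
    """Decode UTF-16 data using a leading BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):     # bytes FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):     # bytes FF FE
        return data[2:].decode("utf-16-le")
    # No BOM found: fall back to UTF-16BE, as the note describes.
    return data.decode("utf-16-be")


print(decode_utf16_input(b"\xfe\xff\x00A"))   # BOM says big-endian
print(decode_utf16_input(b"\xff\xfeA\x00"))   # BOM says little-endian
print(decode_utf16_input(b"\x00A"))           # no BOM, assumed big-endian
```

              Little-endian data without a BOM is misread under this rule,
              which is why UTF-16LE must be named explicitly in that case.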

              For an output file, the converter automatically inserts a BOM.
              This means that you must explicitly specify UTF-16LE or UTF-16BE
              (convert files manually) when you want conversion output to be
              UTF-16LE or UTF-16BE rather than UTF-16.

       UTF-32 allows character representation in 4-byte encoding units

	      UTF-32  is a restricted subset of UCS-4. UTF-32 is restricted in
	      values to the range U+0000 to U+10FFFF, which precisely  matches
	      the  range  of  character values defined by UTF-16. Like UTF-16,
	      UTF-32 does not support private-use ranges for character	values
	      above U+10FFFF.

	      UTF-32  uses  a BOM to indicate little-endian or big-endian byte
	      orientation.  The Unicode standard  also	defines	 UTF-32LE  and
	      UTF-32BE, which are specific to the little-endian and big-endian
	      orientations, respectively, and do not include a	BOM.  As  with
	      UTF-16,  big-endian  is the default byte order when a BOM is not
	      generated.

	      UTF-32 is almost the same as UCS-4, so you can use UCS-4 codeset
	      converters to process UTF-32. UCS-4 converter software includes
	      support for UTF-32, UTF-32LE, and UTF-32BE.
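
       To compare the supported transformation formats side by side, the
       following Python sketch encodes the same characters as UTF-8,
       UTF-16BE, and UTF-32BE using Python's built-in codecs. The helper
       name encodings is hypothetical; the codec names are Python's, not
       the operating system's converter names.

```python
def encodings(ch: str) -> dict[str, bytes]:
    """Encode one character in UTF-8, UTF-16BE, and UTF-32BE."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}


for label, ch in (("DEL", "\x7f"), ("EURO SIGN", "\u20ac"), ("U+10000", "\U00010000")):
    row = ", ".join(f"{name}={data.hex()}" for name, data in encodings(ch).items())
    print(f"{label}: {row}")

# UTF-8 leaves C0 positions (0 to 31), SPACE (32), and DEL (127)
# unchanged, which is the interchange transparency described above.
transparent = bytes(list(range(0, 33)) + [127])
print(transparent.decode("ascii").encode("utf-8") == transparent)
```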

   Codeset Conversion
       Codeset converters are available to convert data in all the major
       encoding formats that the operating system supports to and from UCS-2,
       UTF-16, UCS-4, and UTF-8. If the worldwide support subsets are
       installed on your system, you can enter the following commands to find
       the names of these converters:

              % cd /usr/lib/nls/loc/iconv
              % ls | grep UTF
              % ls | grep UCS

       Among  the converters listed, you will find some that handle conversion
       of data in the code-page format used on PC  systems.  See  code_page(5)
       for  more  information  about  converting between codeset and code-page
       formats. You can use all codeset converters with the iconv command  and
       associated library functions.

					Note

       The mapping of Korean Hangul characters changed between Version 1.1 and
       Version 2.0 of the Unicode standard. By	default,  UTF-16,  UCS-4,  and
       UTF-8 conversion assumes Version 2.0 character mapping for Hangul char‐
       acters. Therefore, if data is in Version 1.1  format,  you  must	 first
       convert	the  data to Version 2.0 format before converting from UTF-16,
       UCS-4, or UTF-8 to an entirely different format.

       The format of a codeset converter name is  from-codeset_to-codeset.  In
       converter  names, the Version 1.1 codeset formats for UCS-2, UCS-4, and
       UTF-8 are  represented  by  UNICODE-1-1,	 UNICODE-1-1-UCS-4,  and  UNI‐
       CODE-1-1-UTF-8,	respectively. The Version 2.0 codeset names are repre‐
       sented by UTF-16, UCS-4, and UTF-8.

       For example, if Korean data is currently in UCS-4 Version  1.1  format,
       the  data  must	first be processed by the UNICODE-1-1-UCS-4_UCS-4 con‐
       verter before being processed by the UCS-4_deckorean converter.
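
       The naming rule makes it possible to check that a sequence of
       converters links up, that is, that each converter's output codeset is
       the next converter's input codeset. The following Python sketch does
       this for the Korean example above. The function names are
       hypothetical, and the sketch assumes that codeset names themselves
       contain no underscore, which this page does not state.

```python
def split_converter_name(name: str) -> tuple[str, str]:
    """Split 'from-codeset_to-codeset' into (from_codeset, to_codeset)."""
    source, separator, target = name.partition("_")
    if not separator or not source or not target:
        raise ValueError(f"not a converter name: {name!r}")
    return source, target


def chain_is_consistent(names: list[str]) -> bool:
    """Return True if each converter feeds the next one in the list."""
    pairs = [split_converter_name(name) for name in names]
    return all(prev[1] == nxt[0] for prev, nxt in zip(pairs, pairs[1:]))


korean_chain = ["UNICODE-1-1-UCS-4_UCS-4", "UCS-4_deckorean"]
print(split_converter_name(korean_chain[0]))
print(chain_is_consistent(korean_chain))
```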

       See iconv_intro(5) for general information on codeset conversion.

   Locales
       The following locales use UTF-32 as internal processing code:

       universal.UTF-8

	      This  locale  is used by applications. It converts data in UTF-8
	      file format to UCS-4 process code and can be used	 to  test  any
	      UCS-4  character	to  determine  if it is included in one of the
	      following classes defined	 for  the  LC_CTYPE  category:	alnum,
	      alpha,  blank,  cntrl, digit, graph, lower, print, punct, space,
	      upper, or xdigit.

              In the universal.UTF-8 locale, the LC_MESSAGES, LC_MONETARY,
              LC_NUMERIC, and LC_TIME category definitions match those for the
              POSIX (C) locale.

       language_territory.UTF-8

	      These locales limit classification information to the characters
	      in  a  particular	 native	 language,  make country-specific data
	      available to the application, and assume file data follows UTF-8
	      encoding	rules.	The  operating system locales that support the
	      euro monetary symbol use either the UTF-8 or ISO8859-15 codeset.
	      See euro(5) for more information.

					    Note

	      The  X  locale database file used by applications running in the
	      universal.UTF-8, en_US.UTF-8, or Asian locales  (Chinese,	 Japa‐
	      nese,  or Korean) contains font definitions that include all the
	      fonts used with the operating system. This enables  applications
	      under  en_US.UTF-8  to display all the font characters installed
	      with Worldwide Language Software (WLS). Applications  under  the
	      Asian  locales  display  all  the font characters installed with
              WLS, except for ISO8859-2, -4, -5, -7, -8, -9, -15, and TACTIS.

       native_locale_name

              These locales are installed in the default Unicode path,
              /usr/i18n/lib/nls/ucsloc/, and use UTF-32 as internal processing
              code. However, they differ in the following ways:

              The file code is specified by the codeset portion (for example,
              ISO8859-1) of native_locale_name.

              Classification information is not provided for the full set of
              UTF-32 characters, but only for those in a particular native
              language (for example, French). Country-specific data is also
              available to the application.

              The LC_COLLATE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and
              LC_TIME category definitions match those defined in
              native_locale_name.

       native_locale_name@ucs4

	      These  locales  are  installed in /usr/i18n/lib/nls/loc/ and are
	      the  same	 as  the  native_locale_name  locales	installed   in
	      /usr/i18n/lib/nls/ucsloc/	 except	 that  they are not a complete
	      set of locales and will not be enhanced in  future  versions  of
	      the  operating  system. They are provided for compatibility with
	      existing applications. You cannot select @ucs4 locales from  the
	      CDE  login  menu;	 you  must specify the locale name in the LANG
	      environment variable.

       CDE desktop users can select locales  by	 choosing  names  followed  by
       (Unicode)  from the CDE language menu at session startup. In this case,
       the locale setting applies by default to all  applications  run	during
       the CDE session.

   Unicode Character Database
       For  the	 convenience  of  programmers, the source file for the Unicode
       character database is available on line. This source file  is  the  one
       used  to	 build	the  locales  provided	in  optional  software subsets
       included with the  operating  system  product.  When  the  locales  are
       installed  on  your  system, both the Unicode character database and an
       associated ReadMe file are also	installed  in  the  /usr/share/unidata
       directory.   The	 ReadMe	 file  discusses the character properties sup‐
       ported by Unicode.

   Font Support
       The operating system provides the following types of bitmap fonts for
       UCS characters:

       Public domain Unicode fonts:

              -etl-fixed-medium-r-normal--14-140-72-72-c-70-iso10646-1
              -etl-fixed-medium-r-normal--16-160-72-72-c-80-iso10646-1
              -etl-fixed-medium-r-normal--24-240-72-72-c-120-iso10646-1

       Composite fonts that the libfr_FGC font renderer creates by combining
       fonts available for other codesets

       Two sets of monospaced fonts (a 16x18 pixel set and a 24x24 pixel set)
       for UTF-8 locales with the following CDE font aliases (where -n is -1,
       -2, -3, -4, -5, -7, -8, -9, or -15):

              -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso8859-n@mono
              -dt-interface-*-*-*-*-*-*-*-*-*-*-*-iso10646-1@mono

       These  fonts  currently	cover  only a subset of the characters in UCS.
       Each of the ETL public domain fonts supports about 1000 characters, but
       does  not  include any characters for Chinese, Japanese, or Korean. The
       composite fonts created by the font renderer are	 generated  only  from
       fonts  available for the ISO 8859-1 (Latin-1) and ISO 8859-15 (Latin-9)
       codesets.

       See iso8859-1(5) and iso8859-15(5) for the names of fonts available for
       Latin-1 and Latin-9 characters. The Latin-9 fonts, which include glyphs
       for the euro character, provide the best support for the	 language_ter‐
       ritory.UTF-8 locales, which also support this character.

       See  i18n_printing(5)  and wwpsof(8) for information on printer support
       and converting bitmap font encoding to PostScript.

SEE ALSO
       Commands: locale(1), wwpsof(8)

       Others:	 ascii(5),    code_page(5),    iso8859-1(5),	iso8859-15(5),
       i18n_intro(5), i18n_printing(5), iconv_intro(5), l10n_intro(5)

       Using International Software

								    Unicode(5)

Copyright (c) for man pages and the logo by the respective OS vendor.
