Text::Unidecode man page on Hurd

Text::Unidecode(3pm)  User Contributed Perl Documentation Text::Unidecode(3pm)

NAME
       Text::Unidecode -- US-ASCII transliterations of Unicode text

SYNOPSIS
	 use utf8;
	 use Text::Unidecode;
	 print unidecode(
	   "\x{5317}\x{4EB0}\n"
	    # those are the Chinese characters for Beijing
	 );

	 # That prints: Bei Jing

DESCRIPTION
       It often happens that you have non-Roman text data in Unicode, but you
       can't display it -- usually because you're trying to show it to a user
       via an application that doesn't support Unicode, or because the fonts
       you need aren't accessible.  You could represent the Unicode characters
       as "???????" or "\15BA\15A0\1610...", but that's nearly useless to the
       user who actually wants to read what the text says.

       What Text::Unidecode provides is a function, "unidecode(...)", that
       takes Unicode data and tries to represent it in US-ASCII characters
       (i.e., the universally displayable characters between 0x00 and 0x7F).
       The representation is almost always an attempt at transliteration --
       i.e., conveying, in Roman letters, the pronunciation expressed by the
       text in some other writing system.  (See the example in the synopsis.)

       Unidecode's ability to transliterate is limited by two factors:

       * The amount and quality of data in the original
	   So if you have Hebrew data that has no vowel points in it, then
	   Unidecode cannot guess what vowels should appear in a
	   pronunciation.  S f y hv n vwls n th npt, y wn't gt ny vwls n th
	   tpt.  (This is a specific application of the general principle of
	   "Garbage In, Garbage Out".)

       * Basic limitations in the Unidecode design
	   Writing a real and clever transliteration algorithm for any single
	   language usually requires a lot of time, and at least a passable
	   knowledge of the language involved.  But Unicode text can convey
	   more languages than I could possibly learn (much less create a
	   transliterator for) in the entire rest of my lifetime.  So I put a
	   cap on how intelligent Unidecode could be, by insisting that it
	   support only context-insensitive transliteration.  That means miss‐
	   ing the finer details of any given writing system, while still
	   hopefully being useful.

       Unidecode, in other words, is quick and dirty.  Sometimes the output is
       not so dirty at all: Russian and Greek seem to work passably; and while
       Thaana (Divehi, AKA Maldivian) is a definitely non-Western writing sys‐
       tem, setting up a mapping from it to Roman letters seems to work pretty
       well.  But sometimes the output is very dirty: Unidecode does quite
       badly on Japanese and Thai.

       If you want a smarter transliteration for a particular language than
       Unidecode provides, then you should look for (or write) a translitera‐
       tion algorithm specific to that language, and apply it instead of (or
       at least before) applying Unidecode.
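
       As a minimal sketch of applying a language-specific step before
       Unidecode: the %german table and the german_then_unidecode() helper
       below are hypothetical names invented for this illustration, not part
       of Text::Unidecode, and the German rules are only one possible choice.

```perl
use strict;
use warnings;
use utf8;                 # this source contains literal non-ASCII characters
use Text::Unidecode;

# Hypothetical language-specific rules (illustration only): a common
# German convention spells umlauts with a following "e".
my %german = (
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ß' => 'ss',
);

sub german_then_unidecode {
    my ($text) = @_;
    # Apply the language-specific rules first ...
    $text =~ s/([äöüÄÖÜß])/$german{$1}/g;
    # ... then let Unidecode handle whatever is still non-ASCII.
    return scalar unidecode($text);
}

# The language-specific step turns this into "Mueller"; Unidecode then
# leaves the all-ASCII result alone.
print german_then_unidecode("Müller"), "\n";
```

       Because the first pass already produces ASCII for the characters it
       knows, Unidecode only decides the fate of characters the
       language-specific rules did not cover.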

       In other words, Unidecode's approach is broad (knowing about dozens of
       writing systems), but shallow (not being meticulous about any of them).

FUNCTIONS
       Text::Unidecode provides one function, "unidecode(...)", which is
       exported by default.  It can be used in a variety of calling contexts:

       "$out = unidecode($in);" # scalar context
	   This returns a copy of $in, transliterated.

       "$out = unidecode(@in);" # scalar context
	   This is the same as "$out = unidecode(join '', @in);"

       "@out = unidecode(@in);" # list context
	   This returns a list consisting of copies of @in, each transliter‐
	   ated.  This is the same as "@out = map scalar(unidecode($_)), @in;"

       "unidecode(@items);" # void context
       "unidecode(@bar, $foo, @baz);" # void context
	   Each item on input is replaced with its transliteration.  This is
	   the same as "for(@bar, $foo, @baz) { $_ = unidecode($_) }"
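
       The four calling contexts described above can be exercised side by
       side.  This is only a sketch; the variable names are arbitrary, and
       the one expected value in the comments is taken from this page rather
       than guaranteed for every version.

```perl
use strict;
use warnings;
use Text::Unidecode;

my @in = ("\x{5317}", "abc");

# Scalar context, one argument: returns a transliterated copy.
my $one = unidecode($in[0]);       # "Bei " according to this page

# Scalar context, several arguments: the inputs are joined first.
my $joined = unidecode(@in);

# List context: one transliterated copy per input item.
my @out = unidecode(@in);

# Void context: the arguments themselves are overwritten in place.
my @items = @in;
unidecode(@items);

print "[$_]\n" for $one, $joined, @out, @items;
```

       Only the void-context call modifies its arguments; the other forms
       leave @in untouched.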

       You should make a minimum of assumptions about the output of "unide‐
       code(...)".  For example, if you assume an all-alphabetic (Unicode)
       string passed to "unidecode(...)" will return an all-alphabetic string,
       you're wrong -- some alphabetic Unicode characters are transliterated
       as strings containing punctuation (e.g., the Armenian letter at 0x0539
       currently transliterates as "T`").

       However, these are the assumptions you can make:

       ·   Each character 0x0000 - 0x007F transliterates as itself.  That is,
	   "unidecode(...)" is 7-bit pure.

       ·   The output of "unidecode(...)" always consists entirely of US-ASCII
	   characters -- i.e., characters 0x0000 - 0x007F.

       ·   All Unicode characters translate to a sequence of (any number of)
	   characters that are newline ("\n") or in the range 0x0020-0x007E.
	   That is, no Unicode character translates to "\x01", for example.
	   (Although if you have a "\x01" on input, you'll get a "\x01" in
	   output.)

       ·   Yes, some transliterations produce a "\n" -- but just a few, and
	   only with good reason.  Note that the value of newline ("\n")
	   varies from platform to platform -- see perlport.

       ·   Some Unicode characters may transliterate to nothing (i.e., empty
	   string).

       ·   Very many Unicode characters transliterate to multi-character
	   sequences.  E.g., Han character 0x5317 transliterates as the four-
	   character string "Bei ".

       ·   Within these constraints, I may change the transliteration of char‐
	   acters in future versions.  For example, if someone convinces me
	   that the Armenian letter at 0x0539, currently transliterated as
	   "T`", would be better transliterated as "D", I may well make that
	   change.
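
       A short sketch of code that relies only on the assumptions listed
       above, and not on any particular transliteration, might look like
       this.  The sample input is the one from the synopsis.

```perl
use strict;
use warnings;
use Text::Unidecode;

# Every code point 0x00-0x7F should come back unchanged.
my $ascii = join '', map { chr } 0x00 .. 0x7F;
print "ASCII is preserved\n" if unidecode($ascii) eq $ascii;

# Any input should yield only characters in the range 0x00-0x7F.
my $out = unidecode("\x{5317}\x{4EB0}");
print "output is pure US-ASCII\n" if $out !~ /[^\x00-\x7F]/;

# Output length need not match input length: one input character may
# become several output characters, or none at all.
printf "%d input chars -> %d output chars\n", 2, length $out;
```

       Checks like these stay valid even if individual transliterations
       change in a later version.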

DESIGN GOALS AND CONSTRAINTS
       Text::Unidecode is meant to be a transliterator of last resort, to be
       used once you've decided that you can't just display the Unicode data
       as is, and once you've decided you don't have a more clever, language-
       specific transliterator available.  It transliterates context-insensi‐
       tively -- that is, a given character is replaced with the same US-ASCII
       (7-bit ASCII) character or characters, no matter what the surrounding
       characters are.
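
       To make "context-insensitive" concrete, here is a toy sketch of the
       idea.  It is not the module's actual implementation: the %table hash,
       the toy_translit() name, and the "[?]" placeholder for unknown
       characters are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical three-entry table, for illustration only.
my %table = (
    "\x{3B1}" => 'a',     # GREEK SMALL LETTER ALPHA
    "\x{3B2}" => 'b',     # GREEK SMALL LETTER BETA
    "\x{3B8}" => 'th',    # GREEK SMALL LETTER THETA
);

sub toy_translit {
    my ($text) = @_;
    # Each non-ASCII character is looked up on its own; its neighbours
    # never influence the result.  Characters missing from the table
    # become "[?]", a placeholder chosen for this sketch.
    $text =~ s/([^\x00-\x7F])/exists $table{$1} ? $table{$1} : '[?]'/ge;
    return $text;
}

print toy_translit("\x{3B1}\x{3B2}\x{3B8} ok"), "\n";   # prints "abth ok"
```

       The whole algorithm is a per-character table lookup, which is why it
       stays fast and simple, and also why it cannot capture any of the
       contextual factors discussed below.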

       The main reason I'm making Text::Unidecode work with only context-
       insensitive substitution is that it's fast, dumb, and straightforward
       enough to be feasible.  It doesn't tax my (quite limited) knowledge of
       world languages.  It doesn't require me writing a hundred lines of code
       to get the Thai syllabification right (and never knowing whether I've
       gotten it wrong, because I don't know Thai), or spending a year trying
       to get Text::Unidecode to use the ChaSen algorithm for Japanese, or
       trying to write heuristics for telling the difference between Japanese,
       Chinese, or Korean, so it knows how to transliterate any given Uni-Han
       glyph.  And moreover, context-insensitive substitution is still mostly
       useful, but still clearly couldn't be mistaken for authoritative.

       Text::Unidecode is an example of the 80/20 rule in action -- you get
       80% of the usefulness using just 20% of a "real" solution.

       A "real" approach to transliteration for any given language can involve
       such increasingly tricky contextual factors as these:

       The preceding / following character(s)
	   What a given symbol "X" means could depend on whether it's
	   followed by a consonant, or by a vowel, or by some diacritic
	   character.

       Syllables
	   A character "X" at end of a syllable could mean something different
	   from when it's at the start -- which is especially problematic when
	   the language involved doesn't explicitly mark where one syllable
	   stops and the next starts.

       Parts of speech
	   What "X" sounds like at the end of a word depends on whether that
	   word is a noun, or a verb, or what.

       Meaning
	   By semantic context, you can tell that this ideogram "X" means
	   "shoe" (pronounced one way) and not "time" (pronounced another),
	   and that's how you know to transliterate it one way instead of the
	   other.

       Origin of the word
	   "X" means one thing in loanwords and/or placenames (and derivatives
	   thereof), and another in native words.

       "It's just that way"
	   "X" normally makes the /X/ sound, except for this list of seventy
	   exceptions (and words based on them, sometimes indirectly).  Or:
	   you never can tell which of the three ways to pronounce "X" this
	   word actually uses; you just have to know which it is, so keep a
	   dictionary on hand!

       Language
	   The character "X" is actually used in several different languages,
	   and you have to figure out which you're looking at before you can
	   determine how to transliterate it.

       Out of a desire to avoid being mired in any of these kinds of contex‐
       tual factors, I chose to exclude all of them and just stick with con‐
       text-insensitive replacement.

TODO
       Things that need tending to are detailed in the TODO.txt file, included
       in this distribution.  Normal installs probably don't leave the
       TODO.txt lying around, but if nothing else, you can see it at
       http://search.cpan.org/search?dist=Text::Unidecode

MOTTO
       The Text::Unidecode motto is:

	 It's better than nothing!

       ...in both meanings: 1) seeing the output of "unidecode(...)" is better
       than just having all font-unavailable Unicode characters replaced with
       "?"'s, or rendered as gibberish; and 2) it's the worst, i.e., there's
       nothing that Text::Unidecode's algorithm is better than.

CAVEATS
       If you get really implausible nonsense out of "unidecode(...)", make
       sure that the input data really is a utf8 string.  See perlunicode.
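
       One common way to end up with input that is not a proper Unicode
       string is to pass raw encoded bytes.  The sketch below assumes the
       bytes are UTF-8 and uses the core Encode module to decode them first;
       the byte string is simply the UTF-8 encoding of the two characters
       from the synopsis.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::Unidecode;

# Bytes as they might arrive from a file or socket: the UTF-8 encoding
# of U+5317 U+4EB0.
my $bytes = "\xE5\x8C\x97\xE4\xBA\xB0";

# Decode to a Perl character string before transliterating.
my $chars = decode('UTF-8', $bytes);

printf "%d bytes decoded to %d characters\n", length $bytes, length $chars;
print unidecode($chars), "\n";
```

       Passing $bytes directly would hand "unidecode(...)" six unrelated
       Latin-1 range characters instead of two Han characters.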

THANKS
       Thanks to Harald Tveit Alvestrand, Abhijit Menon-Sen, and Mark-Jason
       Dominus.

SEE ALSO
       Unicode Consortium: http://www.unicode.org/

       Geoffrey Sampson.  1990.  Writing Systems: A Linguistic Introduction.
       ISBN: 0804717567

       Randall K. Barry (editor).  1997.  ALA-LC Romanization Tables:
       Transliteration Schemes for Non-Roman Scripts.  ISBN: 0844409405 [ALA
       is the American Library Association; LC is the Library of Congress.]

       Rupert Snell.  2000.  Beginner's Hindi Script (Teach Yourself Books).
       ISBN: 0658009109

COPYRIGHT AND DISCLAIMERS
       Copyright (c) 2001 Sean M. Burke. All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

       This program is distributed in the hope that it will be useful, but
       without any warranty; without even the implied warranty of mer‐
       chantability or fitness for a particular purpose.

       Much of Text::Unidecode's internal data is based on data from The
       Unicode Consortium, with which I am unaffiliated.

AUTHOR
       Sean M. Burke "sburke@cpan.org"

perl v5.8.8			  2008-03-01		  Text::Unidecode(3pm)