demoroniser man page on DragonFly

DEMORONISER(1)							DEMORONISER(1)

NAME
       demoroniser - correct moronic and gratuitously incompatible HTML gener‐
       ated by Microsoft applications

SYNOPSIS
       demoroniser [ -q ] [ -u ] [ -wcols ] [ infile ] [ outfile ]

DESCRIPTION
       Many slick, high-profile corporate Web sites I visited seemed to exhibit
       terrible grammar completely inconsistent with the obvious investment in
       graphics and design.  Apostrophes and quote marks were frequently omit‐
       ted,  and  every	 couple	 of  paragraphs	 words were run together which
       should have been separated by a punctuation mark of some kind.

       This remained a mystery to me until I wanted to convert a  presentation
       I'd  developed  in  1996	 using	Microsoft PowerPoint into a set of Web
       pages.  A friend was kind enough to run the presentation through Power‐
       Point's ``Save as HTML'' feature (I have abandoned all use of Microsoft
       products, so I did not have a current version of PowerPoint  which  in‐
       cludes  this  feature).	 When I got the PowerPoint-generated HTML back
       and viewed it in my browser, I discovered that it  contained  precisely
       the  same  grammatical errors I'd noted on so many Web sites, and which
       certainly were not present in my original presentation.

       A little detective work revealed that, as is usually the case when  you
       encounter something shoddy in the vicinity of a computer, Microsoft in‐
       competence and gratuitous incompatibility were to blame.	 Western  lan‐
       guage  HTML  documents  are written in the ISO 8859-1 Latin-1 character
       set, with a specified set of escapes for special characters.   Blithely
       ignoring	 this  prescription, as usual, Microsoft use their own "exten‐
       sion" to Latin-1, in which a variety of characters which do not	appear
       in Latin-1 are inserted in the range 0x80 through 0x9F--this having the
       merit of being incompatible with both Latin-1 and  Unicode,  which  re‐
       serve this region for additional control characters.

       These  characters  include  open and close single and double quotes, em
       and en dashes, an ellipsis and a variety of other  things  you've  been
       dying for, such as a capital Y umlaut and a florin symbol.  Well, okay,
       you say, if Microsoft want to have their own little incompatible	 char‐
       acter set, why not?  Because it doesn't stop there--in their inimitable
       fashion (who would want to?)--they aggressively pollute the  Web	 pages
       of unknowing and innocent victims worldwide with these characters, with
       the result that the owners of these pages look like  semi-literate  mo‐
       rons  when their pages are viewed on non-Microsoft platforms (or on Mi‐
       crosoft platforms, for that matter, if the user	has  selected  as  the
       browser's  font one of the many TrueType fonts which do not include the
       incompatible Microsoft characters).
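
       As a rough illustration of the kind of substitution involved, here is
       a minimal Python sketch that maps a few of these bytes to plain ASCII
       stand-ins.  It is not the demoroniser source (which is Perl): the
       function name, the table and the chosen replacements are assumptions
       made for this example, and the byte values are those of the
       Windows-1252 code page.

```python
# Hypothetical replacement table: Windows-1252 byte value -> ASCII stand-in.
# The substitutions demoroniser itself makes may differ.
REPLACEMENTS = {
    0x85: "...",  # ellipsis
    0x91: "`",    # opening single quote
    0x92: "'",    # closing single quote (apostrophe)
    0x93: '"',    # opening double quote
    0x94: '"',    # closing double quote
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}


def replace_windows_chars(raw: bytes) -> str:
    """Return raw as text with the bytes listed above replaced."""
    # Latin-1 maps every byte to the code point of the same value,
    # so decoding cannot fail and ord(ch) is the original byte.
    text = raw.decode("latin-1")
    return "".join(REPLACEMENTS.get(ord(ch), ch) for ch in text)


print(replace_windows_chars(b"\x93Halt,\x94 he cried, \x93this is the police!\x94"))
```

       Bytes that are not in the table pass through unchanged, so ordinary
       Latin-1 text is left alone.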

       You see, ``state of the art'' Microsoft	Office	applications  sport  a
       nifty  feature called ``smart quotes.''	(Rule of thumb--every time Mi‐
       crosoft use the word ``smart,'' be on the lookout for something	dumb).
       This  feature  is on by default in both Word and PowerPoint, and can be
       disabled only by finding the little box buried among the dozens of
       bewildering option panels these products contain.  If it is enabled
       and you type the string

		       "Halt," he cried, "this is the police!"

       ``smart quotes'' transforms the ASCII  quote  characters	 automatically
       into the incompatible Microsoft opening and closing quotes.  ASCII
       single quotes are similarly transformed (even though ASCII already
       contains apostrophe and single open quote characters), and double
       hyphens are replaced by the incompatible em dash	 symbol.   What	 other
       horrors	occur, I know not.  If the user notices this happening at all,
       their reaction might be ``Thank you Billy-boy--that looks ever so  much
       nicer,''	 not knowing they've been set up to look like a moron to folks
       all over the world.

       You see, when you export a document as text for hand-editing into HTML,
       or avail yourself of the ``Save as HTML'' features in newer versions of
       Office applications, these incompatible, Microsoft-specific  characters
       remain  in  place.   When viewed by a user on a non-Microsoft platform,
       they will not be displayed properly--most browsers seem	to  just  drop
       them,  as  opposed  to  including  a symbol indicating an undisplayable
       character.  Hence, the apparently ungrammatical text, which the	author
       of the page, editing on a Microsoft platform, will never be aware of.
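
       The effect can be reproduced with a short Python sketch.  It assumes,
       for illustration only, a browser that silently discards every byte
       in the range 0x80 through 0x9F; the function name is hypothetical
       and real browsers vary in what they do with these bytes.

```python
def drop_windows_chars(raw: bytes) -> str:
    """Simulate a browser that silently discards bytes 0x80-0x9F."""
    kept = bytes(b for b in raw if not 0x80 <= b <= 0x9F)
    return kept.decode("latin-1")


# 0x92 is the "smart" apostrophe and 0x97 the em dash in Windows-1252.
print(drop_windows_chars(b"It doesn\x92t stop there\x97in their fashion"))
```

       The apostrophe vanishes and the two words on either side of the dash
       run together, which is exactly the kind of damage described above.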

       Having  no desire to hand-edit the HTML for a long presentation to cor‐
       rect a raft of Microsoft-induced incompatibilities, I wrote a Perl pro‐
       gram,  the  demoroniser, to transform Microsoft's ``junk HTML'' into at
       least a starting point for something I'd	 consider  presentable	on  my
       site.   In addition to replacing the incompatible characters with HTML-
       compliant equivalents wherever possible (a few rarely-encountered char‐
       acters  which can't be translated result in warning messages if encoun‐
       tered), the following sloppy or downright wrong HTML is corrected.

       ·	 The missing semicolon at the end of numeric character escapes
		 (&#61 in place of &#61;) is supplied.

       ·	 Numeric renderings of special characters (&#60; &#62; &#38;)
		 are replaced with readable equivalents (&lt; &gt; &amp;).

       ·	 Unquoted attribute values in <table> tags which contain
		 non-alphanumeric characters are quoted.

       ·	 PowerPoint's  mis-nesting of <font> and <strong> tags is cor‐
		 rected.

       ·	 PowerPoint's boneheaded use of <ul> and </ul> tags to	accom‐
		 plish	paragraph  breaks is corrected and the proper <p> tags
		 inserted.

       ·	 Missing <tr> tags in text-only slides are inserted.

       ·	 Nugatory </p> tags are removed.

       ·	 Unmatched <li> tags in headings are removed.

       ·	 Idiot ``paragraph-long	 lines''  are  broken  into  something
		 suitable for editing with a normal text editor.
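
       The first two corrections in the list above can be sketched in a few
       lines of Python.  This is an illustration under stated assumptions,
       not the program's actual Perl code: the function name and the
       regular expression are hypothetical, and only the three special
       characters < > & are handled.

```python
import re

# Readable names for the numeric escapes of the characters < > &.
NAMED = {"&#60;": "&lt;", "&#62;": "&gt;", "&#38;": "&amp;"}


def fix_escapes(html: str) -> str:
    # Supply the semicolon when a numeric escape is not followed by one.
    # The lookahead also rejects a following digit so that the pattern
    # cannot match only part of a longer number.
    html = re.sub(r"&#(\d+)(?![\d;])", r"&#\1;", html)
    # Replace the numeric forms with their readable equivalents.
    for numeric, named in NAMED.items():
        html = html.replace(numeric, named)
    return html
```

       The semicolon is supplied first so that an escape written without
       one is still recognised by the second step.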

OPTIONS
       -q	 Quiet: don't print warnings for untranslated characters.

       -u	 Print how-to-call information and a summary of options.

       -wcols	 Wrap  output  lines  at  column  cols.	 By default, lines are
		 wrapped at column 72.	A cols	specification  of  0  disables
		 line  wrapping.   demoroniser attempts to wrap lines so as to
		 preserve their meaning.  Lines	 are  broken  at  white	 space
		 whenever  possible.   If  this	 cannot be done, a line longer
		 than the cols specification will remain in the output HTML.
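
       The wrapping rule can be approximated with Python's standard
       textwrap module.  This is only a sketch of the behaviour described
       above: the function name is hypothetical, and unlike demoroniser it
       makes no attempt to preserve the meaning of the HTML it breaks.

```python
import textwrap


def wrap_line(line: str, cols: int = 72) -> list:
    """Break line at white space so that each piece fits in cols columns."""
    if cols == 0:  # a cols value of 0 disables wrapping
        return [line]
    return textwrap.wrap(
        line,
        width=cols,
        break_long_words=False,  # an unbreakable word stays over-long
        break_on_hyphens=False,  # break only at white space
    )
```

       A word longer than cols is left whole on its own line, which matches
       the statement that an over-long line may remain in the output.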

BUGS
       demoroniser is a Perl script.  In order to use it, you must  have  Perl
       installed  on  your  system.  demoroniser was developed using Perl 4.0,
       patch level 36.

FILES
       If no outfile is specified, output is written to standard  output.   If
       no infile is specified, input is read from standard input.

SEE ALSO
       perl(1)

AUTHOR
	    John Walker
	    WWW:    http://www.fourmilab.ch/

       This program is in the public domain.

4th Berkeley Distribution	  16 SEP 2003			DEMORONISER(1)