pstotext man page on DigitalUNIX

pstotext(1)							   pstotext(1)

NAME
       pstotext - extract ASCII text from a PostScript or PDF file

SYNTAX
       pstotext [option|pathname]...

       where option includes:

       -cork
       -landscape
       -landscapeOther
       -portrait
       -
       -output file
       -gs command
       -debug
       -bboxes

DESCRIPTION
       pstotext reads one or more PostScript or PDF files and writes to
       standard output a representation of the plain text that would be
       displayed if the file were printed. As described in the DETAILS
       section below, this representation is only an approximation.
       Nevertheless, it is often useful for information retrieval (e.g.,
       running grep(1) or building a full-text index) or for recovering the
       text from a PostScript file whose source you have lost.

       pstotext calls Ghostscript and requires Aladdin Ghostscript version
       3.51 or newer. Ghostscript must be invokable on the current search
       path as gs. Alternatively, you can use the -gs option to specify the
       command (pathname and options) to run Ghostscript. For example, on
       Windows you might use -gs "c:\gs\gswin32c.exe -Ic:\gs;c:\gs\fonts".

       pstotext reads and processes its command line from left to right,
       ignoring the case of options. When it encounters a pathname, it opens
       the file and expects to find a PostScript job or PDF document to
       process. The option - means to read and process a PostScript job from
       standard input. If no - or pathname arguments are encountered,
       pstotext reads a PostScript job from standard input. (PDF documents
       require random access and hence cannot be read from standard input.)
       You can use the -output option to specify an output file; because
       arguments are processed in order, -output must appear before the
       input file it applies to. Otherwise pstotext writes to standard
       output.
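
       As a rough illustration of the left-to-right rule, the Python sketch
       below builds an argument list with every option ahead of the input
       pathnames. The function build_pstotext_command and its parameters are
       hypothetical and not part of pstotext; only the option names are taken
       from this page. Placing all options first is an assumption chosen to
       be safe under left-to-right processing.

```python
from typing import List, Optional, Sequence


def build_pstotext_command(
    inputs: Sequence[str],
    output: Optional[str] = None,
    gs_command: Optional[str] = None,
    orientation: Optional[str] = None,
    cork: bool = False,
) -> List[str]:
    """Return an argv list for pstotext (hypothetical helper).

    An empty `inputs` sequence is rendered as "-", i.e. standard input.
    """
    if orientation not in (None, "portrait", "landscape", "landscapeOther"):
        raise ValueError(f"unknown orientation: {orientation!r}")

    argv = ["pstotext"]
    if gs_command is not None:
        argv += ["-gs", gs_command]      # command used to run Ghostscript
    if cork:
        argv.append("-cork")             # Cork (T1) encoding for dvips output
    if orientation is not None:
        argv.append("-" + orientation)
    if output is not None:
        argv += ["-output", output]      # must precede the input pathnames
    argv += list(inputs) if inputs else ["-"]
    return argv
```

       The resulting list could be passed to subprocess.run. PDF documents
       would have to be given as pathnames, since they cannot be read from
       standard input.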

       The option -cork is only relevant for PostScript files produced by
       dvips from TeX or LaTeX documents; it tells pstotext to use the Cork
       encoding (known as T1 in LaTeX) rather than the old TeX text encoding
       (known as OT1 in LaTeX). Unfortunately, files produced by dvips do not
       indicate which font encodings were used, so pstotext cannot choose
       between them on its own.

       The options -landscape and -landscapeOther should be used for documents
       that must be rotated 90 degrees clockwise or counterclockwise,  respec‐
       tively, in order to be readable.

       The options -debug and -bboxes are mostly of use for the maintainers of
       pstotext.  -debug shows Ghostscript output and error messages.  -bboxes
       outputs one word per line with bounding box information.

DETAILS
       pstotext does its work by telling Ghostscript to load a PostScript
       library that causes it to write to its standard output information
       about each string rendered by a PostScript job or PDF document. This
       information includes the characters of the string and enough
       additional information to approximate the string's bounding rectangle.
       pstotext post-processes this information and outputs a sequence of
       words delimited by space, newline, and formfeed.

       pstotext outputs words in the same sequence as they are rendered by the
       document. This usually, but not always, follows the order in which a
       human would read the words on a page. Within this sequence, words are
       separated by either space or newline depending on whether or not they
       fall on the same line. Each page is terminated with a formfeed. If you
       use the incorrect option from the set {-portrait, -landscape,
       -landscapeOther}, pstotext is likely to substitute newline for space.
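
       On the consuming side, the delimiters described above make the output
       easy to split back into pages, lines, and words. The following Python
       sketch shows one way to do so; the function name split_pstotext_output
       is hypothetical, and dropping empty lines is an assumption of the
       sketch rather than documented behavior.

```python
from typing import List


def split_pstotext_output(text: str) -> List[List[List[str]]]:
    """Split pstotext output into pages -> lines -> words (a sketch)."""
    pages = text.split("\f")
    # Each page is terminated by a formfeed, so the piece after the last
    # formfeed is normally empty and is dropped here.
    if pages and not pages[-1].strip():
        pages.pop()

    result = []
    for page in pages:
        lines = []
        for line in page.split("\n"):
            words = [w for w in line.split(" ") if w]
            if words:                 # assumption: skip empty lines
                lines.append(words)
        result.append(lines)
    return result
```

       When reading the output as bytes, decoding it with "latin-1" first
       matches the ISO 8859-1 output described later in this section.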

       A PostScript job or PDF document often renders one word as several
       strings in order to get correct spacing between particular pairs of
       characters. pstotext does its best to assemble these strings back into
       words, using a simple heuristic: strings separated by a distance of
       less than 0.3 times the minimum of the average character widths in the
       two strings are considered to be part of the same word. Note that this
       typically causes leading and trailing punctuation characters to be
       included with a word.
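
       The joining rule can be sketched in Python as follows. This is not
       the pstotext source: the Fragment type, its fields, and the function
       join_fragments are hypothetical, and the sketch assumes that all
       strings lie on one horizontal line and arrive in left-to-right order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:
    """One rendered string with its horizontal extent (hypothetical type)."""
    text: str
    x_start: float
    x_end: float

    @property
    def avg_char_width(self) -> float:
        return (self.x_end - self.x_start) / max(len(self.text), 1)


def join_fragments(fragments: List[Fragment], factor: float = 0.3) -> List[str]:
    """Merge adjacent fragments whose gap is below the threshold."""
    words: List[str] = []
    prev: Optional[Fragment] = None
    for frag in fragments:
        if prev is not None:
            gap = frag.x_start - prev.x_end
            # 0.3 times the smaller of the two average character widths
            threshold = factor * min(prev.avg_char_width, frag.avg_char_width)
            if gap < threshold:
                words[-1] += frag.text   # same word: append to the last one
                prev = frag
                continue
        words.append(frag.text)          # gap too large: start a new word
        prev = frag
    return words
```

       With an average character width of 6 units in each string, a gap of
       0.5 units falls below the threshold of 1.8 and the strings are
       joined, while a gap of 5.5 units starts a new word.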

       The PostScript language provides a flexible encoding scheme by which
       character codes in strings select specific characters (symbols), so a
       PostScript job is free to use any character code. On the other hand,
       pstotext always translates to the ISO 8859-1 (Latin-1) character code,
       which is an extension to ASCII covering most of the Western European
       languages. When a character isn't present in ISO 8859-1, pstotext uses
       a sequence of characters, e.g., "---" for em dash or "A\226" for
       Abreve. pstotext can be fooled by a font whose Encoding vector doesn't
       follow Adobe's conventions, but it contains heuristics allowing it to
       handle a wide variety of misbehaving fonts.
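
       The fallback idea can be illustrated with a short Python sketch.
       pstotext itself works from PostScript Encoding vectors, whereas this
       sketch starts from Unicode text purely for illustration. The names
       FALLBACKS and to_latin1 are hypothetical, only the em dash entry
       comes from this page, and the "?" placeholder for unmapped characters
       is an assumption.

```python
# Replacement sequences for characters outside ISO 8859-1.
# Only the em dash entry is taken from the man page text.
FALLBACKS = {"\u2014": "---"}


def to_latin1(text: str, fallbacks=FALLBACKS, unknown: str = "?") -> bytes:
    """Encode text as ISO 8859-1, substituting sequences where needed."""
    out = bytearray()
    for ch in text:
        if ord(ch) < 256:
            out.append(ord(ch))          # directly representable in Latin-1
        else:
            # assumption: unmapped characters become the `unknown` marker
            out += fallbacks.get(ch, unknown).encode("latin-1")
    return bytes(out)
```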

       (pstotext no longer translates hyphen (\255) to minus (\055).)

AUTHOR
       Andrew Birrell (PostScript libraries), Paul McJones (application),
       Russell Lang (Windows and OS/2 adaptation), and Hunter Goatley (VMS
       adaptation).

SEE ALSO
       pstotext incorporates technology originally developed for the Virtual
       Paper project at SRC; see
       http://www.research.digital.com/SRC/virtualpaper/.

       As mentioned above, pstotext invokes Ghostscript. See gs(1) or
       http://www.cs.wisc.edu/~ghost/.

COPYRIGHT
       Copyright 1995-8 Digital Equipment Corporation.
       Distributed only by permission.
       See file pstotext.txt for details.

       Last modified on Sat Feb  5 21:00:00 AEST 2000 by rjl
            modified on Fri Jun  5 14:02:37 PDT 1998 by mcjones
            modified on Wed Jun  7 17:47:56 PDT 1995 by birrell

       This file was generated automatically by mtex software; see the mtex
       home page at http://www.research.digital.com/SRC/mtex/.

								   pstotext(1)