htmlparse man page on Ubuntu

htmlparse(3tcl)			  HTML Parser		       htmlparse(3tcl)

______________________________________________________________________________

NAME
       htmlparse - Procedures to parse HTML strings

SYNOPSIS
       package require Tcl  8.2

       package require struct::stack  1.3

       package require cmdline	1.1

       package require htmlparse  ?1.2?

       ::htmlparse::parse  ?-cmd  cmd?	?-vroot	 tag? ?-split n? ?-incvar var?
       ?-queue q? html

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag

       ::htmlparse::mapEscapes html

       ::htmlparse::2tree html tree

       ::htmlparse::removeVisualFluff tree

       ::htmlparse::removeFormDefs tree

_________________________________________________________________

DESCRIPTION
       The htmlparse package provides commands that allow libraries and appli‐
       cations to parse HTML in	 a  string  into  a  representation  of	 their
       choice.

       The following commands are available:

       ::htmlparse::parse  ?-cmd  cmd?	?-vroot	 tag? ?-split n? ?-incvar var?
       ?-queue q? html
	      This command is the basic parser for  HTML.  It  takes  an  HTML
	      string,  parses  it  and	invokes a command prefix for every tag
	      encountered. It is not necessary for the HTML to	be  valid  for
	      this parser to function. It is the responsibility of the command
	      invoked for every tag to check this. Another  responsibility  of
	      the  invoked command is the handling of tag attributes and char‐
	      acter entities (escaped characters). The parser provides the un-
	      interpreted  tag attributes to the invoked command to aid in the
	      former, and the package at  large	 provides  a  helper  command,
	      ::htmlparse::mapEscapes,	to  aid in the handling of the latter.
	      The parser does ignore  leading  DOCTYPE	declarations  and  all
	      valid HTML comments it encounters.
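
	      As an illustration of the callback mechanism described above,
	      the following minimal sketch registers a command prefix that
	      records every tag it is given. The callback name recordTag and
	      the variable ::seen are hypothetical and not part of the
	      package.

```tcl
package require htmlparse

# Hypothetical callback: the parser appends the tag name, the slash
# marker, the raw attribute string and the text following the tag.
proc recordTag {tag slash param text} {
    lappend ::seen "${slash}${tag}"
}

set ::seen {}
::htmlparse::parse -cmd ::recordTag {<p>Hello <b>world</b></p>}

# Tags are reported in document order. Closing tags carry a leading
# slash here because the callback prepends the slash argument.
puts $::seen
```

	      The callback is named with a leading :: so that it resolves
	      regardless of the namespace in which the parser evaluates it.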

	      All information beyond the HTML string itself is specified via
	      options, which are explained below.

	      Some background information about the parser helps in
	      understanding the options.

	      It  is  capable  of detecting incomplete tags in the HTML string
	      given to it. Under normal	 circumstances	this  will  cause  the
	      parser  to  throw an error, but if the option -incvar is used to
	      specify a global (or namespace) variable, the parser will	 store
	      the  incomplete  part  of	 the input into this variable instead.
	      This will aid greatly in the handling of incrementally  arriving
	      HTML,  as	 the  parser will handle whatever it can and defer the
	      handling of the incomplete part until more data has arrived.
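
	      A minimal sketch of this behavior, assuming the input arrives
	      in two pieces and that the parser prepends the stored remainder
	      on the next call, as the paragraph above implies. The names
	      noteTag, ::tags and ::pending are hypothetical.

```tcl
package require htmlparse

# Hypothetical callback collecting the reported tags.
proc noteTag {tag slash param text} {
    lappend ::tags "${slash}${tag}"
}

set ::tags {}
set ::pending ""

# First chunk ends in the middle of a tag. Without -incvar this call
# would raise an error instead of storing the remainder.
::htmlparse::parse -cmd ::noteTag -incvar ::pending {<p>first <b}
set afterFirst $::pending

# Second chunk completes the tag; the stored remainder is consumed.
::htmlparse::parse -cmd ::noteTag -incvar ::pending {>bold</b></p>}

puts "held back after first chunk: $afterFirst"
puts "tags seen: $::tags"
```

	      Because both calls run in normal mode, each one reports its own
	      virtual root pair around the tags it processed.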

	      Another feature of the parser is its two possible modes of
	      operation. Normal mode is used if the option -queue is not
	      present on the command line invoking the parser. If it is
	      present, the parser switches to incremental mode instead.

	      The main difference is that a parser in normal mode will immedi‐
	      ately invoke the command prefix for each tag it  encounters.  In
	      incremental  mode	 however  the parser will generate a number of
	      scripts which invoke the command prefix for groups  of  tags  in
	      the  HTML	 string	 and then store these scripts in the specified
	      queue. It is then the responsibility of the caller of the parser
	      to ensure the execution of the scripts in the queue.

	      Note:  The  queue	 object given to the parser has to provide the
	      same interface as the queue defined in tcllib  ->	 struct.  This
	      means, for example, that all queues created via that tcllib mod‐
	      ule can be immediately used here. Still, the queue doesn't  have
	      to  come	from tcllib -> struct as long as the same interface is
	      provided.
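
	      The following sketch shows one way a caller might drain the
	      queue in incremental mode. It assumes a queue created with
	      struct::queue from tcllib and that the placeholder @win@ (see
	      the interface description of the command prefix) appears
	      literally in the queued scripts. The names showTag, ::log,
	      ::htmlQueue and the clientdata value .viewer are hypothetical.

```tcl
package require htmlparse
package require struct::queue

# Hypothetical incremental-mode callback: clientdata comes first.
proc showTag {clientdata tag slash param text} {
    lappend ::log [list $clientdata "${slash}${tag}"]
}

set ::log {}
::struct::queue ::htmlQueue

# Nothing is invoked yet; the parser only fills the queue with scripts,
# each covering at most two tags because of -split 2.
::htmlparse::parse -cmd ::showTag -queue ::htmlQueue -split 2 \
    {<ul><li>one</li><li>two</li></ul>}

# The caller is responsible for running the scripts. The placeholder is
# replaced by the real clientdata value just before execution.
while {[::htmlQueue size] > 0} {
    set script [::htmlQueue get]
    uplevel #0 [string map {@win@ .viewer} $script]
}

foreach entry $::log { puts $entry }
::htmlQueue destroy
```

	      Substituting the placeholder at execution time is what makes it
	      possible to parse once and run the scripts later against a
	      different clientdata value.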

	      In both modes the parser will return  an	empty  string  to  the
	      caller.

	      The  -split  option may be given to a parser in incremental mode
	      to specify the size of the groups it creates.  In	 other	words,
	      -split  5	 means	that each of the generated scripts will invoke
	      the command prefix for 5 consecutive tags in the HTML string.  A
	      parser in normal mode will ignore this option and its value.

	      The option -vroot specifies a virtual root tag. A parser in nor‐
	      mal mode will invoke  the	 command  prefix  for  it  immediately
	      before  and  after it processes the tags in the HTML, thus simu‐
	      lating that the HTML string is enclosed in  a  <vroot>  </vroot>
	      combination. In incremental mode however the parser is unable to
	      provide the closing virtual root as  it  never  knows  when  the
	      input  is	 complete.  In this case the first script generated by
	      each invocation of the parser will contain an invocation of  the
	      command prefix for the virtual root as its first command.

	      The following options are available:

	      -cmd cmd
		     The command prefix to invoke for every tag	 in  the  HTML
		     string. Defaults to ::htmlparse::debugCallback.

	      -vroot tag
		     The  virtual  root	 tag  to add around the HTML in normal
		     mode. In incremental mode it is the  first	 tag  in  each
		     chunk processed by the parser, but there will be no clos‐
		     ing tags. Defaults to hmstart.

	      -split n
		     The size of the groups produced by	 an  incremental  mode
		     parser. Ignored when in normal mode. Defaults to 10. Val‐
		     ues <= 0 are not allowed.

	      -incvar var
		     The name of the variable in which to store any incomplete
		     HTML. This is most useful in incremental mode. Without
		     this option the parser throws an error when it sees
		     incomplete HTML, which is appropriate for normal mode.
		     Only incomplete tags are detected, not missing tags.
		     Optional; by default no variable is used.

	      -queue q
		     The queue object in which to store the generated scripts.
		     The presence of this option activates incremental mode;
		     without it the parser operates in normal mode.

	      Interface to the command prefix
		     In normal mode the parser will invoke the command prefix
		     with four arguments appended. See
		     ::htmlparse::debugCallback for a description.

		     In incremental mode, however, the generated scripts will
		     invoke the command prefix with five arguments appended.
		     The last four of these are the same as those mentioned
		     above. The first is a placeholder string (@win@) for a
		     clientdata value to be supplied later, when the generated
		     scripts are actually executed. This could be a Tk window
		     path, for example. This allows the user of this package
		     to preprocess HTML strings without committing them to a
		     specific window or object during parsing; that connection
		     can be made later. It also means that preprocessed HTML
		     can be cached. Of course, nothing prevents the user of
		     the parser from replacing the placeholder with an empty
		     string.

       ::htmlparse::debugCallback ?clientdata? tag slash param textBehindTheTag
	      This command is the standard callback used by the parser in
	      ::htmlparse::parse if none was specified by the user. It simply
	      dumps its arguments to stdout. This callback can be used for
	      both normal and incremental mode of the calling parser. In other
	      words, it accepts four or five arguments. The last four
	      arguments are always present and are described below. The
	      optional leading argument contains the clientdata value passed
	      to the callback by a parser in incremental mode. All callbacks
	      have to follow the signature of this command in the last four
	      arguments, and callbacks used in incremental parsing have to
	      accept all five arguments.

	      The  first argument, clientdata, is optional and present only if
	      this command is invoked by a parser in incremental mode. It con‐
	      tains whatever the user of this package wishes.

	      The  second argument, tag, contains the name of the tag which is
	      currently processed by the parser.

	      The third argument, slash, is either empty or contains  a	 slash
	      character. It allows the callback to distinguish between opening
	      (slash is empty) and closing tags (slash contains a slash	 char‐
	      acter).

	      The  fourth argument, param, contains the un-interpreted list of
	      parameters to the tag.

	      The fifth and last argument, textBehindTheTag, contains the text
	      found by the parser behind the tag named in tag.
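
	      A callback that should work in both modes has to cope with the
	      optional leading clientdata argument. The sketch below shows
	      one way to do this in plain Tcl; the procedure name describeTag
	      and the sample arguments are hypothetical, and the procedure is
	      called directly here rather than through the parser.

```tcl
# Hypothetical callback accepting four arguments (normal mode) or five
# arguments (incremental mode, clientdata first).
proc describeTag {args} {
    if {[llength $args] == 5} {
        foreach {clientdata tag slash param text} $args break
    } else {
        set clientdata ""
        foreach {tag slash param text} $args break
    }
    # An empty slash argument marks an opening tag.
    if {[string equal $slash "/"]} {
        set kind closing
    } else {
        set kind opening
    }
    return "$kind <$tag> param={$param} text={$text} clientdata={$clientdata}"
}

# Called the way a normal-mode parser would call it.
puts [describeTag a {} {href="x.html"} {link text}]

# Called the way a script generated in incremental mode would call it.
puts [describeTag .viewer a / {} {}]
```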

       ::htmlparse::mapEscapes html
	      This command takes an HTML string, substitutes all escape
	      sequences with their actual  characters  and  then  returns  the
	      resulting	 string.   HTML	 strings  which	 do not contain escape
	      sequences are returned unchanged.
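
	      A short usage sketch; the sample strings are illustrative only.

```tcl
package require htmlparse

# Text as it might arrive in the textBehindTheTag argument of a callback.
set raw {x &lt; y &amp;&amp; y &gt; z}
set text [::htmlparse::mapEscapes $raw]
puts $text

# A string without escape sequences comes back unchanged.
set plain [::htmlparse::mapEscapes {no escapes here}]
puts $plain
```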

       ::htmlparse::2tree html tree
	      This command is a wrapper around ::htmlparse::parse which	 takes
	      an  HTML string (in html) and converts it into a tree containing
	      the logical structure of the parsed document. The	 name  of  the
	      tree  is given to the command as its second argument (tree). The
	      command does not generate the tree by itself  but	 expects  that
	      the  caller provided it with an existing and empty tree. It also
	      expects that the specified tree object follows the  same	inter‐
	      face  as the tree object in tcllib -> struct. It doesn't have to
	      be from tcllib -> struct, but it must provide  the  same	inter‐
	      face.

	      The  internal callback does some basic checking of HTML validity
	      and tries to recover from the most  basic	 errors.  The  command
	      returns  the  contents  of its second argument. Side effects are
	      the creation and manipulation of a tree object.

	      Each node in the generated tree represents one tag in the input.
	      The name of the tag is stored in the attribute type of the node.
	      Any HTML attributes coming with the tag are stored unmodified in
	      the attribute data of the node. In other words, the command does
	      not parse HTML attributes into their names and values.

	      If a tag contains text its  node	will  have  children  of  type
	      PCDATA  containing  this	text.  The  text will be stored in the
	      attribute data of these children.
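
	      The following sketch builds a tree and lists the type attribute
	      of each node. It assumes the struct::tree package from tcllib
	      with its version 2 accessors (get node key, keyexists node key)
	      and the default root node name root. The helper nodeTypes and
	      the tree name ::doc are hypothetical.

```tcl
package require htmlparse
package require struct::tree

# Hypothetical helper: depth-first list of the "type" attribute of the
# given node and everything below it.
proc nodeTypes {tree node} {
    set types {}
    if {[$tree keyexists $node type]} {
        lappend types [$tree get $node type]
    }
    foreach child [$tree children $node] {
        foreach t [nodeTypes $tree $child] {
            lappend types $t
        }
    }
    return $types
}

# The caller has to supply an existing, empty tree.
::struct::tree ::doc
::htmlparse::2tree {<html><body><p>Hello</p></body></html>} ::doc

set types [nodeTypes ::doc root]
puts $types
::doc destroy
```

	      The text inside the paragraph shows up as a child node of type
	      PCDATA rather than as an attribute of the p node itself.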

       ::htmlparse::removeVisualFluff tree
	      This command walks a tree as generated by ::htmlparse::2tree and
	      removes all the nodes which represent visual tags and not struc‐
	      tural ones. The purpose of the command is to make the tree  eas‐
	      ier  to  navigate without getting bogged down in visual informa‐
	      tion not relevant to the search. Its only argument is  the  name
	      of the tree to cut down.

       ::htmlparse::removeFormDefs tree
	      Like  ::htmlparse::removeVisualFluff this command is here to cut
	      down on the size of the tree as generated by ::htmlparse::2tree.
	      It  removes  all nodes representing forms and form elements. Its
	      only argument is the name of the tree to cut down.
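
	      A sketch of both pruning commands applied to one tree, under
	      the same assumptions as for ::htmlparse::2tree (struct::tree
	      version 2 accessors, root node named root). The helper
	      nodeTypes, the tree name ::page and the sample markup are
	      hypothetical; which tags count as visual is decided by the
	      package, the example only assumes that b is among them.

```tcl
package require htmlparse
package require struct::tree

# Hypothetical helper: depth-first list of node types.
proc nodeTypes {tree node} {
    set types {}
    if {[$tree keyexists $node type]} {
        lappend types [$tree get $node type]
    }
    foreach child [$tree children $node] {
        foreach t [nodeTypes $tree $child] {
            lappend types $t
        }
    }
    return $types
}

::struct::tree ::page
::htmlparse::2tree \
    {<body><p>Some <b>bold</b> text</p><form><input name="q"></form></body>} \
    ::page

set before [nodeTypes ::page root]

# Both commands modify the tree in place.
::htmlparse::removeVisualFluff ::page
::htmlparse::removeFormDefs ::page

set after [nodeTypes ::page root]
puts "before: $before"
puts "after:  $after"
::page destroy
```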

BUGS, IDEAS, FEEDBACK
       This document, and the package it describes, will undoubtedly contain
       bugs and other problems. Please report such in the category htmlparse
       of the Tcllib SF Trackers
       [http://sourceforge.net/tracker/?group_id=12883]. Please also report
       any ideas for enhancements you may have for either package and/or
       documentation.

SEE ALSO
       struct::tree

KEYWORDS
       html, parsing, queue, tree

CATEGORY
       Text processing

htmlparse			      1.2		       htmlparse(3tcl)