DBACL(1)							      DBACL(1)

NAME
       dbacl - a digramic Bayesian classifier for text recognition.

SYNOPSIS
       dbacl [-01dvnirmwMNDXW] [-T type ] -l category [-h size] [-H gsize] [-x
	      decim] [-q quality] [-w max_order] [-e deftok] [-o  online]  [-L
	      measure] [-z fthresh] [-O ronline]...  [-g regex]...  [FILE]...

       dbacl  [-vnimNRXYP]  [-h	 size]	[-T type] -c category [-c category]...
	      [-f keep]...  [FILE]...

       dbacl -V

OVERVIEW
       dbacl is a Bayesian text and email classifier. When using the -l
       switch, it learns a body of text and produces a file named category
       which summarizes the text. When using the -c switch, it compares an
       input text stream with any number of category files, and outputs
       the name of the closest match, or optionally various numerical
       scores explained below.

       While this manual page is intended as a reference, there are several
       tutorials and documents you can read to	get  specialized  information.
       Specific	 documentation	about  the design of dbacl and the statistical
       models that it uses can be found in dbacl.ps.  For a basic overview  of
       text  classification  using dbacl, see tutorial.html. A companion tuto‐
       rial geared towards email filtering is email.html. If you have  trouble
       getting dbacl to classify reliably, read is_it_working.html.  The USAGE
       section of this manual page also has some examples.

       /usr/local/share/dbacl/doc/dbacl.ps

       /usr/local/share/dbacl/doc/tutorial.html

       /usr/local/share/dbacl/doc/email.html

       /usr/local/share/dbacl/doc/is_it_working.html

       dbacl uses a maximum entropy (minimum divergence) language  model  con‐
       structed	 with  respect to a digramic reference measure (unknown tokens
       are predicted from digrams, i.e. pairs of letters).  Practically,  this
       means  that  a category is constructed from tokens in the training set,
       while previously unseen tokens  can  be	predicted  automatically  from
       their  letters.	A token here is either a word (fragment) or a combina‐
       tion of words (fragments),  selected  according	to  various  switches.
       Learning roughly works by tweaking token probabilities until the train‐
       ing data is least surprising.

EXIT STATUS
       The normal shell exit conventions aren't followed (sorry!). When	 using
       the -l command form, dbacl returns zero on success, nonzero if an error
       occurs. When using the -c form, dbacl returns a positive integer corre‐
       sponding	 to  the  category  with the highest posterior probability. In
       case of a tie, the first most probable category is chosen. If an	 error
       occurs, dbacl returns zero.

DESCRIPTION
       When  using the -l command form, dbacl learns a category when given one
       or more FILE names, which should contain readable  ASCII	 text.	If  no
       FILE  is	 given, dbacl learns from STDIN. If FILE is a directory, it is
       opened and all its files are read,  but	not  its  subdirectories.  The
       result  is  saved  in  the  binary  file named category, and completely
       replaces any previous contents. As a convenience,  if  the  environment
       variable DBACL_PATH contains a directory, then that is prepended to the
       file path, unless category starts with a '/' or a '.'.
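
       For example, assuming the directory $HOME/.dbacl exists, and using
       illustrative category and corpus names, the following two commands
       are equivalent:

       % DBACL_PATH=$HOME/.dbacl dbacl -l spam corpus.txt
       % dbacl -l $HOME/.dbacl/spam corpus.txt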

       The input text for learning is assumed to be unstructured plain text by
       default.	 This  is  not suitable for learning email, because email con‐
       tains various transport encodings and formatting instructions which can
       reduce classification effectiveness. You must use the -T switch in that
       case so that dbacl knows it should perform decoding  and	 filtering  of
       MIME and HTML as appropriate. Appropriate switch values are "-T
       email" for RFC2822 email input, "-T html" for HTML input, "-T xml"
       for generic XML style input, and "-T text" for the default plain
       text format. Other values of the -T switch allow fine tuning of the
       decoding capabilities.
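
       For example, to learn a category from a mailbox full of spam (the
       file and category names are illustrative):

       % dbacl -T email -l spam $HOME/mail/spam.mbox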

       When  using  the	 -c  command form, dbacl attempts to classify the text
       found in FILE, or STDIN if no FILE is  given.  Each  possible  category
       must  be	 given separately, and should be the file name of a previously
       learned text corpus. As a convenience, if the variable DBACL_PATH  con‐
       tains  a	 directory,  it	 is  prepended to each file path which doesn't
       start with a '/' or a '.'. The visible  output  of  the	classification
       depends	on  the	 combination  of  extra switches used. If no switch is
       used, then no output is shown on STDOUT. However, dbacl always produces
       an exit code which can be tested.
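
       For example, the exit code can be tested directly from the shell
       (the category names are illustrative; a value of 1 or 2 indicates
       which category was closest):

       % dbacl -c good -c bad message.txt; echo $?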

       To see any output for a classification, you must use at least one
       of the -v, -U, -n, -N, -D, -d switches. These can sometimes be used
       in combination to produce a natural variation of their individual
       outputs. Where applicable, dbacl also produces warnings on STDERR.

       The -v switch outputs the name of  the  best  category  among  all  the
       choices given.

       The  -U switch outputs the name of the best category followed by a con‐
       fidence percentage. Normally, this is the switch that you want to  use.
       A  percentage  of  100% means that dbacl is sure of its choice, while a
       percentage of 0% means that some other category is equally likely. This
       is  not the model probability, but measures how unambiguous the classi‐
       fication is, and can be used to tag unsure classifications (e.g. if the
       confidence is 25% or less).
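
       For example, the output takes the following form, where the
       category names and the percentage shown are illustrative:

       % dbacl -U -c good -c bad message.txt
       good # 42%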

       The  -N	switch	prints	each category name followed by its (posterior)
       probability, expressed as a percentage. The percentages always  sum  to
       100%.  This is intuitive, but only informative if the document
       being classified contains a handful of tokens (ten or fewer). In
       the common case with many more tokens, the probabilities are always
       extremely close to 100% and 0%.

       The -n switch prints each category name followed by the negative	 loga‐
       rithm  of  its  probability. This is equivalent to using the -N switch,
       but much more useful. The smallest number gives the  best  category.  A
       more  convenient	 form is to use both -n and -v which prints each cate‐
       gory name followed by the cross entropy and the number of  tokens  ana‐
       lyzed.  The  cross  entropy  measures (in bits) the average compression
       rate which is achievable, under the given category model, per token  of
       input  text.  If	 you  use all three of -n,-v,-X then an extra value is
       output for each category, representing a kind of p-value for each cate‐
       gory  score.  This  indicates  how typical the score is compared to the
       training documents, but only works if the -X  switch  was  used	during
       learning, and only for some types of models (e.g. email).  These p-val‐
       ues are uniformly distributed and independent (if  the  categories  are
       independent),  so  can  be  combined using Fisher's chi squared test to
       obtain composite p-values for groupings of categories.

       The -v and -X switches together print each category name followed by  a
       detailed	 decomposition	of  the category score, factored into ( diver‐
       gence rate + shannon entropy rate )* token count @ p-value. Again, this
       only works in some types of models.

       The  -v and -U switches print each category name followed by a decompo‐
       sition of the category score into ( divergence rate +  shannon  entropy
       rate # score variance )* token count.

       The -D switch prints out the input text as modified internally by dbacl
       prior to tokenization. For example, if a MIME encoded email document is
       classified,  then  this	prints	the decoded text that will be actually
       tokenized and classified. This switch is mainly useful for debugging.

       The -d switch dumps tokens and scores while they are being read. It  is
       useful  for  debugging,	or if you want to create graphical representa‐
       tions of the classification. A detailed explanation of  the  output  is
       beyond  the scope of this manual page, but is straightforward if you've
       read dbacl.ps.  Possible variations include -d together with -n or -N.

       Classification can be done with one or several categories in principle.
       When  two or more categories are used, the Bayesian posterior probabil‐
       ity is used, given the input text, with a uniform prior distribution on
       categories.  For	 other	choices	 of  prior,  see the companion utility
       bayesol(1).  When a single category is used, classification can be
       done by comparing the score with a threshold. In practice however,
       much better results are obtained with several categories.
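
       As an illustrative sketch only, a single category threshold test
       could look as follows, where the category name and the cutoff value
       40 are hypothetical and would have to be calibrated on sample
       documents:

       % dbacl -n -c spam message.txt | awk '$2 < 40 { print "spam" }'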

       Learning and classifying cannot be mixed in the same command
       invocation; however, there are no locking issues, and separate
       dbacl processes can operate simultaneously without interfering,
       because file operations are designed to be atomic.

       Finally,	 note that dbacl does not manage your document corpora or your
       computed categories.  In particular, dbacl cannot  add  or  subtract  a
       document	 from  a category file directly.  If you want to learn a cate‐
       gory incrementally, the standard way is to keep adding to your document
       corpus,	and  learn  the	 whole corpus each time. By keeping control of
       your archives, you can never lose the information in  your  categories,
       and  you can easily experiment with different switches or tokenizations
       or sets of training documents if you like.

       If the standard incremental learning method is too slow, the -o	switch
       can  help. This creates a data file named online which contains all the
       document statistics that have been learned. When you use the -l and  -o
       switches	 together,  dbacl  merges  the online data file (if it exists)
       with the new document(s) to be learned, and recreates an	 updated  ver‐
       sion  of online.	 This is equivalent to adding the new documents to the
       corpus and relearning the whole corpus, but faster. However,  documents
       cannot  be  removed  if	you  change your mind. This is a limitation of
       dbacl which cannot be changed for mathematical reasons.	You  can  work
       around  this by making backups of the online data file. It is also pos‐
       sible to merge one or more extra online data  files  simultaneously  by
       using the -O switch one or more times.
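
       For example, the following pair of commands (with illustrative file
       names) produces the same category as learning batch1.txt and
       batch2.txt together in a single run:

       % dbacl -l mystuff -o mystuff.onl batch1.txt
       % dbacl -l mystuff -o mystuff.onl batch2.txt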

SECONDARY SWITCHES
       By default, dbacl classifies the input text as a whole, i.e. it
       outputs a single result even if you specify several input files. If
       you want to classify multiple input files, you can either call
       dbacl repeatedly (which is fast when you use the -m switch), or use
       the -F switch, which prints each input FILE followed by the result
       for that FILE.
       Alternatively, you can classify each line of the input individually, by
       using  the  -f option, which prints only those lines which match one or
       more models identified by keep (use the	category  name	or  number  to
       refer  to a category). This last switch is useful if you want to filter
       out some lines, but note that if the lines are short,  then  the	 error
       rate can be high.

       The  -e,-w,-g,-j	 switches  are	used for selecting an appropriate tok‐
       enization scheme. A token is a word or word fragment or combination  of
       words  or  fragments. The shape of tokens is important because it forms
       the basis of the language models used by dbacl.	The -e switch  selects
       a  predefined tokenization scheme, which is speedy but limited.	The -w
       switch specifies composite tokens derived from the -e switch. For exam‐
       ple,  "-e  alnum	 -w  2"	 means that tokens should be alphanumeric word
       fragments combined into overlapping pairs (bigrams). By default,
       all tokens are converted to lowercase, which reduces the number of
       possible tokens and therefore memory consumption; the -j switch
       preserves the original capitalization instead (see -j below).
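
       For example, the following learns a category from overlapping pairs
       of alphanumeric word fragments (the file name is taken from the
       USAGE section below):

       % dbacl -e alnum -w 2 -l twain Mark_Twain.txt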

       If the -g switch is used, you can completely specify  what  the	tokens
       should look like using a regular expression. Several -g switches can be
       used to construct complex tokenization schemes, and parentheses	within
       each  expression	 can be used to select fragments and combine them into
       n-grams. The cost of such flexibility  is  reduced  classification  and
       learning speed. When experimenting with tokenization schemes, try using
       the -d or -D switches while learning or classifying, as they will print
       the  tokens explicitly so you can see what text fragments are picked up
       or missed out. For regular expression syntax, see regex(7).

       The -h and -H switches regulate how  much  memory  dbacl	 may  use  for
       learning.  Text	classification can use a lot of memory, and by default
       dbacl limits itself even at the expense of learning accuracy.  In  many
       cases  if  a  limit  is	reached,  a warning message will be printed on
       STDERR with some advice.
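
       For example, the following starts with a hash table of 2^18 tokens
       and lets it grow up to 2^21 tokens if needed (the sizes and names
       are illustrative):

       % dbacl -h 18 -H 21 -l big corpus.txt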

       When relearning the same category several times, a significant  speedup
       can  be	obtained by using the -1 switch, as this allows the previously
       learned probabilities to be read from the category and reused.

       Note that classification accuracy depends foremost on the amount
       and quality of the training samples, and only then on the amount of
       tweaking.

OPTIONS
       -0     When  learning,  prevents	 weight	 preloading.  Normally,	 dbacl
	      checks if the category file already exists, and if so, tries  to
	      use  the existing weights as a starting point. This can dramati‐
	      cally speed up learning.	If the -0 (zero) switch is  set,  then
	      dbacl  behaves  as  if  no category file already exists. This is
	      mainly useful for	 testing.   This  switch  is  now  enabled  by
	      default,	to protect against weight drift which can reduce accu‐
	      racy over many learning iterations. Use -1 to force preloading.

       -1     Force weight preloading if the category file already exists. See
	      discussion of the -0 switch.

       -a     Append  scores.  Every  input  line is written to STDOUT and the
	      dbacl scores are appended. This  is  useful  for	postprocessing
	      with  bayesol(1).	  For ease of processing, every original input
	      line is indented by a single space (to distinguish them from the
	      appended	scores),  and the line with the scores (if -n is used)
	      is prefixed with the string "scores ". If a second copy of dbacl
	      needs  to	 read this output later, it should be invoked with the
	      -A switch.
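
	      As a hypothetical sketch, assuming a risk specification file
	      my.risk as described in bayesol(1):

	      % dbacl -a -n -c good -c bad sample.txt | bayesol -c my.risk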

       -d     Dump the model parameters to STDOUT. In conjunction with the  -l
	      option,  this  produces  a human-readable summary of the maximum
	      entropy model. In conjunction with the -c option,	 displays  the
	      contribution  of	each  token to the final score. Suppresses all
	      other normal output.

       -e     Select character class for default (not  regex-based)  tokeniza‐
	      tion.  By default, tokens are alphabetic strings only. This cor‐
	      responds to the case when deftok is "alpha". Possible values for
	      deftok  are  "alpha", "alnum", "graph", "char", "cef" and "adp".
	      The last two are custom tokenizers intended for email  messages.
	      See  also	 isalpha(3).   The  "char"  tokenizer  picks up single
	      printable characters rather than bigger tokens, and is  intended
	      for testing only.

       -f     Filter  each  line  of  input separately, passing to STDOUT only
	      lines which match the category identified as keep.  This	option
	      should  be used repeatedly for each category which must be kept.
	      keep can be either the category file name, or a positive integer
	      representing  the required category in the same order it appears
	      on the command line.

	      Output lines are flushed as soon as they	are  written.  If  the
	      input  file  is  a  pipe or character device, then an attempt is
	      made to use line buffering mode, otherwise  the  more  efficient
	      block buffering is used.
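
	      For example, both invocations below keep only the lines
	      matching the first category (the names are illustrative):

	      % dbacl -c english -c noise -f english mixed.txt
	      % dbacl -c english -c noise -f 1 mixed.txt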

       -g     Learn only features described by the extended regular expression
	      regex.  This overrides the default feature selection method (see
	      -w  option) and learns, for each line of input, only tokens con‐
	      structed from the	 concatenation	of  strings  which  match  the
	      tagged subexpressions within the supplied regex.	All substrings
	      which match regex within a suffix of each input line are treated
	      as features, even if they overlap on the input line.

	      As  an  optional convenience, regex can include the suffix ||xyz
	      which indicates which  parenthesized  subexpressions  should  be
	      tagged. In this case, xyz should consist exclusively of digits 1
	      to 9, numbering exactly those  subexpressions  which  should  be
	      tagged.  Alternatively,  if  no  parentheses exist within regex,
	      then it is assumed that the whole expression must be captured.

       -h     Set the size of the hash table to 2^size elements. When
	      using the -l option, this refers to the total number of
	      features allowed in the maximum entropy model being learned.
	      When using the -c option together with the -M switch and
	      multinomial type categories, this refers to the maximum
	      number of features taken into account during classification.
	      Without the -M switch, this option has no effect when
	      classifying.

       -i     Fully internationalized mode. Forces the use of wide  characters
	      internally,  which  is  necessary in some locales. This incurs a
	      noticeable performance penalty.

       -j     Make features case sensitive. Normally, all  features  are  con‐
	      verted  to  lower	 case during processing, which reduces storage
	      requirements  and	 improves  statistical	estimates  for	 small
	      datasets.	 With this option, the original capitalization is used
	      for each feature. This can improve classification accuracy.

       -m     Aggressively maps categories into memory and locks them into RAM
	      to  prevent  swapping, if possible. This is useful when speed is
	      paramount and memory is plentiful, for example when testing  the
	      classifier on large datasets.

	      Locking  may  require  relaxing user limits with ulimit(1).  Ask
	      your system administrator.  Beware  when	using  the  -m	switch
	      together	with  the  -o  switch,	as only one dbacl process must
	      learn or classify at a time to prevent file  corruption.	If  no
	      learning	takes  place,  then  the  -m switch for classifying is
	      always safe to use. See also the discussion for the -o switch.

       -n     Print scores for each category.  Each score is  the  product  of
	      two  numbers,  the cross entropy and the complexity of the input
	      text under each model. Multiplied together, they	represent  the
	      log probability that the input resembles the model. To see these
	      numbers separately, use also the -v option. In conjunction  with
	      the  -f  option,	stops  filtering  but  prints  each input line
	      prepended with a list of scores for that line.

       -o     When learning, reads/writes partial token counts so they can
	      be reused. Normally, category files are learned from exactly
	      the input data given, and don't contain extraneous
	      information. When this option is in effect, some extra
	      information is saved in the file online, after all input has
	      been read. This information can be reread the next time that
	      learning occurs, to continue where the previous dataset left
	      off. If online doesn't exist, it is created. If online
	      exists, it is read before learning, and updated afterwards.
	      The file is approximately 3 times bigger (at least) than the
	      learned category.

	      In dbacl, file updates are atomic, but if using the -o
	      switch, two or more processes should not learn
	      simultaneously, as only one process will write a lasting
	      category and memory dump. The -m switch can also speed up
	      online learning, but beware of possible corruption. Only one
	      process should read or write a file. This option is intended
	      primarily for controlled test runs. See also the -O (big-oh)
	      switch.

       -q     Select quality of learning, where quality can be 1, 2, 3 or
	      4. Higher values take longer to learn, and should be
	      slightly more accurate. The default quality is 1 if the
	      category file doesn't exist or weights cannot be preloaded,
	      and 2 otherwise.

       -r     Learn the digramic reference model only. Skips the  learning  of
	      extra features in the text corpus.

       -v     Verbose  mode.  When learning, print out details of the computa‐
	      tion, when classifying, print out the name of the most  probable
	      category.	  In conjunction with the -n option, prints the scores
	      as an explicit product of the cross entropy and the complexity.

       -w     Select default features to be n-grams up to max_order.  This  is
	      incompatible  with the -g option, which always takes precedence.
	      If no -w or -g options are given, dbacl assumes -w 1. Note  that
	      n-grams  for  n  greater	than  1 do not straddle line breaks by
	      default.	The -S switch enables line straddling.

       -x     Set decimation probability to 1 - 2^(-decim).  To reduce	memory
	      requirements  when  learning,  some inputs are randomly skipped,
	      and only a few are added to the model.  Exact behaviour  depends
	      on  the  applicable  -T option (default is -T "text").  When the
	      type is not "email" (eg "text"), then individual input  features
	      are added with probability 2^(-decim). When the type is "email",
	      then full input messages are added with probability  2^(-decim).
	      Within each such message, all features are used.

       -z     When learning, only take into account features whose
	      occurrence count is strictly greater than fthresh. By
	      default, fthresh is zero, so all features in the training
	      corpus are used. A negative value of fthresh causes dbacl to
	      subtract from the maximum observed feature count, and to use
	      that if it is positive. For example, -z 1 means dbacl only
	      learns features which occur at least twice in the corpus,
	      and -z -5 means dbacl only learns the feature(s) whose
	      occurrence count is within 4 of the global maximum.

       -A     Expect  indented	input  and  scores.  With  this	 switch, dbacl
	      expects input lines to be indented by a single  space  character
	      (which  is then skipped).	 Lines starting with any other charac‐
	      ter are ignored. This is the counterpart to the -a switch above.
	      When used together with the -a switch, dbacl outputs the skipped
	      lines as they are, and reinserts the space at the front of  each
	      processed input line.

       -D     Print debug output. Do not use normally, but can be very
	      useful for displaying the list of features picked up while
	      learning.

       -F     For each FILE of input, print the	 FILE  name  followed  by  the
	      classification  result  (normally	 dbacl	only  prints  a single
	      result even if multiple files are listed as input).

       -H     Allow hash table to grow up to a	maximum	 of  2^gsize  elements
	      during learning. Initial size is given by -h option.

       -L     Select the digramic reference measure for character transitions.
	      The measure can be one of "uniform",  "dirichlet"	 or  "maxent".
	      Default is "uniform".

       -M     Force  multinomial calculations. When learning, forces the model
	      features to be treated multinomially. When classifying, corrects
	      entropy scores to reflect multinomial probabilities (only appli‐
	      cable to multinomial type	 models,  if  present).	  Scores  will
	      always be lower, because the ordering of features is lost.

       -N     Print  posterior	probabilities for each category.  This assumes
	      the supplied categories form an exhaustive  list	of  possibili‐
	      ties.   In  conjunction  with the -f option, stops filtering but
	      prints each input line prepended with a summary of the posterior
	      distribution for that line.

       -O     This  switch  causes  the	 online	 data file named ronline to be
	      merged during learning. The ronline file must be	created	 using
	      the  -o (little-oh) switch.  Several -O data files can be merged
	      simultaneously. This is intended to be a read  only  version  of
	      -o, to allow piecing together of several sets of preparsed data.
	      See the description of the -o switch.
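
	      As a hypothetical sketch, two online data files produced
	      earlier with -o could be merged while learning an empty
	      input (all file names are illustrative):

	      % dbacl -l all -o all.onl -O part1.onl -O part2.onl /dev/null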

       -P     Correct the category scores to include estimated prior
	      probabilities. The prior probability estimate for each
	      category is proportional to the number of documents or, if
	      that doesn't make sense, the number of unique features. This
	      can help with "balancing" when one category is learned from
	      much more data than another. If all categories are learned
	      from approximately the same amount of data (or maybe within
	      a factor of 2), then this option should have little
	      qualitative effect.

       -R     Include an extra category for purely random text. The
	      category is called "random".  Only makes sense when using
	      the -c option.

       -S     Enable line straddling. This is  useful  together	 with  the  -w
	      option  to  allow	 n-grams for n > 1 to ignore line breaks, so a
	      complex token can continue past the end of the line. This is not
	      recommended for email.

       -T     Specify  nonstandard text format. By default, dbacl assumes that
	      the input text is a purely ASCII text file. This corresponds  to
	      the case when type is "text".

	      There  are several types and subtypes which can be used to clean
	      the input text of extraneous tokens before  actual  learning  or
	      classifying  takes place. Each (sub)type you wish to use must be
	      indicated with a separate -T option on  the  command  line,  and
	      automatically implies the corresponding type.

	      The  "text"  type	 is for unstructured plain text. No cleanup is
	      performed. This is the default if no types are given on the com‐
	      mand line.

	      The "email" type is for mbox format input files or single RFC822
	      emails.  Headers are recognized and most are skipped. To include
	      extra  RFC822  standard  headers (except for trace headers), use
	      the "email:headers" subtype.  To include trace headers, use  the
	      "email:theaders"	subtype.  To include all headers in the email,
	      use the "email:xheaders" subtype. To skip	 all  headers,	except
	      the  subject,  use "email:noheaders". To scan binary attachments
	      for strings, use the "email:atts" subtype.
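
	      For example, to learn a spam category including the extra
	      RFC822 headers (the subtype implies "-T email", and the file
	      name is illustrative):

	      % dbacl -T email:headers -l spam $MAILDIR/spam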

	      When the "email" type is in effect, HTML markup is automatically
	      removed  from text attachments except text/plain attachments. To
	      also  remove  HTML  markup  from	plain  text  attachments,  use
	      "email:noplain".	To  prevent  HTML  markup  removal in all text
	      attachments, use "email:plain".

	      The "html" type is for removing HTML markup (between <html>  and
	      </html>  tags)  and  surrounding	text. Note that if the "email"
	      type is enabled, then "html" is automatically enabled  for  com‐
	      patible message attachments only.

	      The  "xml"  type	is  like "html", but doesn't honour <html> and
	      </html>, and doesn't interpret tags  (so	this  should  be  more
	      properly	called	"angle	markup" removal, and has nothing to do
	      with actual XML semantics).

	      When "html" is enabled, most markup  attributes  are  lost  (for
	      values  of  'most'  close	 to  'all').  The "html:links" subtype
	      forces link urls to be parsed and learned, which would otherwise
	      be ignored. The "html:alt" subtype forces parsing of alternative
	      text  in	ALT   attributes   and	 various   other   tags.   The
	      "html:scripts"  subtype forces parsing of scripts, "html:styles"
	      forces parsing of styles, "html:forms" forces  parsing  of  form
	      values, while "html:comments" forces parsing of HTML comments.

       -U     Print  (U)nambiguity.   When  used  in  conjunction  with the -v
	      switch, prints scores followed by their empirical standard devi‐
	      ations.  When  used alone, prints the best category, followed by
	      an estimated probability that this category choice is  unambigu‐
	      ous. More precisely, the probability measures lack of overlap of
	      CLT confidence intervals for each category score	(If  there  is
	      overlap, then there is ambiguity).

	      This estimated probability can be used as an "unsure" flag, e.g.
	      if the estimated probability is  lower  than  50%.  Formally,  a
	      score of 0% means another category is equally likely to apply to
	      the input, and a score of 100% means no other category is likely
	      to  apply	 to  the  input.  Note that this type of confidence is
	      unrelated to the -X switch. Also, the  probability  estimate  is
	      usually low if the document is short, or if the message contains
	      many tokens that have never been seen before  (only  applies  to
	      uniform digramic measure).
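
	      As an illustrative sketch, a wrapper can parse this output
	      to flag unsure classifications (the 50% cutoff is
	      arbitrary):

	      % dbacl -U -T email -c work -c spam message.txt |
		     awk '{ if ($3+0 < 50) print "unsure"; else print $1 }'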

       -V     Print the program version number and exit.

       -W     Like -w, but prevents features from straddling newlines. See the
	      description of -w.

       -X     Print the confidence in the score calculated for each  category,
	      when  used together with the -n or -N switch. Prepares the model
	      for confidence scores, when used with the -l switch.  The confi‐
	      dence  is	 an  estimate of the typicality of the score, assuming
	      the null hypothesis that the given  category  is	correct.  When
	      used  with  the  -v  switch  alone,  factorizes the score as the
	      empirical divergence plus the  shannon  entropy,	multiplied  by
	      complexity, in that order. The -X switch is not supported in all
	      possible models, and displays a percentage of "0.0" if it	 can't
	      be calculated. Note that for unknown documents, it is quite com‐
	      mon to have confidences close to zero.

       -Y     Print the cumulative media counts.  Some	tokenizers  include  a
	      medium variable with each token: for example, in email classifi‐
	      cation the word "the" can appear in the subject or the body of a
	      message,	but  the  subject is counted as a separate medium from
	      the body. This allows the token frequencies to be kept separate,
	      even  though the word is the same. Currently, up to 16 different
	      media are supported (0-15), with	the  following	interpretation
	      for email:

	       0   unused.
	       1   default medium.
	       2   mail body or attachment in HTML format.
	       3   mail body or attachment in plain text format.
	       4   mail header unknown.
	       5   User-Agent, Comments, Keywords, Note
	       6   X-MS*, Categor*, Priority, Importance, Thread-*
	       7   X-*
	       8   List-*
	       9   MIME-Version, Content-*
	       10  Subject
	       11  To
	       12  Sender, Sent, BCC, CC, From
	       13  Resent-*, Original-*
	       14  Message-ID, References, In-Reply-To
	       15  Received, Return-Path, Return-Receipt-To, Reply-To

	      The -Y switch prints the number of tokens observed in each sepa‐
	      rate medium, in order from 0 to 15.

USAGE
       To create two category files in the current directory  from  two	 ASCII
       text  files  named  Mark_Twain.txt  and William_Shakespeare.txt respec‐
       tively, type:

       % dbacl -l twain Mark_Twain.txt
       % dbacl -l shake William_Shakespeare.txt

       Now you can classify input text, for example:

       % echo "howdy" | dbacl -v -c twain -c shake
       twain
       % echo "to be or not to be" | dbacl -v -c twain -c shake
       shake

       Note that at least the -v option is necessary, otherwise dbacl does
       not print anything. The return value is 1 in the first case, 2 in
       the second.

       % echo "to be or not to be" | dbacl -v -N -c twain -c shake
       twain 22.63% shake 77.37%
       % echo "to be or not to be" | dbacl -v -n -c twain -c shake
       twain  7.04 * 6.0 shake	6.74 * 6.0

       These invocations are equivalent. The numbers 6.74 and  7.04  represent
       how  close the average token is to each category, and 6.0 is the number
       of tokens observed. If you want to  print  a  simple  confidence	 value
       together with the best category, replace -v with -U.

       % echo "to be or not to be" | dbacl -U -c twain -c shake
       shake # 34%

       Note  that the true probability of category shake versus category twain
       is 77.37%, but the calculation is somewhat ambiguous, and  34%  is  the
       confidence out of 100% that the calculation is qualitatively correct.

       Suppose	a  file	 document.txt contains English text lines interspersed
       with noise lines. To filter out the noise lines from the English lines,
       assuming you have an existing category, say shake, type:

       % dbacl -c shake -f shake -R document.txt > document.txt_eng
       % dbacl -c shake -f random -R document.txt > document.txt_rnd

       Note  that  the	quality of the results will vary depending on how well
       the categories shake and random represent each input line.  It is some‐
       times  useful  to see the posterior probabilities for each line without
       filtering:

       % dbacl -c shake -f shake -RN document.txt > document.txt_probs

       You can now postprocess the posterior probabilities for	each  line  of
       text  with  another script, to replicate an arbitrary Bayesian decision
       rule of your choice.

       In the special case of exactly two  categories,	the  optimal  Bayesian
       decision	 procedure can be implemented for documents as follows: let p1
       be the prior probability that the input text  is	 classified  as	 cate‐
       gory1.  Consequently, the prior probability of classifying as category2
       is 1 - p1.  Let u12 be the cost of  misclassifying  a  category1	 input
       text as belonging to category2 and vice versa for u21.  We assume there
       is no cost for  classifying  correctly.	 Then  the  following  command
       implements the optimal Bayesian decision:

       % dbacl -n -c category1 -c category2 | awk -v p1=0.5 -v u12=1 \
	      -v u21=1 '{ if($2 * p1 * u12 > $4 * (1 - p1) * u21)
	      { print $1; } else { print $3; } }'

       Here the numerical values given to p1, u12 and u21 on the awk
       command line are placeholders, and should be replaced by your own
       estimates.

       dbacl can also be used in conjunction with procmail(1) to  implement  a
       simple  Bayesian email classification system. Assume that incoming mail
       should be automatically delivered to one of three mail folders  located
       in  $MAILDIR and named work, personal, and spam.	 Initially, these must
       be created and filled with appropriate  sample  emails.	 A  crontab(1)
       file can be used to learn the three categories once a day, e.g.

       CATS=$HOME/.dbacl
       5  0 * * * dbacl -T email -l $CATS/work $MAILDIR/work
       10 0 * * * dbacl -T email -l $CATS/personal $MAILDIR/personal
       15 0 * * * dbacl -T email -l $CATS/spam $MAILDIR/spam

       To  automatically  deliver  each	 incoming  email  into the appropriate
       folder, the following procmailrc(5) recipe fragment could be used:

       CATS=$HOME/.dbacl

       # run the spam classifier
       :0 c
       YAY=| dbacl -vT email -c $CATS/work -c $CATS/personal -c $CATS/spam

       # send to the appropriate mailbox
       :0:
       * ? test -n "$YAY"
       $MAILDIR/$YAY

       :0:
       $DEFAULT

       Sometimes, dbacl will send the email to	the  wrong  mailbox.  In  that
       case, the misclassified message should be removed from its wrong desti‐
       nation and placed in the correct mailbox.  The error will be  corrected
       the  next  time	your messages are learned.  If it is left in the wrong
       category, dbacl will learn the wrong corpus statistics.

       The default text features (tokens) read by dbacl are purely  alphabetic
       strings,	 which minimizes memory requirements but can be unrealistic in
       some cases. To construct models based on alphanumeric tokens,  use  the
       -e  switch.  The	 example below also uses the optional -D switch, which
       prints a list of actual tokens found in the document:

       % dbacl -e alnum -D -l twain Mark_Twain.txt | less

       It is also possible to override the default  feature  selection	method
       used  to	 learn the category model by means of regular expressions. For
       example, the following duplicates the default feature selection	method
       in the C locale, while being much slower:

       % dbacl -l twain -g '^([[:alpha:]]+)' -g '[^[:alpha:]]([[:alpha:]]+)'
	      Mark_Twain.txt

       The category twain which is obtained depends only on single  alphabetic
       words  in  the text file Mark_Twain.txt (and computed digram statistics
       for prediction).	 For a second example, the following command builds  a
       smoothed	 Markovian  (word bigram) model which depends on pairs of con‐
       secutive words within each line	(but  pairs  cannot  straddle  a  line
       break):

       % dbacl -l twain2 -g '(^|[^[:alpha:]])([[:alpha:]]+)||2' -g
	      '(^|[^[:alpha:]])([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)||23'
	      Mark_Twain.txt

       More generally, line-based n-gram models of all orders (up to 7) can be
       built in a similar way.	 To  construct	paragraph  based  models,  you
       should  reformat	 the input corpora with awk(1) or sed(1) to obtain one
       paragraph per line. Line size is limited by available memory, but  note
       that regex performance will degrade quickly for long lines.

PERFORMANCE
       The  underlying assumption of statistical learning is that a relatively
       small number of training documents can represent a much larger  set  of
       input  documents.  Thus	in  the long run, learning can grind to a halt
       without serious impact on classification accuracy. While	 not  true  in
       reality,	 this assumption is surprisingly accurate for problems such as
       email filtering.	 In practice, this means that a well chosen corpus  on
       the  order  of ten thousand documents is sufficient for highly accurate
       results for years.  Continual  learning	after  such  a	critical  mass
       results in diminishing returns.	Of course, when real world input docu‐
       ment patterns change dramatically, the predictive power of  the	models
       can  be	lost.  At  the other end, a few hundred documents already give
       acceptable results in most cases.

       dbacl is heavily optimized for the case of frequent classifications but
       infrequent  batch  learning.  This  is  the  long run optimum described
       above. Under ideal conditions, dbacl can classify a hundred emails  per
       second on low-end hardware (a 500MHz Pentium III). Learning speed is not
       very much slower, but takes effectively much longer for large  document
       collections for various reasons.	 When using the -m switch, data struc‐
       tures are aggressively mapped into memory if possible,  reducing	 over‐
       heads for both I/O and memory allocations.

       dbacl  throws  away its input as soon as possible, and has no limits on
       the input document size. Both classification  and  learning  speed  are
       directly	 proportional to the number of tokens in the input, but learn‐
       ing also needs a nonlinear optimization step which takes	 time  propor‐
       tional  to the number of unique tokens discovered.  At time of writing,
       dbacl is one of the fastest open source mail filters given its  optimal
       usage scenario, but uses more memory for learning than other filters.

MULTIPLE PROCESSES AND DATA CORRUPTION
       When  saving category files, dbacl first writes out a temporary file in
       the same location, and renames it afterwards. If	 a  problem  or	 crash
       occurs  during  learning,  the  old  category  file  is	therefore left
       untouched. This ensures that categories can never be corrupted, no mat‐
       ter  how	 many  processes  try to simultaneously learn or classify, and
       means that valid categories are available  for  classification  at  any
       time.

       When  using  the	 -m switch, file contents are memory mapped for speedy
       reading and writing. This, together with the  -o	 switch,  is  intended
       mainly for testing purposes, when tens of thousands of messages must be
       learned and scored in a laboratory to measure dbacl's accuracy. Because
       no  file	 locking is attempted for performance reasons, corruptions are
       possible, unless you make sure that only one  dbacl  process  reads  or
       writes  any  file  at  any given time. This is the only case (-m and -o
       together) when corruption is possible.

MEMORY USE
       When classifying a document, dbacl loads all indicated categories  into
       RAM,  so	 the total memory needed is approximately the sum of the cate‐
       gory file sizes plus a fixed small overhead.   The  input  document  is
       consumed	 while	being  read, so its size doesn't matter, but very long
       lines can take up space.	 When using the -m switch, the categories  are
       read using mmap(2) as available.

       When  learning,	dbacl keeps a large structure in memory which contains
       many objects which won't be saved into the output category. The size of
       this structure is proportional to the number of unique tokens read, but
       not the size of the input documents, since  they	 are  discarded	 while
       being  read.  As a rough guide, this structure is 4x-5x the size of the
       final category file that is produced.

       To prevent unchecked memory growth, dbacl allocates by default a	 fixed
       smallish	 amount of memory for tokens. When this space is used up, fur‐
       ther tokens are discarded which has the effect of skewing  the  learned
       category making it less usable as more tokens are dropped. A warning is
       printed on STDERR in such a case.

       The -h switch lets you fix the initial size of the token space in  pow‐
       ers of 2, i.e. "-h 17" means 2^17 = 131072 possible tokens. If you type
       "dbacl -V", you can see the number of bytes needed for each token  when
       either  learning	 or  classifying.  Multiply this number by the maximum
       number of possible tokens to estimate the memory needed	for  learning.
       The  -H	switch	lets  dbacl  grow its tables automatically if and when
       needed, up to a maximum specified. So if you type  "-H  21",  then  the
       initial	size  will  be doubled repeatedly if necessary, up to approxi‐
       mately two million unique tokens.
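
       As an illustrative calculation, if "dbacl -V" reported, say, 18
       bytes per token for learning (a made-up figure; the true value
       depends on the build), then "-H 21" would budget at most 2^21 * 18
       bytes, or about 36MB, for the token table.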

       When learning with the -X switch, a handful of input documents are also
       kept in RAM throughout.

ENVIRONMENT
       DBACL_PATH
	      When this variable is set, its value is prepended to every cate‐
	      gory filename which doesn't start with a '/' or a '.'.

SIGNALS
       INT    If this signal is caught, dbacl simply exits without  doing  any
	      cleanup  or  other  operations. This signal can often be sent by
	      pressing Ctrl-C on the keyboard. See stty(1).

       HUP, QUIT, TERM
	      If one of these signals is caught, dbacl stops reading input and
	      continues	 its operation as if no more input was available. This
	      is a way of quitting gracefully, but note that in learning mode,
	      a	 category  file will be written based on the incomplete input.
	      The QUIT signal can often be sent by pressing Ctrl-\ on the
	      keyboard. See stty(1).

       USR1   If  this	signal is caught, dbacl reloads the current categories
	      at the earliest feasible opportunity. This is not normally  use‐
	      ful  at  all,  but  might be in special cases, such as if the -f
	      switch is invoked together with input from a long running pipe.

NOTES
       dbacl generated category files are in binary format, and may or may not
       be  portable to systems using a different byte order architecture (this
       depends on how dbacl was compiled). The -V switch  prints  out  whether
       categories are portable, or else you can just experiment.

       dbacl  does  not recognize functionally equivalent regular expressions,
       and in this case duplicate features will be counted several times.

       With every learned category, the command line options  that  were  used
       are  saved.   When  classifying, make sure that every relevant category
       was learned with the same set of options (regexes are allowed  to  dif‐
       fer),  otherwise behaviour is undefined. There is no need to repeat all
       the switches when classifying.

       If you get many digitization warnings, then you are trying to learn too
       much  data at once, or your model is too complex.  dbacl is compiled to
       save memory by digitizing final weights, but you can disable  digitiza‐
       tion by editing dbacl.h and recompiling.

       dbacl  offers  several built-in tokenizers (see -e switch) with more to
       come in future versions, as the author invents them.  While the default
       tokenizer  may evolve, no tokenizer should ever be removed, so that you
       can always simulate previous dbacl behaviour subject to bug  fixes  and
       architectural changes.

       The confidence estimates obtained through the -X switch are
       underestimates, i.e. they are more conservative than they should
       be.

BUGS
       "Ya know, some day scientists are gonna invent something that will out‐
       smart a rabbit." (Robot Rabbit, 1953)

SOURCE
       The  source code for the latest version of this program is available at
       the following locations:

       http://www.lbreyer.com/gpl.html
       http://dbacl.sourceforge.net

AUTHOR
       Laird A. Breyer <laird@lbreyer.com>

SEE ALSO
       awk(1), bayesol(1),  crontab(1),	 hmine(1),  hypex(1),  less(1),	 mail‐
       cross(1),   mailfoot(1),	  mailinspect(1),  mailtoe(1),	procmailex(5),
       regex(7), stty(1), sed(1)

Version 1.14.1	      Bayesian Text Classification Tools	      DBACL(1)