annoyance-filter man page on Cygwin

ANNOYANCE-FILTER(1)					   ANNOYANCE-FILTER(1)

NAME
       annoyance-filter - automatically detect junk mail

SYNOPSIS
       annoyance-filter [ options ]

DESCRIPTION
       annoyance-filter	 uses Bayesian statistics to determine the probability
       an E-mail message is junk based on an analysis of its contents compared
       to collections of known junk and legitimate E-mail.

       This  program  is  under	 active	 development;  new versions are posted
       frequently at:
		      http://www.fourmilab.ch/annoyance-filter/
       Please visit this page for news about the program and to	 download  the
       latest version.

       The  project  is	 hosted	 on  SourceForge,  where you will find the CVS
       source code repository and release archives:
		  http://sourceforge.net/projects/annoyancefilter/

USAGE
       annoyance-filter has a multitude of options which permit it to be  used
       in  many	 different  ways,  but	the  most  common application involves
       training the program with collections of legitimate and	junk  mail  in
       order to create a dictionary which indicates the probability that words
       identify a message as junk or non-junk (legitimate).  Training must  be
       done  before the program is used to classify incoming mail, but need be
       done  subsequently  only	 when  adding	messages   to	the   training
       collections.   As  long	as  the	 overall content of the mail, junk and
       legitimate, which you receive remains pretty much the same, there's  no
       need  to	 retrain,  but	the  ability  to  do  so allows the program to
       automatically adapt to evolving message content, which is  particularly
       characteristic of junk mail.

       Suppose	you have a collection of legitimate mail (in other words, mail
       you wish to read) in a file named m-good and a collection of junk  mail
       (that  which you don't wish to read) in file m-junk.  These collections
       may be in ``Unix mail folder'' format, which is simply the text of  one
       or more E-mail messages concatenated together in a single text file, or
       may be the names of directories containing files, each of which may  be
       a  single  E-mail  message or a Unix mail folder.  In either case, if a
       message	file  is  compressed  with  gzip,  it  will  be	 automatically
       uncompressed  on	 the  fly.   Directories of messages may not, however,
       contain other directories of messages.

       To  train  annoyance-filter  with  these	 collections  and   create   a
       dictionary, use a command like:

	annoyance-filter --mail m-good --junk m-junk --prune --write dict.bin

       where dict.bin is the name of the dictionary file you wish to create.

       Now  that the dictionary has been created, you can use it on subsequent
       runs to compute the probability a  message  is  junk  and  classify  it
       accordingly.   Suppose you have an E-mail message in the file mail.txt.
       To compute its junk probability and display it on standard output, use the
       command:

		  annoyance-filter --read dict.bin --test mail.txt

       To  integrate  annoyance-filter	into  a mail processing system such as
       procmail, you'll usually want  to  run  it  as  a  filter  which	 reads
       incoming	 messages  from	 standard  input  (piped  there	 by  the  mail
       processing system), classifies them and adds annotations to the message
       header  indicating  the	classification,	 then  writes the message with
       header annotations to standard output.  The mail processing system  may
       then  examine the header annotations and route the message accordingly.
       To filter a message, again  assuming  the  dictionary  created  by  the
       training run is in the file dict.bin, use the command:

	      annoyance-filter --read dict.bin --transcript - --test -

       Here  the  --transcript	option is used to request the input message be
       copied to an output file, in this case standard	output,	 specified  by
       ``-'', with the message read from standard input, the ``-'' argument to
       the --test option.
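
       Once a message has been filtered this way, the classification can
       be read back from the annotated header.  The following is a minimal
       sketch; the function name af_classification is hypothetical, and it
       assumes the header item is written in the usual ``Name: value''
       form.

```shell
# Print the value of the X-Annoyance-Filter-Classification header item
# found in an annotated message on standard input.  Only the header
# (the lines before the first blank line) is examined, so a look-alike
# line in the message body is ignored.
af_classification() {
    sed -n -e '/^$/q' \
        -e 's/^X-Annoyance-Filter-Classification:[[:space:]]*//p'
}
```

       It could be used at the end of the filter pipeline, for example:

	      annoyance-filter --read dict.bin --transcript - --test - \
		  < mail.txt | af_classification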

OPTIONS
       Options are specified on the command  line.   Options  are  treated  as
       commands—most  instruct	the  program  to perform some specific action;
       consequently, the order in which they  are  specified  is  significant;
       they  are  processed  left to right. Long options beginning with ``--''
       may be abbreviated to any  unambiguous  prefix;	single-letter  options
       introduced by a single ``-'' without arguments may be aggregated.

       --annotate options
		 Add the annotations requested by the characters in options to
		 the transcript generated by the --transcript  option.	 Upper
		 and  lower  case  options are treated identically.  Available
		 annotations are:
		     d	  Decoder diagnostics
		     p	  Parser warnings and error messages
		     w	  Most significant words and their probabilities

       --autoprune n
	      As the dictionary is being built by appending mail to it with
	      the --mail and --junk options, unique words  will	 automatically
	      be  pruned from it whenever the dictionary exceeds approximately
	      n	 bytes.	  This	is  particularly  handy	 when  loading	 large
	      collections  of  messages with --phrasemax set greater than one,
	      as a very	 large	number	of  unique  phrases  may  clutter  the
	      dictionary  being	 built	and exceed the memory capacity of your
	      computer.	 You could split the  mail  collection	into  multiple
	      parts and explicitly --prune after each part, but --autoprune is
	      much more convenient.

       --biasmail n
	      The frequency of words appearing in legitimate mail is  inflated
	      by  the  floating	 point	factor	n,  which defaults to 2.  This
	      biases the classification	 of  messages  in  favour  of  ``false
	      negatives''—junk	mail  deemed  legitimate,  while  reducing the
	      probability of ``false positives'' (legitimate mail  erroneously
	      classified  as  junk,  which is bad).  The higher the setting of
	      --biasmail, the greater the bias in favour  of  false  negatives
	      will be.
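
	      The effect of the bias factor can be illustrated with a
	      little arithmetic.  The sketch below assumes a Graham-style
	      estimate in which the junk probability of a word is its
	      frequency in junk mail divided by the sum of its frequency
	      in junk mail and its biased frequency in legitimate mail.
	      That formula, the function name af_word_prob and its
	      arguments are assumptions made for illustration; this page
	      does not state the exact computation the program performs.

```shell
# af_word_prob GOOD BAD NGOOD NBAD [BIAS]
#   GOOD, BAD    occurrences of the word in legitimate and junk mail
#   NGOOD, NBAD  number of legitimate and junk messages trained on
#   BIAS         inflation factor for legitimate mail (default 2)
# Assumed formula, not taken from the program's source.
af_word_prob() {
    awk -v g="$1" -v b="$2" -v ng="$3" -v nb="$4" -v bias="${5:-2}" 'BEGIN {
        gf = bias * g / ng      # biased frequency in legitimate mail
        bf = b / nb             # frequency in junk mail
        printf "%.4f\n", bf / (gf + bf)
    }'
}
```

	      Under these assumptions, a word seen 5 times in each of two
	      100-message collections scores 0.5000 with a bias of 1 and
	      0.3333 with the default bias of 2, so the bias pulls the
	      word toward the legitimate side.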

       --binword n
	      Binary   character   streams   (for   example,   attachments  of
	      application-specific files, including  the  executable  code  of
	      worm and virus attachments) are scanned and contiguous sequences
	      of alphanumeric ASCII characters	n  characters  or  longer  are
	      added  to	 the  list  of	words in the message.  The dollar sign
	      (``$'')  is  considered  an  alphanumeric	 character  for	 these
	      purposes,	 and  words may have embedded hyphens and apostrophes,
	      but may not begin or end with those characters.  If --binword is
	      set   to	zero,  scanning	 of  binary  attachments  is  disabled
	      entirely.	 The default setting is 5 characters.

       --bsdfolder
	      The next --mail or --junk folder will be parsed using  ``classic
	      BSD''  rules for identifying the start of individual messages in
	      the folder.  In BSD-style folders, the  text  ``From ''  as  the
	      leftmost	characters of a line always denotes the start of a new
	      message: any appearance of this text in  any  other  context  is
	      always  quoted,  often  by  prefixing a ``>'' character.	In the
	      default Unix folder syntax, ``From '' only marks the start of  a
	      new  message  if	it  appears following one or more blank lines.
	      Note that you must specify --bsdfolder before each folder to  be
	      read with BSD rules; it is not a modal setting.

       --classify fname
	      Classify the mail in fname.  If its score equals or exceeds the junk
	      threshold (see --threshjunk), ``JUNK'' is	 written  to  standard
	      output  and the program exits with status code 3. If the message
	      scores  less  than  or  equal  to	 the   mail   threshold	  (see
	      --threshmail),  ``MAIL''	is  written to standard output and the
	      program exits with status	 0.   If  the  message's  score	 falls
	      between the two thresholds, its content is deemed indeterminate;
	      ``INDT'' is written to standard output  and  the	program	 exits
	      with  a  status  of  4.	The  output  can  be  used  to	set an
	      environment variable in Procmail to control the  disposition  of
	      the  message.   If  fname	 is  ``-''  the	 message  is read from
	      standard input.

       --clearjunk
	      Clear appearances of words in junk  mail	from  database.	  Used
	      when preparing a database of legitimate mail.

       --clearmail
	      Clear  appearances  of  words  in legitimate mail from database.
	      Used when preparing a database of junk mail.

       --copyright
	      Print copyright information.

       --csvread fname
	      Import a dictionary from	a  comma-separated  value  (CSV)  file
	      fname.   Records	are  assumed  to  be  in the format written by
	      --csvwrite but need not  be  sorted  in  any  particular	order.
	      Words are added to those already in memory.

       --csvwrite fname
	      Export the dictionary to the comma-separated value (CSV) file
	      fname.  Such files can be loaded into spreadsheet or
	      database	programs  for  further	processing.   Words are sorted
	      first in ascending order of probability they denote  junk	 mail,
	      then lexically.

       --fread, -r fname
	      Load  a  fast  dictionary	 (previously created with the --fwrite
	      option) from file fname.

       --fwrite fname
	      Write a dictionary to the file fname in fast dictionary  format.
	      Fast  dictionaries  are  written in a binary format which is not
	      portable across machines with different byte  order  conventions
	      and   cannot   be	 added	incrementally  to  assemble  a	larger
	      dictionary, but can be loaded in a small fraction	 of  the  time
	      required	by the format created by the --write command.  Using a
	      fast dictionary for  routine  classification  of	incoming  mail
	      drastically  reduces the time consumed in loading the dictionary
	      for each message.
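
	      One plausible workflow is to keep the portable dictionary
	      as the master copy and derive a fast dictionary from it for
	      day-to-day filtering.  The sketch below uses only options
	      described on this page and relies on them being processed
	      left to right; the function names, the AF_BIN variable and
	      the file names are hypothetical.

```shell
# Command to run; override AF_BIN to point at a specific binary.
AF_BIN=${AF_BIN:-annoyance-filter}

# af_make_fast PORTABLE FAST
# Load a dictionary made with --write and save it in fast format.
af_make_fast() {
    "$AF_BIN" --read "$1" --fwrite "$2"
}

# af_filter_fast FAST
# Filter one message from standard input to standard output,
# loading the fast dictionary instead of the portable one.
af_filter_fast() {
    "$AF_BIN" --fread "$1" --transcript - --test -
}
```

	      For example, af_make_fast dict.bin dict.fast would be run
	      once after each training run, and af_filter_fast dict.fast
	      once per incoming message.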

       --help, -u
	      Print how-to-call information including a list of options.

       --junk, -j fname
	      Add the mail in folder fname to the  dictionary  as  junk	 mail.
	      These folders may be compressed by a utility the host system can
	      uncompress;  specify  the	 complete  file	 name  including   the
	      extension	 denoting  its form of compression.  If fname is ``-''
	      the mail folder is read from standard input.

       --list List the dictionary on standard output.

       --mail, -m fname
	      Add the mail in folder fname to  the  dictionary	as  legitimate
	      mail.   These  folders  may  be compressed by a utility the host
	      system can uncompress; specify the complete file name  including
	      the  extension  denoting	its  form of compression.  If fname is
	      ``-'' the mail folder is read from standard input.

       --newword n
	      The probability that a word seen in mail which does  not	appear
	      in  the  dictionary  (or	appeared  too few times to assign it a
	      probability with acceptable confidence) is indicative of junk is
	      set  to n.  The default is 0.2—the odds are that novel words are
	      more likely to appear in legitimate mail than in junk.

       --pdiag fname
	      Write a diagnostic file to the specified	fname  containing  the
	      actual  lines the parser processed (after decoding of MIME parts
	      and exclusion of data deemed unparseable).  Use this option when
	      you suspect problems in decoding or pre-parser filtering.

       --phraselimit n
	      Limit   the   length  of	phrases	 assembled  according  to  the
	      --phrasemin and  --phrasemax  options  to	 n  characters.	  This
	      permits  ignoring	 ``phrases'' consisting of gibberish from mail
	      headers and un-decoded content.  In most cases these items  will
	      be discarded by a --prune in any case, but skipping them as they
	      are generated keeps the dictionary from bloating	in  the	 first
	      place.  The default value is 48 characters.

       --phrasemin n
	      Calculate probabilities of phrases consisting of a minimum of n
	      words.  The default of 1	calculates  probabilities  for	single
	      words.

       --phrasemax n
	      Calculate	 probabilities of phrases consisting of a maximum of n
	      words.  The default of 1	calculates  probabilities  for	single
	      words.  If you set this too large, the dictionary may grow to an
	      absurd size.

       --plot fname
	      After loading the dictionary, create a plot in fname.png of the
	      histogram of words, binned by their probability of appearance in
	      junk mail.  In order to generate the histogram the GNUPLOT and
	      NETPBM utilities must be installed on the system; if they are
	      absent, the --plot option will not be available.

       --pop3port n
	      The POP3 proxy server activated  by  a  subsequent  --pop3server
	      option  will listen for connections on port n.  If no --pop3port
	      is specified, the server will listen  on	the  default  port  of
	      9110.   On  most systems, you'll have to run the program as root
	      if you wish the proxy server to listen on a port	numbered  1023
	      or less.

       --pop3server server[:port]
	      Activate	a  POP3 proxy server which relays requests made on the
	      previously specified --pop3port or the default  of  9110	if  no
	      port  is	specified, to the specified server, which may be given
	      either as an IP address in ``dotted quad'' notation such as
	      10.89.11.131    or    a	fully-qualified	  domain   name	  like
	      pop.someisp.tld.	The port on which the server listens for  POP3
	      connections  may	be  specified  after  the server prefixed by a
	      colon (``:''); if no port is specified, the IANA assigned POP3
	      port  110	 will  be  used.  The POP3 proxy server will pass each
	      message received on behalf of a requestor through the classifier
	      and  return  the	annotated transcript to the requestor, who may
	      then filter it based  on	the  classification  appended  to  the
	      message header. You must load a dictionary before activating the
	      POP3 proxy server, and the --pop3server option must be the  last
	      on  the  command	line.  The server continues to run and service
	      requests until manually terminated.
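
	      The ordering rules above (dictionary first, --pop3port
	      before --pop3server, --pop3server last) can be captured in
	      a small wrapper.  This is a sketch only; the function name,
	      the AF_BIN variable and the use of a fast dictionary are
	      illustrative choices.

```shell
# Command to run; override AF_BIN to point at a specific binary.
AF_BIN=${AF_BIN:-annoyance-filter}

# af_pop3_proxy DICT LISTEN_PORT SERVER[:PORT]
# Start the POP3 proxy.  The dictionary is loaded first, the listening
# port is set next, and --pop3server comes last on the command line.
af_pop3_proxy() {
    "$AF_BIN" --fread "$1" --pop3port "$2" --pop3server "$3"
}
```

	      For example, af_pop3_proxy dict.fast 9110 pop.someisp.tld
	      would relay local connections on port 9110 to port 110 of
	      pop.someisp.tld until the process is terminated.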

       --pop3trace
	      Write a trace of POP3 proxy server operations to standard error.
	      Each  trace  message  (apart from the dump of the body of multi-
	      line replies to clients) is prefixed with the label ``POP3: ''.

       --prune
	      After loading the dictionary from	 --mail	 and  --junk  folders,
	      this   option   discards	 words	 which	 appear	  sufficiently
	      infrequently  that  their	  probability	cannot	 be   reliably
	      estimated.  One usually prunes the dictionary with --prune before
	      using --write to save it for subsequent runs.

       --ptrace
	      Include a token-by-token trace in the --pdiag output file.  This
	      helps  when  adjusting  the  parser's  criteria  for recognising
	      tokens.  Setting this option without also specifying  a  --pdiag
	      file  will  have	no  effect other than perhaps to exercise your
	      fingers typing it on the command line.

       --read, -r fname
	      Load a dictionary (previously created with the  --write  option)
	      from file fname.

       --sigwords n
	      The probability that a message is junk will be computed based on
	      the individual  probabilities  of	 the  n	 words	with  extremal
	      probabilities; that is, probabilities most indicative of junk or
	      mail.  The default is 15, but there's no obvious optimal setting
	      for  this parameter; it depends in part on the average length of
	      messages you receive.
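
	      The idea of scoring a message from its most extreme words
	      can be sketched as follows.  The selection step (keep the n
	      probabilities farthest from 0.5) follows the description
	      above; the combining step uses the common naive Bayes
	      product rule, which is an assumption and may differ from
	      what the program actually computes.  The function name
	      af_combine and the ``word probability'' input format are
	      hypothetical.

```shell
# af_combine [N]
# Reads lines of the form "word probability" on standard input, keeps
# the N probabilities farthest from 0.5 (default 15), and prints a
# combined junk probability.
# The combining formula is an assumption, not taken from the program.
af_combine() {
    awk '{ d = $2 - 0.5; if (d < 0) d = -d; print d, $2 }' |
        LC_ALL=C sort -k1,1nr |
        head -n "${1:-15}" |
        awk 'BEGIN { pj = 1; pm = 1 }
             { pj *= $2; pm *= 1 - $2 }
             END { printf "%.4f\n", pj / (pj + pm) }'
}
```

	      A word with probability 0.5 carries no evidence either way
	      and is the first to be dropped as n shrinks.  The sketch
	      does not guard against a message containing both a 0 and a
	      1 probability, which would make the denominator zero.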

       --statistics
	      After loading the dictionary from	 --mail	 and  --junk  folders,
	      print  statistics	 of  the distribution of junk probabilities of
	      words in the dictionary.	The statistics are written to standard
	      output.

       --test, -t fname
	      Test  mail  in  fname  and write the estimated probability it is
	      junk to standard output unless the --transcript option  is  also
	      specified	 with  standard	 output (``-'') as the destination, in
	      which case the inclusion of the probability  and	classification
	      in  the  transcript  is  adjudged	 sufficient.  If the --verbose
	      option is specified, the individual probabilities of the	``most
	      interesting''  words  in	the  message  will also be output.  If
	      fname is ``-'' the message is read from standard input.

       --threshjunk n
	      Set the threshold for classifying	 a  message  as	 junk  to  the
	      floating	point  probability  value n.  The default threshold is
	      0.9; messages scored above --threshjunk are deemed junk.

       --threshmail n
	      Set the threshold for classifying a message as  legitimate  mail
	      to   the	floating  point	 probability  value  n.	  The  default
	      threshold is 0.9, with messages scored below --threshmail deemed
	      legitimate.    Note  that	 you  may  leave  a  gap  between  the
	      --threshmail and --threshjunk values (although it makes no sense
	      to  set  --threshmail  higher).	Mail  scored  between  the two
	      thresholds will then be judged of uncertain status.
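
	      The interplay of the two thresholds can be sketched as a
	      small decision function.  The rule follows the wording of
	      the --classify option: junk if the score equals or exceeds
	      --threshjunk, legitimate if it is less than or equal to
	      --threshmail, otherwise indeterminate.  The function name
	      af_label and the order in which the two tests are made are
	      assumptions.

```shell
# af_label SCORE [THRESHMAIL [THRESHJUNK]]
# Print JUNK, MAIL or INDT for a junk probability SCORE.
# Both thresholds default to 0.9, the defaults given on this page.
af_label() {
    awk -v s="$1" -v lo="${2:-0.9}" -v hi="${3:-0.9}" 'BEGIN {
        if (s + 0 >= hi + 0)      print "JUNK"
        else if (s + 0 <= lo + 0) print "MAIL"
        else                      print "INDT"
    }'
}
```

	      With the default thresholds no score is indeterminate; a
	      gap appears only when --threshmail is set lower than
	      --threshjunk, as in af_label 0.5 0.2 0.9.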

       --transcript fname
	      Write an annotated transcript of the  original  message  to  the
	      specified	 fname.	  If fname is ``-'', the transcript is written
	      to standard output.  At the end of the message header, the
	      transcript includes an X-Annoyance-Filter-Junk-Probability
	      header item giving the computed probability and an
	      X-Annoyance-Filter-Classification item giving the
	      classification of the message according to the --threshmail
	      and --threshjunk settings; the classification is given as
	      ``Mail'', ``Junk'', or ``Indeterminate''.

       --verbose, -v
	      Print  diagnostic	 information  as  the program performs various
	      operations.

       --version
	      Print program version information.

       --write fname
	      Write a dictionary to the file fname.  The dictionary is written
	      in  a  binary format which may be loaded on subsequent runs with
	      the --read option.  Binary dictionary files are  portable	 among
	      machines with different architectures and byte order.

EXIT STATUS
       The  program  exits  with a status of 0 when processing is successfully
       completed, 1 when an error (I/O or file access in most  cases)  occurs,
       and  2  to  indicate  a	command	 line syntax error.  If the --classify
       option is specified, an exit status of 0 identifies the message	tested
       as  legitimate  mail, 3 marks it as junk, and a status of 4 is returned
       for messages which cannot be confidently classified as either  mail  or
       junk.
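
       A calling script can branch on these status codes.  The sketch
       below maps each documented status to a word; the function name
       af_disposition, the AF_BIN variable and the words printed are
       hypothetical.

```shell
# Command to run; override AF_BIN to point at a specific binary.
AF_BIN=${AF_BIN:-annoyance-filter}

# af_disposition DICT MESSAGE
# Classify MESSAGE using dictionary DICT and print a one-word result.
# The word the program itself prints is discarded; only the exit
# status is used.
af_disposition() {
    status=0
    "$AF_BIN" --read "$1" --classify "$2" >/dev/null || status=$?
    case $status in
        0) echo legitimate ;;
        3) echo junk ;;
        4) echo indeterminate ;;
        1) echo error ;;
        2) echo usage-error ;;
        *) echo unknown ;;
    esac
}
```

       Because status 1 and 2 report failures rather than a verdict, a
       mail delivery script would normally treat them like an
       indeterminate result and deliver the message unfiltered.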

FILES
       Files  are read or written as requested by options on the command line;
       all options which read or write files take a fname argument which gives
       the   file   name.    The   --classify,	--junk,	 --mail,  --test,  and
       --transcript  options  interpret	 an  argument  of  ``-''  as  denoting
       standard input or output.

       On systems which provide the required services and utilities, arguments
       to the --junk and --mail options may be compressed files or the name of
       a  directory  containing	 one or more messages which will be read as if
       logically concatenated.	Messages in the directory may be compressed or
       uncompressed.

       Error  messages	and  diagnostic	 output	 generated  when the --verbose
       option is specified are written to standard error.

BUGS
       Millions, doubtless.  This is a program which must cope	with  whatever
       garbage	is fed to it from mail folders, trying to make the best of it.
       When it messes up, your efforts in identifying the message which caused
       the  problem  and submitting a verbatim copy of it with your bug report
       are much appreciated.

       Please report bugs to bugs@fourmilab.ch and include annoyance-filter in
       the Subject line.  Thanks in advance.

AUTHOR
				     John Walker
			      http://www.fourmilab.ch/

       This software is in the public domain. Permission to use, copy, modify,
       and distribute this software and its documentation for any purpose  and
       without	fee is hereby granted, without any conditions or restrictions.
       This  software  is  provided  ``as  is''	 without  express  or  implied
       warranty.

SEE ALSO
       gnuplot(1), gs(1), gzip(1), netpbm(1), procmail(1), xpdf(1)

       annoyance-filter	   is	written	  using	  the	Literate   Programming
       http://www.literateprogramming.com/  methodology;  the	user   manual,
       program,	 and  internal	documentation  are developed together, closely
       interlinked.  Whenever the program is modified,	the  documentation  is
       automatically updated, reducing the risk of divergence between what the
       manual says and what the program does.

       This man page is intended as a reference for the command	 line  options
       and  most  common  applications	of  the	 program.   For	 comprehensive
       documentation, including details of how to  integrate  annoyance-filter
       with  the procmail mail processing system, please refer to the complete
       documentation published in PDF format, available on the Web at:
	    http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf

       If you have downloaded the annoyance-filter  source  distribution,  the
       corresponding  version  of  annoyance-filter.pdf	 is  included  in  the
       archive.	 You can read PDF files with Acrobat reader (a	free  download
       from   http://www.adobe.com/acrobat/readstep.html)   or	 the  xpdf  or
       Ghostscript (gs) utilities.

4th Berkeley Distribution	  19 FEB 2003		   ANNOYANCE-FILTER(1)