annoyance-filter man page on DragonFly

ANNOYANCE-FILTER(1)					   ANNOYANCE-FILTER(1)

NAME
       annoyance-filter - automatically detect junk mail

SYNOPSIS
       annoyance-filter [ options ]

DESCRIPTION
       annoyance-filter	 uses Bayesian statistics to determine the probability
       an E-mail message is junk based on an analysis of its contents compared
       to collections of known junk and legitimate E-mail.

       The current version of this program is always posted at:
		      http://www.fourmilab.ch/annoyance-filter/
       Please  visit  this page for news about the program and to download the
       latest version.

       The project is hosted on SourceForge,  where  you  will	find  the  CVS
       source code repository and release archives:
		  http://sourceforge.net/projects/annoyancefilter/

USAGE
       annoyance-filter	 has a multitude of options which permit it to be used
       in many different  ways,	 but  the  most	 common	 application  involves
       training	 the  program  with collections of legitimate and junk mail in
       order to create a dictionary which indicates the probability that words
       identify	 a message as junk or non-junk (legitimate).  Training must be
       done before the program is used to classify incoming mail, but need  be
       done   subsequently   only   when   adding  messages  to	 the  training
       collections.  As long as the overall content  of	 the  mail,  junk  and
       legitimate,  which you receive remains pretty much the same, there's no
       need to retrain, but the	 ability  to  do  so  allows  the  program  to
       automatically  adapt to evolving message content, which is particularly
       characteristic of junk mail.

       Suppose you have a collection of legitimate mail (in other words,  mail
       you  wish to read) in a file named m-good and a collection of junk mail
       (that which you don't wish to read) in file m-junk.  These  collections
       may  be in ``Unix mail folder'' format, which is simply the text of one
       or more E-mail messages concatenated together in a single text file, or
       may  be the names of directories containing files, each of which may be
       a single E-mail message or a Unix mail folder.  In either  case,	 if  a
       message	file  is  compressed  with  gzip,  it  will  be	 automatically
       uncompressed on the fly.	 Directories of	 messages  may	not,  however,
       contain other directories of messages.

       To   train   annoyance-filter  with  these  collections	and  create  a
       dictionary, use a command like:

	annoyance-filter --mail m-good --junk m-junk --prune --write dict.bin

       where dict.bin is the name of the dictionary file you wish to create.

       Now that the dictionary has been created, you can use it on  subsequent
       runs  to	 compute  the  probability  a  message is junk and classify it
       accordingly.  Suppose you have an E-mail message in the file  mail.txt.
       To compute its junk priority and display it on standard output, use the
       command:

		  annoyance-filter --read dict.bin --test mail.txt

       To integrate annoyance-filter into a mail  processing  system  such  as
       procmail,  you'll  usually  want	 to  run  it  as  a filter which reads
       incoming	 messages  from	 standard  input  (piped  there	 by  the  mail
       processing system), classifies them and adds annotations to the message
       header indicating the classification,  then  writes  the	 message  with
       header  annotations to standard output.	The mail processing system may
       then examine the header annotations and route the message  accordingly.
       To  filter  a  message,	again  assuming	 the dictionary created by the
       training run is in the file dict.bin, use the command:

	      annoyance-filter --read dict.bin --transcript - --test -

       Here the --transcript option is used to request the  input  message  be
       copied  to  an  output file, in this case standard output, specified by
       ``-'', with the message read from standard input, the ``-'' argument to
       the --test option.

OPTIONS
       Options	are  specified	on  the	 command line.	Options are treated as
       commands—most instruct the program to  perform  some  specific  action;
       consequently,  the  order  in  which they are specified is significant;
       they are processed left to right. Long options  beginning  with	``--''
       may  be	abbreviated  to	 any unambiguous prefix; single-letter options
       introduced by a single ``-'' without arguments may be aggregated.
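
       The abbreviation rule for long options can be sketched as follows.
       This is only an illustration: the option list is a subset taken from
       this page, resolve_option is a hypothetical helper, and the program's
       own option parser may resolve prefixes differently.

```python
# Illustrative subset of the long options documented on this page.
LONG_OPTIONS = ["--phraselimit", "--phrasemin", "--phrasemax", "--prune",
                "--ptrace", "--read", "--test", "--transcript"]


def resolve_option(abbrev, options=LONG_OPTIONS):
    """Return the single long option selected by the prefix abbrev.

    An exact match always wins; otherwise the prefix must match exactly
    one option, or ValueError is raised.
    """
    if abbrev in options:
        return abbrev
    matches = [opt for opt in options if opt.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown option: {abbrev}")
    return matches[0]
```

       With this subset, --tr selects --transcript and --te selects --test,
       while --t alone is rejected because it matches both.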

       --annotate options
		 Add the annotations requested by the characters in options to
		 the  transcript  generated by the --transcript option.	 Upper
		 and lower case options are  treated  identically.   Available
		 annotations are:
			     d	      Decoder diagnostics
			     p	      Parser warnings and error messages
			     w	      Most significant words and their probabilities

       --autoprune n
	      As the dictionary is being built by appending mail  to  it  with
	      the  --mail  and --junk options, unique words will automatically
	      be pruned from it whenever the dictionary exceeds	 approximately
	      n	  bytes.   This	 is  particularly  handy  when	loading	 large
	      collections of messages with --phrasemax set greater  than  one,
	      as  a  very  large  number  of  unique  phrases  may clutter the
	      dictionary being built and exceed the memory  capacity  of  your
	      computer.	  You  could  split  the mail collection into multiple
	      parts and explicitly --prune after each part, but --autoprune is
	      much more convenient.

       --biasmail n
	      The  frequency of words appearing in legitimate mail is inflated
	      by the floating point factor  n,	which  defaults	 to  2.	  This
	      biases  the  classification  of  messages	 in  favour of ``false
	      negatives''—junk mail  deemed  legitimate,  while	 reducing  the
	      probability  of ``false positives'' (legitimate mail erroneously
	      classified as junk, which is bad).  The higher  the  setting  of
	      --biasmail,  the	greater	 the bias in favour of false negatives
	      will be.
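
	      This page does not give the formula the program uses.  As a
	      rough sketch only, the following assumes a common Bayesian
	      filter estimate in which a word's junk probability is the
	      ratio of its frequency in junk to its total frequency, with
	      the legitimate-mail frequency multiplied by the bias.  The
	      function and parameter names are hypothetical.

```python
def junk_probability(junk_count, mail_count, junk_msgs, mail_msgs,
                     biasmail=2.0):
    """Estimate the probability that a word indicates junk.

    ASSUMPTION: a simple frequency-ratio estimate, not necessarily the
    formula annoyance-filter implements. biasmail inflates the word's
    frequency in legitimate mail, pulling the result toward "mail".
    """
    junk_freq = junk_count / junk_msgs
    mail_freq = biasmail * mail_count / mail_msgs
    if junk_freq + mail_freq == 0:
        raise ValueError("word does not appear in either collection")
    return junk_freq / (junk_freq + mail_freq)
```

	      Under this assumption, a word seen equally often in both
	      collections scores 0.5 with no bias and about 0.33 with the
	      default bias of 2.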

       --binword n
	      Binary  character	  streams   (for   example,   attachments   of
	      application-specific  files,  including  the  executable code of
	      worm and virus attachments) are scanned and contiguous sequences
	      of  alphanumeric	ASCII  characters  n  characters or longer are
	      added to the list of words in  the  message.   The  dollar  sign
	      (``$'')  is  considered  an  alphanumeric	 character  for	 these
	      purposes, and words may have embedded hyphens  and  apostrophes,
	      but may not begin or end with those characters.  If --binword is
	      set  to  zero,  scanning	of  binary  attachments	 is   disabled
	      entirely.	 The default setting is 5 characters.
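
	      The scanning rule can be approximated with a regular
	      expression.  This is a sketch, not the program's scanner:
	      binary_words is a hypothetical name, and counting embedded
	      hyphens and apostrophes toward the length n is an assumption.

```python
import re

# A token starts and ends with an ASCII alphanumeric or "$" and may
# contain hyphens and apostrophes in between.
_TOKEN = re.compile(rb"[A-Za-z0-9$](?:[A-Za-z0-9$'-]*[A-Za-z0-9$])?")


def binary_words(data, binword=5):
    """Return tokens of at least binword characters found in data (bytes)."""
    if binword <= 0:
        return []  # a setting of zero disables scanning entirely
    return [m.group().decode("ascii")
            for m in _TOKEN.finditer(data)
            if len(m.group()) >= binword]
```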

       --bsdfolder
	      The  next --mail or --junk folder will be parsed using ``classic
	      BSD'' rules for identifying the start of individual messages  in
	      the  folder.   In	 BSD-style  folders, the text ``From '' as the
	      leftmost characters of a line always denotes the start of a  new
	      message:	any  appearance	 of  this text in any other context is
	      always quoted, often by prefixing a  ``>''  character.   In  the
	      default  Unix folder syntax, ``From '' only marks the start of a
	      new message if it appears following one  or  more	 blank	lines.
	      Note  that you must specify --bsdfolder before each folder to be
	      read with BSD rules; it is not a modal setting.
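
	      The difference between the two folder syntaxes can be
	      sketched as follows.  split_folder is a hypothetical helper,
	      and treating the first line of the folder as if it followed a
	      blank line is an assumption.

```python
def split_folder(text, bsd=False):
    """Split mail folder text into messages at "From " separator lines.

    With bsd=True every line starting with "From " begins a message.
    Otherwise such a line only counts when the previous line is blank.
    """
    messages, current = [], []
    previous_blank = True  # assumed: the folder's first line may start a message
    for line in text.splitlines(keepends=True):
        starts = line.startswith("From ") and (bsd or previous_blank)
        if starts and current:
            messages.append("".join(current))
            current = []
        current.append(line)
        previous_blank = line.strip() == ""
    if current:
        messages.append("".join(current))
    return messages
```

	      A body line such as ``From here on it is text'' that directly
	      follows other text splits the message under BSD rules but not
	      under the default rules.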

       --classify fname
	      Classify the message in fname.  If its junk probability equals
	      or exceeds the junk threshold (see --threshjunk), ``JUNK'' is
	      written to standard output and the program exits with status
	      3.  If the message scores less than or equal to the mail
	      threshold (see --threshmail), ``MAIL'' is written to standard
	      output and the program exits with status 0.  If the score
	      falls between the two thresholds, the content is deemed
	      indeterminate; ``INDT'' is written to standard output and the
	      program exits with status 4.  The output can be used to set an
	      environment variable in procmail to control the disposition of
	      the message.  If fname is ``-'' the message is read from
	      standard input.
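
	      The decision rule described above can be written out as a
	      small function.  The name classify is hypothetical, and
	      testing the junk threshold first simply follows the order of
	      the description.

```python
def classify(probability, threshjunk=0.9, threshmail=0.9):
    """Return (label, exit_status) following the --classify rules.

    The junk threshold is tested first, so with the default equal
    thresholds a score of exactly 0.9 is reported as junk.
    """
    if probability >= threshjunk:
        return "JUNK", 3
    if probability <= threshmail:
        return "MAIL", 0
    return "INDT", 4
```

	      With the default thresholds no score is indeterminate; a gap
	      such as threshmail 0.5 and threshjunk 0.9 makes a score of 0.7
	      come out as INDT.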

       --clearjunk
	      Clear  appearances  of  words  in junk mail from database.  Used
	      when preparing a database of legitimate mail.

       --clearmail
	      Clear appearances of words in  legitimate	 mail  from  database.
	      Used when preparing a database of junk mail.

       --copyright
	      Print copyright information.

       --csvread fname
	      Import  a	 dictionary  from  a  comma-separated value (CSV) file
	      fname.  Records are assumed to  be  in  the  format  written  by
	      --csvwrite  but  need  not  be  sorted  in any particular order.
	      Words are added to those already in memory.

       --csvwrite fname
	      Export the dictionary to the comma-separated value (CSV) file
	      fname.  Such files can be loaded into spreadsheet or database
	      programs for further processing.  Words are sorted first in
	      ascending order of the probability that they denote junk mail,
	      then lexically.

       --fread, -r fname
	      Load a fast dictionary (previously  created  with	 the  --fwrite
	      option) from file fname.

       --fwrite fname
	      Write  a dictionary to the file fname in fast dictionary format.
	      Fast dictionaries are written in a binary format	which  is  not
	      portable	across	machines with different byte order conventions
	      and  cannot  be  added  incrementally  to	 assemble   a	larger
	      dictionary,  but	can  be loaded in a small fraction of the time
	      required by the format created by the --write command.  Using  a
	      fast  dictionary	for  routine  classification  of incoming mail
	      drastically reduces the time consumed in loading the  dictionary
	      for each message.

       --help, -u
	      Print how-to-call information including a list of options.

       --junk, -j fname
	      Add  the	mail  in  folder fname to the dictionary as junk mail.
	      These folders may be compressed by a utility the host system can
	      uncompress;   specify  the  complete  file  name	including  the
	      extension denoting its form of compression.  If fname  is	 ``-''
	      the mail folder is read from standard input.

       --list List the dictionary on standard output.

       --mail, -m fname
	      Add  the	mail  in  folder fname to the dictionary as legitimate
	      mail.  These folders may be compressed by	 a  utility  the  host
	      system  can uncompress; specify the complete file name including
	      the extension denoting its form of  compression.	 If  fname  is
	      ``-'' the mail folder is read from standard input.

       --newword n
	      The  probability	that a word seen in mail which does not appear
	      in the dictionary (or appeared too few  times  to	 assign	 it  a
	      probability with acceptable confidence) is indicative of junk is
	      set to n.	 The default is 0.2—the odds are that novel words  are
	      more likely to appear in legitimate mail than in junk.

       --pdiag fname
	      Write  a	diagnostic  file to the specified fname containing the
	      actual lines the parser processed (after decoding of MIME	 parts
	      and exclusion of data deemed unparseable).  Use this option when
	      you suspect problems in decoding or pre-parser filtering.

       --phraselimit n
	      Limit  the  length  of  phrases  assembled  according   to   the
	      --phrasemin  and	--phrasemax  options  to  n  characters.  This
	      permits ignoring ``phrases'' consisting of gibberish  from  mail
	      headers  and un-decoded content.	In most cases these items will
	      be discarded by a --prune in any case, but skipping them as they
	      are  generated  keeps  the dictionary from bloating in the first
	      place.  The default value is 48 characters.

       --phrasemin n
	      Calculate probabilities of phrases consisting of a minimum of  n
	      words.   The  default  of	 1 calculates probabilities for single
	      words.

       --phrasemax n
	      Calculate probabilities of phrases consisting of a maximum of  n
	      words.   The  default  of	 1 calculates probabilities for single
	      words.  If you set this too large, the dictionary may grow to an
	      absurd size.
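
	      How the three phrase options interact can be sketched as
	      follows.  The helper name phrases is hypothetical, and joining
	      the words of a phrase with a single space is an assumption
	      about how phrases are stored.

```python
def phrases(words, phrasemin=1, phrasemax=1, phraselimit=48):
    """List phrases of phrasemin..phrasemax consecutive words.

    Phrases longer than phraselimit characters are skipped.
    """
    result = []
    for start in range(len(words)):
        for length in range(phrasemin, phrasemax + 1):
            chunk = words[start:start + length]
            if len(chunk) < length:
                break  # ran off the end of the message
            phrase = " ".join(chunk)
            if len(phrase) <= phraselimit:
                result.append(phrase)
    return result
```

	      The number of candidate phrases grows with every increase in
	      --phrasemax, which is why large settings inflate the
	      dictionary.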

       --plot fname
	      After loading the dictionary, create a plot in fname.png of the
	      histogram of words, binned by their probability of appearance in
	      junk mail.  To generate the histogram, the Gnuplot and Netpbm
	      utilities must be installed on the system; if they are absent,
	      the --plot option will not be available.

       --pop3port n
	      The  POP3	 proxy	server	activated by a subsequent --pop3server
	      option will listen for connections on port n.  If no  --pop3port
	      is  specified,  the  server  will	 listen on the default port of
	      9110.  On most systems, you'll have to run the program  as  root
	      if  you  wish the proxy server to listen on a port numbered 1023
	      or less.

       --pop3server server[:port]
	      Activate a POP3 proxy server which relays requests made  on  the
	      previously  specified  --pop3port	 or  the default of 9110 if no
	      port is specified, to the specified server, which may  be	 given
	      either as an IP address in ``dotted quad'' notation such as
	      10.89.11.131 or a fully-qualified domain name like
	      pop.someisp.tld.  The port on which the server listens for POP3
	      connections may be specified after the server, prefixed by a
	      colon (``:''); if no port is specified, the IANA assigned POP3
	      port 110 will be used. The POP3  proxy  server  will  pass  each
	      message received on behalf of a requestor through the classifier
	      and return the annotated transcript to the  requestor,  who  may
	      then  filter  it	based  on  the	classification appended to the
	      message header. You must load a dictionary before activating the
	      POP3  proxy server, and the --pop3server option must be the last
	      on the command line.  The server continues to  run  and  service
	      requests until manually terminated.
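
	      The server[:port] argument syntax can be parsed as in this
	      sketch.  parse_pop3_server is a hypothetical helper and does
	      not validate the host name itself.

```python
def parse_pop3_server(spec, default_port=110):
    """Split "server[:port]" into (server, port), defaulting to port 110."""
    host, sep, port = spec.partition(":")
    if not host:
        raise ValueError("missing server name")
    if not sep:
        return host, default_port
    return host, int(port)  # raises ValueError if the port is not a number
```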

       --pop3trace
	      Write a trace of POP3 proxy server operations to standard error.
	      Each trace message (apart from the dump of the  body  of	multi-
	      line replies to clients) is prefixed with the label ``POP3: ''.

       --prune
	      After  loading  the  dictionary  from --mail and --junk folders,
	      this   option   discards	 words	 which	 appear	  sufficiently
	      infrequently   that   their   probability	  cannot  be  reliably
	      estimated.  One usually prunes the dictionary with --prune
	      before using --write to save it for subsequent runs.

       --ptrace
	      Include a token-by-token trace in the --pdiag output file.  This
	      helps when  adjusting  the  parser's  criteria  for  recognising
	      tokens.	Setting	 this option without also specifying a --pdiag
	      file will have no effect other than  perhaps  to	exercise  your
	      fingers typing it on the command line.

       --read, -r fname
	      Load  a  dictionary (previously created with the --write option)
	      from file fname.

       --sigwords n
	      The probability that a message is junk will be computed based on
	      the  individual  probabilities  of  the  n  words	 with extremal
	      probabilities; that is, probabilities most indicative of junk or
	      mail.  The default is 15, but there's no obvious optimal setting
	      for this parameter; it depends in part on the average length  of
	      messages you receive.
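
	      The selection of extremal words can be sketched as below.
	      The way the selected probabilities are combined is not stated
	      on this page; the sketch assumes the usual naive Bayes
	      combination, and message_probability is a hypothetical name.

```python
from math import prod


def message_probability(word_probs, sigwords=15):
    """Combine the sigwords most extreme word probabilities.

    ASSUMPTION: naive Bayes combination, which may differ from what the
    program does. Each probability must lie strictly between 0 and 1.
    """
    # "Extremal" is taken to mean farthest from the neutral value 0.5.
    extremal = sorted(word_probs, key=lambda p: abs(p - 0.5),
                      reverse=True)[:sigwords]
    junk = prod(extremal)
    mail = prod(1.0 - p for p in extremal)
    return junk / (junk + mail)
```

	      Words absent from the dictionary would enter such a
	      calculation with the --newword probability.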

       --sloppyheaders
	      To  evade	 filtering  programs, some junk mail is sent with MIME
	      part headers which violate the  standard	but  which  most  mail
	      clients  accept  anyway.	This option causes such messages to be
	      parsed as a browser would, at the cost of standards  compliance.
	      If  --sloppyheaders  is  used,  it should be specified both when
	      building the dictionary and when testing messages.

       --statistics
	      After loading the dictionary from	 --mail	 and  --junk  folders,
	      print  statistics	 of  the distribution of junk probabilities of
	      words in the dictionary.	The statistics are written to standard
	      output.

       --test, -t fname
	      Test  mail  in  fname  and write the estimated probability it is
	      junk to standard output unless the --transcript option  is  also
	      specified	 with  standard	 output (``-'') as the destination, in
	      which case the inclusion of the probability  and	classification
	      in  the  transcript  is  adjudged	 sufficient.  If the --verbose
	      option is specified, the individual probabilities of the	``most
	      interesting''  words  in	the  message  will also be output.  If
	      fname is ``-'' the message is read from standard input.

       --threshjunk n
	      Set the threshold for classifying	 a  message  as	 junk  to  the
	      floating	point  probability  value n.  The default threshold is
	      0.9; messages scored at or above --threshjunk are deemed junk.

       --threshmail n
	      Set the threshold for classifying a message as legitimate mail
	      to the floating point probability value n.  The default
	      threshold is 0.9, with messages scored at or below
	      --threshmail deemed legitimate.  Note that you may leave a gap
	      between the --threshmail and --threshjunk values (although it
	      makes no sense to set --threshmail higher).  Mail scored
	      between the two thresholds will then be judged of uncertain
	      status.

       --transcript fname
	      Write an annotated transcript of the  original  message  to  the
	      specified	 fname.	  If fname is ``-'', the transcript is written
	      to standard output.  At the end of the message header, the
	      transcript adds an X-Annoyance-Filter-Junk-Probability header
	      item giving the computed probability and an
	      X-Annoyance-Filter-Classification item giving the
	      classification of the message according to the --threshmail
	      and --threshjunk settings; the classification is given as
	      ``Mail'', ``Junk'', or ``Indeterminate''.
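
	      A downstream program could read the two header items from a
	      transcript as in this sketch.  The header names come from the
	      description above; read_annotations is a hypothetical helper,
	      and the assumption that the probability header holds a bare
	      number should be checked against real output.

```python
from email import message_from_string


def read_annotations(transcript):
    """Return (probability, classification) from an annotated transcript.

    Either element is None when its header is absent.
    ASSUMPTION: the probability header value is a plain decimal number.
    """
    msg = message_from_string(transcript)
    probability = msg["X-Annoyance-Filter-Junk-Probability"]
    classification = msg["X-Annoyance-Filter-Classification"]
    return (float(probability) if probability is not None else None,
            classification)
```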

       --verbose, -v
	      Print  diagnostic	 information  as  the program performs various
	      operations.

       --version
	      Print program version information.

       --write fname
	      Write a dictionary to the file fname.  The dictionary is written
	      in  a  binary format which may be loaded on subsequent runs with
	      the --read option.  Binary dictionary files are  portable	 among
	      machines with different architectures and byte order.

EXIT STATUS
       The  program  exits  with a status of 0 when processing is successfully
       completed, 1 when an error (I/O or file access in most  cases)  occurs,
       and  2  to  indicate  a	command	 line syntax error.  If the --classify
       option is specified, an exit status of 0 identifies the message	tested
       as  legitimate  mail, 3 marks it as junk, and a status of 4 is returned
       for messages which cannot be confidently classified as either  mail  or
       junk.
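
       A caller that runs the program with --classify might interpret the
       exit status as in this sketch.  The function and folder names are
       hypothetical; only the status values come from this page.

```python
def folder_for_status(status):
    """Map a --classify exit status to a destination folder name."""
    folders = {0: "inbox", 3: "junk", 4: "review"}  # hypothetical folder names
    if status in folders:
        return folders[status]
    if status == 1:
        raise RuntimeError("annoyance-filter reported an I/O or file access error")
    if status == 2:
        raise RuntimeError("annoyance-filter reported a command line syntax error")
    raise RuntimeError(f"unexpected exit status {status}")
```

       Raising on statuses 1 and 2 keeps a failed run from being mistaken
       for a classification.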

FILES
       Files  are read or written as requested by options on the command line;
       all options which read or write files take a fname argument which gives
       the   file   name.    The   --classify,	--junk,	 --mail,  --test,  and
       --transcript  options  interpret	 an  argument  of  ``-''  as  denoting
       standard input or output.

       On systems which provide the required services and utilities, arguments
       to the --junk and --mail options may be compressed files or the name of
       a  directory  containing	 one or more messages which will be read as if
       logically concatenated.	Messages in the directory may be compressed or
       uncompressed.

       Error  messages	and  diagnostic	 output	 generated  when the --verbose
       option is specified are written to standard error.

BUGS
       Millions, doubtless.  This is a program which must cope	with  whatever
       garbage	is fed to it from mail folders, trying to make the best of it.
       When it messes up, your efforts in identifying the message which caused
       the  problem  and submitting a verbatim copy of it with your bug report
       are much appreciated.

       Please report bugs to bugs@fourmilab.ch and include annoyance-filter in
       the Subject line.  Thanks in advance.

AUTHOR
				     John Walker
			      http://www.fourmilab.ch/

       This software is in the public domain. Permission to use, copy, modify,
       and distribute this software and its documentation for any purpose  and
       without	fee is hereby granted, without any conditions or restrictions.
       This  software  is  provided  ``as  is''	 without  express  or  implied
       warranty.

SEE ALSO
       gnuplot(1), gs(1), gzip(1), netpbm(1), procmail(1), xpdf(1)

       annoyance-filter is written using the Literate Programming
       methodology (http://www.literateprogramming.com/); the user manual,
       program, and internal documentation are developed together, closely
       interlinked.  Whenever the program is modified, the documentation is
       automatically updated, reducing the risk of divergence between what
       the manual says and what the program does.

       This man page is intended as a reference for the command	 line  options
       and  most  common  applications	of  the	 program.   For	 comprehensive
       documentation, including details of how to  integrate  annoyance-filter
       with  the procmail mail processing system, please refer to the complete
       documentation published in PDF format, available on the Web at:
	    http://www.fourmilab.ch/annoyance-filter/annoyance-filter.pdf

       If you have downloaded the annoyance-filter  source  distribution,  the
       corresponding  version  of  annoyance-filter.pdf	 is  included  in  the
       archive.	 You can read PDF files with Acrobat reader (a	free  download
       from   http://www.adobe.com/acrobat/readstep.html)   or	 the  xpdf  or
       Ghostscript (gs) utilities.

4th Berkeley Distribution	  4 AUG 2004		   ANNOYANCE-FILTER(1)