SCANMAIL(8)
NAME
scanmail, testscan - spam filters
SYNOPSIS
upas/scanmail [ options ] [ qer-args ] root mail sender system rcpt-list
upas/testscan [ -avd ] [ -p patfile ] [ filename ]
DESCRIPTION
Scanmail accepts a mail message supplied on standard input, applies a
file of patterns to a portion of it, and dispatches the message based
on the results. It exactly replaces the generic queuing command qer(8)
that is executed from the rc(1) script /mail/lib/qmail in the mail pro‐
cessing pipeline. Associated with each pattern is an action in order
of decreasing priority:
dump the message is deleted and a log entry is written to
/sys/log/smtpd
hold the message is placed in a queue for human inspection
log a line containing the matching portion of the message is
written to a log
If no pattern matches or only patterns with an action of log match, the
message is accepted and scanmail queues the message for delivery.
Scanmail meshes with the blocking facilities of smtpd(6) to provide
several layers of filtering on gateway systems. In all cases the
sender is notified that the message has been successfully delivered,
leaving the sender unaware that the message has been potentially
delayed or deleted.
Scanmail accepts the arguments of qer(8) as well as the following:
-c Save a copy of each message in a randomly-named file in direc‐
tory /mail/copy.
-d Write debugging information to standard error.
-h Queue held messages by sending domain name. The -q option must
specify a root directory; messages are queued in subdirectories
of this directory. If the -h option is not specified, messages
are accumulated in a subdirectory of /mail/queue.hold named for
the contents of /dev/user, usually none.
-n Messages are never held for inspection, but are delivered. Also
known as vacation mode.
-p filename
Read the patterns from filename rather than /mail/lib/patterns.
-q holdroot
Queue deliverable messages in subdirectories of holdroot. This
option is the same as the -q option of qer(8) and must be
present if the -h option is given.
-s Save deleted messages. Messages are stored, one per randomly-
named file, in subdirectories of /mail/queue.dump named with the
date.
-t Test mode. The pattern matcher is applied but the message is
discarded and the result is not logged.
-v Print the highest priority match. This is useful with the -t
option for testing the pattern matcher without actually sending
a message.
Testscan is the command line version of scanmail. If filename is miss‐
ing, it applies the pattern set to the message on standard input.
Unlike scanmail, which finds the highest priority match, testscan
prints all matches in the portion of the message under test. It is
useful for testing a pattern set or implementing a personal filter
using the pipeto file in a user's mail directory. Testscan accepts the
following options:
-a Print matches in the complete input message.
-d Enable debug mode.
-v Print the message after conversion to canonical form (q.v.).
-p filename
Read the patterns from filename rather than /mail/lib/patterns.
Canonicalization
Before pattern matching, both programs convert a portion of the message
header and the beginning of the message to a canonical form. The
amounts of the header and message body processed are set by compile-time
parameters in the source files. The canonicalization process converts
letters to lower-case and replaces consecutive spaces, tabs and newline
characters with a single space. HTML commands are deleted except for
the parameters following A HREF, IMG SRC, and IMG BORDER directives.
Additionally, the following MIME escape sequences are replaced by their
ASCII equivalents:
Escape Seq    ASCII
-------------------
=2e           .
=2f           /
=20           <space>
=3d           =
and the sequence =<newline> is elided. Scanmail assembles the sender,
destination domain and recipient fields of the command line into a
string that is subjected to the same canonical processing. Following
canonicalization, the command line and the two long strings containing
the header and the message body are passed to the matching engine for
analysis.
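The text transformations above can be illustrated with a minimal sketch. It is not the scanmail source: the function name canonicalize and the order of the steps are assumptions, HTML handling is omitted, and only the four escapes listed above are decoded.

```python
import re

# The four MIME escapes listed in the table above, mapped to ASCII.
MIME_ESCAPES = {"=2e": ".", "=2f": "/", "=20": " ", "=3d": "="}

def canonicalize(text: str) -> str:
    """Approximate the canonical form (hypothetical helper; HTML not handled)."""
    text = text.lower()                 # letters to lower-case
    text = text.replace("=\n", "")      # elide the =<newline> sequence
    # Decode the listed escapes in one pass so a decoded '=' is never
    # reinterpreted as the start of another escape.
    text = re.sub(r"=(?:2e|2f|20|3d)", lambda m: MIME_ESCAPES[m.group(0)], text)
    # Replace consecutive spaces, tabs and newlines with a single space.
    return re.sub(r"[ \t\n]+", " ", text)
```

Lower-casing first means that upper-case escapes such as =2E are decoded by the same table.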
Pattern Syntax
The matching engine compiles the pattern set and matches it to each
canonicalized input string. Patterns are specified one per line as
follows:
{*}action: pattern-spec {~~override...~~override}
On all lines, a # introduces a comment; there is no way to escape this
character.
Lines beginning with * contain a pattern-spec that is a string; otherwise,
the pattern-spec is a regular expression in the style of regexp(6).
Regular expression matching is many times less efficient than
string matching, so it is wiser to enumerate several similar strings
than to combine them into a regular expression. The action is a key‐
word terminated by a : and separated from the pattern by optional
white-space. It must be one of the following:
dump if the pattern matches, the message is deleted. If the -s
command line option is set, the message is saved.
hold if the pattern matches, the message is queued in a subdirec‐
tory of /mail/queue.hold for manual inspection. After
inspection, the queue can be swept manually using runq (see
qer(8)) to deliver messages that were inadvertently matched.
header this is the same as the hold action, except the pattern is
only applied to the message header. This optimization is
useful for patterns that match header fields that are
unlikely to be present in the body of the message.
line the sender and a section of the message around the match are
written to the file /sys/log/lines. The message is always
delivered.
loff patterns of this type are applied only to the canonicalized
command line. When a match occurs, all patterns with line
actions are disabled. This is useful for limiting the size
of the log file by excluding repetitive messages, such as
those from mailing lists.
Patterns are accumulated into pattern sets sharing the same action.
The matching engine applies the dump pattern set first, then the header
and hold pattern sets, and finally the line pattern set. Each pattern
set is applied three times: to the canonicalized command line, to the
message header, and finally to the message body. The ordering of pat‐
terns in the pattern file is insignificant.
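The evaluation order described above can be sketched as follows. This is an illustration under stated assumptions, not the matching engine itself: the name highest_match and the dictionary layout are hypothetical, only string patterns are modelled (no regular expressions), and the inputs are assumed to be canonicalized already.

```python
# Order in which pattern sets are tried, highest priority first.
PRIORITY = ["dump", "header", "hold", "line"]

def highest_match(patterns, cmdline, header, body):
    """Return (action, pattern) for the highest-priority match, or None.

    `patterns` maps an action name to a list of plain substrings.
    """
    # A loff match on the command line disables every line pattern.
    line_off = any(p in cmdline for p in patterns.get("loff", []))
    for action in PRIORITY:
        if action == "line" and line_off:
            continue
        # header patterns are applied to the message header only; the other
        # sets are applied to the command line, the header, then the body.
        targets = [header] if action == "header" else [cmdline, header, body]
        for text in targets:
            for pat in patterns.get(action, []):
                if pat in text:
                    return action, pat
    return None
```

Because the sets are tried in a fixed order, the position of a pattern in the file does not affect the result.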
The pattern-spec is a string of characters terminated by a newline, #
or override indicator, ~~. Trailing white-space is deleted but pat‐
terns containing leading or trailing white-space can be enclosed in
double-quote characters. A pattern containing a double-quote must be
enclosed in double-quote characters, and each embedded double-quote must
be preceded by a backslash. For
example, the pattern
"this is not \"spam\""
matches the string this is not "spam". The pattern-spec is followed by
zero or more override strings. When the specific pattern matches, each
override is applied and if one matches, it cancels the effect of the
pattern. Overrides must be strings; regular expressions are not sup‐
ported. Each override is introduced by the string ~~ and continues
until a subsequent ~~, # or newline, white-space included. A ~~ imme‐
diately followed by a newline indicates a line continuation and further
overrides continue on the following line. Leading white-space on the
continuation line is ignored. For example,
*hold: sex.com~~essex.com~~sussex.com~~sysex.com~~
lasex.com~~cse.psu.edu!owner-9fans
matches all input containing the string sex.com except for messages
that also contain the strings in the override list. Often it is desir‐
able to override a pattern based on the name of the sender or recipi‐
ent. For this reason, each override pattern is applied to the header
and the command line as well as the section of the canonicalized input
containing the matching data. Thus a pattern matching the command line
or the header searches both the command line and the header for over‐
rides while a match in the body searches the body, header and command
line for overrides.
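The entry syntax and the override rule can be sketched together. This is a simplified reading of the description above, not the real parser: the names parse_pattern and effective_match are hypothetical, patterns are treated as plain strings, and a quoted pattern-spec containing # or ~~ is not handled.

```python
def parse_pattern(entry: str):
    """Split one logical entry into (is_string, action, spec, overrides)."""
    logical = ""
    for ln in entry.split("\n"):
        ln = ln.split("#", 1)[0]    # '#' starts a comment; it cannot be escaped
        # A "~~" at the end of a line continues the entry on the next line,
        # whose leading white-space is ignored.
        logical += ln.lstrip() if logical.endswith("~~") else ln
    is_string = logical.startswith("*")
    action, rest = logical.lstrip("*").split(":", 1)
    spec, *overrides = rest.split("~~")
    spec = spec.strip()
    if len(spec) >= 2 and spec[0] == '"' and spec[-1] == '"':
        spec = spec[1:-1].replace('\\"', '"')   # unquote and unescape \"
    overrides = [o for o in overrides if o]     # drop empty pieces
    return is_string, action.strip(), spec, overrides

def effective_match(spec, overrides, matched_in, cmdline, header, body):
    """True if spec matches section `matched_in` and no override cancels it."""
    sections = {"cmdline": cmdline, "header": header, "body": body}
    if spec not in sections[matched_in]:
        return False
    # Overrides are searched in the command line and header, and also in
    # the body when the match itself was found in the body.
    scope = [cmdline, header]
    if matched_in == "body":
        scope.append(body)
    return not any(o in text for o in overrides for text in scope)
```

Applied to the two-line example above, parse_pattern yields the spec sex.com with five overrides, and a body that mentions only essex.com is matched by the spec but cancelled by the override.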
The structure of the pattern file and the matching algorithm define the
strategy for detecting and filtering unwanted messages. Ideally, a
hold pattern selects a message for inspection and if it is determined
to be undesirable, a specific dump pattern is added to delete further
instances of the message. Additionally, it is often useful to block
the sender by updating the smtpd control file.
In this regime, patterns with a dump action generally match phrases
that are likely to be unique. Patterns that hold a message for inspec‐
tion match phrases commonly found in undesirable material and occasion‐
ally in legitimate messages. Patterns that log matches are less spe‐
cific yet. In all cases the ability to override a pattern by matching
another string allows repetitive messages that trigger the pattern,
such as mailing lists, to pass the filter after the first one is pro‐
cessed manually. The -s option allows deleted messages to be salvaged
by either manual or semi-automatic review, supporting the specification
of more aggressive patterns. Finally, the utility of the pattern
matcher is not confined to filtering spam; it is a generally useful
administrative tool for deleting inadvertently harmful messages, for
example, mail loops, stuck senders or viruses. It is also useful for
collecting or counting messages matching certain criteria.
FILES
/mail/lib/patterns
default pattern file
/sys/log/smtpd
log of deleted messages
/mail/log/lines
file where log matches are logged
/mail/queue/*
directories where legitimate messages are queued for delivery
/mail/queue.hold
directory where held messages are queued for inspection
/mail/queue.dump/*
directory where dumped messages are stored when the -s command
line option is specified.
/mail/copy/*
directory where copies of all incoming messages are stored.
SOURCE
/sys/src/cmd/upas/scanmail
SEE ALSO
mail(1), qer(8), smtpd(6)
BUGS
Testscan does not report a match when the body of a message contains
exactly one line.