
Teaching the system to recognize spam or not-spam

[ Classic Linux. ]

(Analogous information will be made available in the near future for the DirectAdmin environment.)

The spam filtering in Classic Linux includes a self-learning component. As spam arrives, combinations of words in the body are converted into tokens,1) and these tokens are stored in a central database. As the nature of spam evolves, so does the database, so it always reflects recent trends in spam. And whenever the strings inside an arriving email seem to match tokens in the database, the spam score assigned to the email is increased to reflect the certainty of this match.

The general term for this is Bayesian filtering2).
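To make the idea of a token concrete, here is a toy sketch in shell. It is not how Rspamd or SpamAssassin tokenize mail: the function name tokenize is made up for this page, it works on single words rather than word combinations, and it uses cksum as a stand-in for whatever encoding the real filters use. It only shows the principle that the same word always turns into the same short lookup key.

```shell
# Toy tokenizer (illustration only, not the real filters' algorithm):
# lower-case the text, split it into words, and print one
# checksum-based token per distinct word, followed by the word itself.
tokenize() {
    tr '[:upper:]' '[:lower:]' |
        tr -cs '[:alnum:]' '\n' |
        sort -u |
        while IFS= read -r word; do
            [ -n "$word" ] || continue
            # cksum stands in for the real encoding step (an assumption).
            token=$(printf '%s' "$word" | cksum | cut -d ' ' -f 1)
            printf '%s %s\n' "$token" "$word"
        done
}

# Four distinct words (buy, cheap, now, pills) give four tokens.
printf 'Cheap pills! Buy cheap pills now.\n' | tokenize
```

Because a token depends only on the word, a later message containing "pills" produces the same token and can be looked up directly in the database.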

If you look inside the full headers of any email, you will sometimes see strings beginning with BAYES in all caps. The spam score assigned to the email is increased or decreased based on the certainty of the Bayesian filtering system.

Example string | Meaning
BAYES_50 | Spam with 50% certainty
BAYES_99 | Spam with 99% certainty
BAYES_999 | Spam with 99.9% certainty
2.0 BAYES_80 BODY: Bayes spam probability is 80 to 95% [score: 0.8325] | Spam with 83.25% certainty, spam score increased by 2.0
BAYES_HAM(-2.73) [98.80%] | Not-spam with 98.8% certainty, spam score decreased by 2.73
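If you have saved a message to a file with its full headers, a quick way to see which BAYES strings it carries is to search only the header part. The sketch below is one way to do that. The function name bayes_symbols is made up for this page, and it assumes the file uses Unix line endings and that your grep supports the -o option (GNU and BSD grep do).

```shell
# List the distinct BAYES_* strings found in the headers of a saved message.
# Usage: bayes_symbols FILE
bayes_symbols() {
    # The headers end at the first blank line; sed stops reading there,
    # so BAYES-like strings in the message body are ignored.
    sed '/^$/q' "$1" | grep -oE 'BAYES_[A-Z0-9]+' | sort -u
}
```

For example, bayes_symbols saved-message.eml prints one BAYES string per line, or nothing if the headers contain none.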

How the system learns

Suppose the system can't recognize the message text as spam or not-spam, but other properties of a message, e.g., sender, IP addresses, well-known spam domains, type of HTML, and color of fonts, clearly identify it as spam or not-spam. The strings inside the body are now tokenized and added into the database, so future emails can be recognized as spam or not-spam based on these tokens.

If a message is not unambiguously identified as spam or not-spam then its contents don't go into the Bayesian database.
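The rule described above can be sketched as a small decision function. This is a simplification, not the site's actual configuration: the function name autolearn_decision is made up, the input is assumed to be the score from the non-Bayesian checks alone, and the default thresholds (0.1 and 12.0) are SpamAssassin's stock auto-learn defaults, which may well differ from what is used here.

```shell
# Decide whether a message would be auto-learned, given the score it got
# from the non-Bayesian checks.
# Usage: autolearn_decision SCORE [HAM_THRESHOLD] [SPAM_THRESHOLD]
autolearn_decision() {
    # Thresholds are assumptions (SpamAssassin's stock defaults).
    awk -v s="$1" -v ham="${2:-0.1}" -v spam="${3:-12.0}" 'BEGIN {
        if (s + 0 > spam + 0)      print "learn as spam"
        else if (s + 0 < ham + 0)  print "learn as ham"
        else                       print "do not learn"
    }'
}

autolearn_decision 15.2   # clearly spam: tokens go in as spam
autolearn_decision -1.0   # clearly not-spam: tokens go in as ham
autolearn_decision 4.5    # ambiguous: nothing is learned
```

The wide gap between the two thresholds is the point: anything in between is left for people to classify.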

This is where you can help. The spam-filtering system didn't recognize spam, but you the recipient did. Now you can teach the system by telling it that this message was spam.

Likewise, if the system falsely identified a message as spam, but you know it is not, you can teach the system by telling it that this message was not spam.

A common word for not-spam in the Bayesian spam filtering world is: ham.

How you can teach the system

Simply select a bunch of spam messages, and forward them as an attachment to the spam-recognizing email address: spamtrap@rahul.net.

Or select a bunch of ham (not-spam) messages, and forward them as an attachment to the ham-recognizing email address: hamtrap@rahul.net.

These email addresses will automatically tokenize the contents of whatever they get, and add these tokens into the database as spam or ham tokens respectively.

Ham tokens are important. They let the Bayesian system learn how to recognize legitimate email, thus subtracting from its spam score, and making it less likely to be erroneously classified as spam. Make it a point to send at least as much ham as you send spam.

If you are forwarding a single message to spamtrap@rahul.net or hamtrap@rahul.net, you can send it as an attachment or you can use the bounce or resend feature of some mail clients. All of these are equally good.

A normal forward (not as an attachment) often does not include the complete headers, so it might not be as useful (but it's still better than nothing).

Budget your time. Life is short. Spend no more than, say, 2–3 minutes per week sending spam or ham. You can do it once a week, or you could spend just 5–10 seconds a day quickly forwarding two or three selected messages.

The most useful messages with which to teach the system are those messages that the system recognized incorrectly. If the system already correctly recognized spam or ham, it won't learn a lot. If the system incorrectly recognized ham as spam or spam as ham, that is where it most needs to learn.

If you accidentally send spam to the hamtrap address, or vice versa, just send the same content again this time to the correct address. The system will unlearn the incorrect content and re-learn it the right way.

Learning occurs in small increments. The system wants a steady diet of mistaken ham and mistaken spam.

Teaching the system from the command line

The Linux shell commands spamtrap and hamtrap can be used. These are simple shell scripts (inside the directory /usr/local/bin/) that mail whatever you feed them to the spamtrap@rahul.net or hamtrap@rahul.net address.

If a message is in a file, give the filename as an argument. Multiple files are accepted.

Or you can feed these commands a message on their standard input.
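As a rough sketch of how this might look in practice, the helper below wraps the two commands so that missing or empty files are skipped instead of being mailed. The function name teach and the file names in the comments are made up for illustration. The only things taken from this page are that spamtrap and hamtrap accept one or more filenames, or a message on standard input.

```shell
# Report saved messages through the spamtrap or hamtrap command.
# Usage: teach spam FILE...   or   teach ham FILE...
teach() {
    kind=$1
    shift
    case $kind in
        spam) cmd=spamtrap ;;
        ham)  cmd=hamtrap ;;
        *)    echo "usage: teach spam|ham FILE..." >&2; return 2 ;;
    esac
    for file in "$@"; do
        if [ -s "$file" ]; then
            "$cmd" "$file"
        else
            # Nothing useful to learn from a missing or empty file.
            echo "skipping missing or empty file: $file" >&2
        fi
    done
}

# Examples (hypothetical file names):
#   teach spam missed-spam-1.eml missed-spam-2.eml
#   hamtrap < wrongly-flagged.eml     # plain standard-input form
```

The wrapper is optional. Calling spamtrap or hamtrap directly with filenames does the same job.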

Sending the same thing twice does no harm. The system will recognize duplicate content and not learn from it twice.

If you accidentally feed them spam instead of ham or vice versa, just repeat with the correct command. The system will unlearn the incorrect content and re-learn it the right way.

Debugging the learning

You can use the following Linux shell commands to see whether the system is actually learning. There are two components to the system, Rspamd and SpamAssassin, and both work in sequence to identify spam. The Bayesian learning needs to be done by both. (The email addresses spamtrap@rahul.net and hamtrap@rahul.net feed their input into both systems.) You can use the shell commands below to interact with them.

Purpose | Linux command
Teach Rspamd that a message is spam | rspamc learn_spam filename
… with verbose output | rspamc -v learn_spam filename
Teach Rspamd that a message is ham | rspamc learn_ham filename
… with verbose output | rspamc -v learn_ham filename
Teach SpamAssassin that a message is spam | sa-learn --spam filename
… with verbose output | sa-learn -D --spam filename
Teach SpamAssassin that a message is ham | sa-learn --ham filename
… with verbose output | sa-learn -D --ham filename

All these commands will recognize duplicate content and not learn from it twice.
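One way to check that learning (or duplicate detection) really happened is to compare the database's message counts before and after a learn command. The sketch below summarizes the output of sa-learn --dump magic, a standard SpamAssassin command that is not otherwise covered on this page. The function name bayes_counts is made up, and whether your account is allowed to read the Bayes database on this system is an assumption.

```shell
# Summarize how many spam and ham messages the SpamAssassin Bayes
# database has learned. Reads "sa-learn --dump magic" output on stdin.
bayes_counts() {
    awk '/non-token data: nspam$/ { spam = $3 }
         /non-token data: nham$/  { ham = $3 }
         END { printf "spam=%d ham=%d\n", spam, ham }'
}

# Typical use (needs read access to the Bayes database):
#   sa-learn --dump magic | bayes_counts
#   sa-learn --spam filename
#   sa-learn --dump magic | bayes_counts
```

If the same file is learned a second time, the counts should stay the same. For Rspamd, rspamc stat prints comparable statistics.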

These commands (unlike spamtrap and hamtrap) will also take one or more directory names, and they will process every file in each directory.

If no filename is specified, these commands will read from their standard input.

If you accidentally feed them spam instead of ham or vice versa, just repeat with the correct command. The system will unlearn the incorrect content and re-learn it the right way.
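Since both Rspamd and SpamAssassin need to learn, it can be handy to feed the same files or directories to both in one step. The following is a minimal sketch, with learn_both as a made-up name. It insists on at least one file or directory argument, because with standard input the first command would consume the message and leave nothing for the second.

```shell
# Teach both Rspamd and SpamAssassin from the same files or directories.
# Usage: learn_both spam FILE_OR_DIR...   or   learn_both ham FILE_OR_DIR...
learn_both() {
    kind=$1
    case $kind in
        spam|ham) shift ;;
        *) echo "usage: learn_both spam|ham FILE_OR_DIR..." >&2; return 2 ;;
    esac
    if [ "$#" -eq 0 ]; then
        echo "learn_both: give at least one file or directory" >&2
        return 2
    fi
    status=0
    rspamc "learn_$kind" "$@" || status=1   # Rspamd
    sa-learn "--$kind" "$@" || status=1     # SpamAssassin
    return "$status"
}
```

For example, learn_both ham wrongly-flagged.eml runs rspamc learn_ham and sa-learn --ham on the same (hypothetical) file, and returns a non-zero status if either command fails.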

1)
A token in this context is the encoded form of a string such that the token can be inserted into a database and found later without having to search for the original string.
2)
Actually Bayesian filtering is a mathematical term with a precise meaning. In the area of spam filtering, this term is used rather loosely to refer to spam detection by extracting strings from within the body of a message and looking them up in a database in a statistical manner.
hints/teaching_system_recognize_spam_not.1613920987.txt.gz · Last modified: 2021/02/21 07:23 by admin