subayes is a command-line tool that uses a Bayesian filter to classify email subjects as either "Ham" (legitimate) or "Spam" (unsolicited/junk).
It learns from a user-provided training dataset of pre-classified subjects, building a statistical model of word frequencies associated with each category.
Once trained, subayes can analyze new, unseen subjects and provide a probability score indicating the likelihood of them being spam.
This allows mail admins to quickly identify users sending spam based on their email subjects.
The tool offers options for training the filter, classifying individual subjects, and even batch processing a list of subjects from a file, making it a versatile solution for the discovery of spam leaks.
This is a naive Bayesian classifier for mail subjects.
The Bayesian work is done with the Go library jbrukh/bayesian.
Spammers use a lot of different subjects, sometimes with misspellings and garbage.
The purpose of this project is a basic classifier that detects spam from mail subjects better than grep.
subayes reads lines from stdin and writes them to stdout prefixed with "Spam: " or "Ham: ".
The training db is really important: unknown words are classified with the most-learned class.
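
To illustrate why the training db matters, here is a minimal naive Bayes sketch using only the Go standard library. It is not the jbrukh/bayesian implementation and not the subayes source: the `model` type, its methods, and the add-one smoothing are assumptions chosen for illustration. It shows how a subject made only of unknown words ends up in whichever class has learned the most.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// model is a hypothetical stand-in for the data subayes keeps in its db.
type model struct {
	counts map[string]map[string]int // class -> word -> occurrences
	total  map[string]int            // class -> number of words learned
	vocab  map[string]bool           // every word seen in any class
}

func newModel() *model {
	return &model{
		counts: map[string]map[string]int{"Ham": {}, "Spam": {}},
		total:  map[string]int{},
		vocab:  map[string]bool{},
	}
}

// learn adds each word of subject to class, which must be "Ham" or "Spam".
func (m *model) learn(class, subject string) {
	for _, w := range strings.Fields(subject) {
		m.counts[class][w]++
		m.total[class]++
		m.vocab[w] = true
	}
}

// classify returns the class with the highest log score.
func (m *model) classify(subject string) string {
	learned := m.total["Ham"] + m.total["Spam"]
	best, bestScore := "Ham", math.Inf(-1)
	for _, class := range []string{"Ham", "Spam"} {
		// Prior: share of all learned words that belong to this class.
		score := math.Log(float64(m.total[class]) / float64(learned))
		for _, w := range strings.Fields(subject) {
			if !m.vocab[w] {
				continue // unknown word: no evidence, so the prior decides
			}
			// Add-one smoothed likelihood of the word in this class.
			score += math.Log(float64(m.counts[class][w]+1) /
				float64(m.total[class]+len(m.vocab)))
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	m := newModel()
	m.learn("Ham", "meeting notes for the project")
	m.learn("Ham", "project status report")
	m.learn("Ham", "lunch on friday")
	m.learn("Spam", "cheap pills online")
	for _, s := range []string{"cheap pills", "project report", "zzzz qqqq"} {
		fmt.Printf("%s: %s\n", m.classify(s), s)
	}
}
```

In this toy corpus more Ham than Spam was learned, so the all-unknown subject `zzzz qqqq` is labelled Ham, while `cheap pills` is labelled Spam.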
## Building
$ go mod tidy && go build
## Default options
$ subayes -h
Usage of subayes:
-E explain words scores
-d string
data filename (default "subayes.spam")
-db string
db path (default "db")
-H
learn Ham subjects
-S
learn Spam subjects
-m int
word min length (default 4)
-v verbose
## Learning
$ rm -f db/Spam db/Ham ; mkdir -p db
$ ./subayes -learnHam -d testdata/Ham -v
INFO classifier corpus : [ Ham -> 0 items ]
INFO classifier corpus : [ Ham -> 4623 items ]
$ ./subayes -learnSpam -d testdata/esteban.txt -v
INFO classifier corpus : [ Spam -> 0 items ]
INFO classifier corpus : [ Spam -> 1096 items ]
## Testing
$ echo "mensaje al grupo de trabajo please" | subayes
Ham: mensaje al grupo de trabajo please
$ echo "View sexy women in your neighborhood" | subayes
Spam: View sexy women in your neighborhood
## Evaluating word scores
$ echo "mensaje al grupo de trabajo please" | subayes -E
[ mensaje = Spam ] : [Ham]{ 0.4000 } [Spam]{ 0.6000 }
[ grupo = Ham ] : [Ham]{ 0.5096 } [Spam]{ 0.4904 }
[ trabajo = Ham ] : [Ham]{ 0.6667 } [Spam]{ 0.3333 }
[ please = Ham ] : [Ham]{ 0.6667 } [Spam]{ 0.3333 }
Ham: mensaje al grupo de trabajo please
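
In the `-E` output above, the short words `al` and `de` do not appear, which is consistent with the default minimum word length of 4 (`-m`). The sketch below shows one way such a filter could work. The function name `keepWords` is hypothetical, and splitting on whitespace and counting runes rather than bytes are assumptions, not taken from the subayes source.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// keepWords splits subject on whitespace and drops words shorter than
// minLen characters. Counting runes (not bytes) is an assumption.
func keepWords(subject string, minLen int) []string {
	var kept []string
	for _, w := range strings.Fields(subject) {
		if utf8.RuneCountInString(w) >= minLen {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	fmt.Println(keepWords("mensaje al grupo de trabajo please", 4))
}
```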
## Raw test from v0.1
$ ./subayes.exe < testdata/2023-05 |cut -d: -f1|sort|uniq -c
176347 Ham
57102 Spam
Meaning at least 24% Spam!
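
The percentage comes from 57102 / (176347 + 57102). As a rough Go equivalent of the `cut | sort | uniq -c` tally, the sketch below counts the prefixes on stdin and prints the Spam share. The names `tally` and `spamShare` are hypothetical, and unlike `cut` this version skips lines that contain no `:`.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// tally counts lines by the label found before the first ':'.
func tally(r io.Reader) (map[string]int, error) {
	counts := map[string]int{}
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if label, _, found := strings.Cut(sc.Text(), ":"); found {
			counts[label]++
		}
	}
	return counts, sc.Err()
}

// spamShare returns the percentage of Spam among Ham and Spam lines.
func spamShare(counts map[string]int) float64 {
	total := counts["Ham"] + counts["Spam"]
	if total == 0 {
		return 0
	}
	return 100 * float64(counts["Spam"]) / float64(total)
}

func main() {
	counts, err := tally(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("Ham=%d Spam=%d spam share=%.1f%%\n",
		counts["Ham"], counts["Spam"], spamShare(counts))
}
```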
Use the utf8submimedecode filter to decode UTF-8 encoded subject lines.
ex-pat contains patterns for lines to ignore (such as Spam, [PUB], or already detected users).
subjects.sed is a simple sed script that extracts subjects from log lines.
subayes creates two files in db/: Spam and Ham.
Each time you find a spammer, learn their subjects as spam, then check the updated db against previous clean data to adjust for false positives.
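
The verification step can be sketched as a small check that runs known-clean subjects through a classifier and lists the ones now flagged as Spam. The helper `falsePositives` and the toy classifier in `main` are hypothetical and only illustrate the idea; with subayes itself the same check is done by piping the clean data through the tool.

```go
package main

import "fmt"

// falsePositives returns the clean subjects that classify labels "Spam".
func falsePositives(clean []string, classify func(string) string) []string {
	var flagged []string
	for _, s := range clean {
		if classify(s) == "Spam" {
			flagged = append(flagged, s)
		}
	}
	return flagged
}

func main() {
	// Toy classifier for demonstration only; it is not how subayes decides.
	labels := map[string]string{
		"weekly report":    "Ham",
		"invoice attached": "Spam",
	}
	toy := func(s string) string { return labels[s] }

	clean := []string{"weekly report", "invoice attached"}
	for _, s := range falsePositives(clean, toy) {
		fmt.Println("relearn as Ham:", s)
	}
}
```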
# Detection from clamav logs
logs/partage$ rg -z clamav sftp_logs/$LOGDATE/*clamav.log* \
| rg -vf ex-pat | sed -f subjects.sed | utf8submimedecode \
| sort -u | subayes | rg ^Spam \
| tee subayes.spam | mail -E -s "[subayes detection]" postmaster
# To see which words in a line are tagged as Spam,
# use the "-E" explain option (printed on stderr).
$ subayes -E < subayes.spam
# Learning more Ham words:
# when you have false positives, edit subayes.spam and relearn:
logs/partage$ subayes -v -learnHam -d subayes.spam
(-d is optional; subayes.spam is the default data file)
# Efficiency:
logs/partage$ subayes < /tmp/Hacked-account-Subjects \
| cut -d: -f1 | sort | uniq -c
5658 Ham
39016 Spam (meaning 87% detection without false positives from filtered subjects)
Could this db be used for a Postfix milter that would defer these subjects?