cli for Ham/Spam detection of mail subjects using a trained bayesian filter

thc2cat/subayes

Subayes

subayes is a command-line tool that uses a Bayesian filter to classify email subjects as either "Ham" (legitimate) or "Spam" (unsolicited/junk).

It learns from a user-provided training dataset of pre-classified subjects, building a statistical model of word frequencies associated with each category.

Once trained, subayes can analyze new, unseen subjects and provide a probability score indicating the likelihood of them being spam.

This allows mail admins to quickly identify users sending spam based on their email subjects.

The tool offers options for training the filter, classifying individual subjects, and even batch processing a list of subjects from a file, making it a versatile solution for the discovery of spam leaks.

subayes is a naive Bayes classifier for mail subjects.

The Bayesian classification itself is done by the Go library jbrukh/bayesian.

Context

Spammers use many different subjects, sometimes with misspellings and garbage characters.

The purpose of this project is a basic classifier that detects spam from mail subjects better than grep.

subayes reads lines from stdin and writes each one to stdout prefixed with "Spam: " or "Ham: ".

The training db matters a great deal: unknown words are assigned to the class with the most learned items.
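Why unknown words fall into the most learned class can be seen in a small, self-contained naive Bayes sketch. It is not the code of subayes or of jbrukh/bayesian: the `model` type, the `floor` probability for unseen words, and priors proportional to the number of learned words are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// floor is the probability given to a word never seen in a class.
// The value is an assumption of this sketch.
const floor = 1e-11

// model holds per-class word counts learned from training subjects.
type model struct {
	counts map[string]map[string]int // class -> word -> occurrences
	totals map[string]int            // class -> total words learned
}

func newModel() *model {
	return &model{counts: map[string]map[string]int{}, totals: map[string]int{}}
}

// learn adds the words of one pre-classified subject to a class.
func (m *model) learn(class string, words []string) {
	if m.counts[class] == nil {
		m.counts[class] = map[string]int{}
	}
	for _, w := range words {
		m.counts[class][w]++
		m.totals[class]++
	}
}

// classify returns the class with the highest log score:
// log(prior) + sum over words of log(word probability in that class).
func (m *model) classify(words []string) string {
	all := 0
	for _, t := range m.totals {
		all += t
	}
	best, bestScore := "", math.Inf(-1)
	for _, class := range []string{"Ham", "Spam"} {
		if m.totals[class] == 0 {
			continue // nothing learned for this class
		}
		score := math.Log(float64(m.totals[class]) / float64(all))
		for _, w := range words {
			p := float64(m.counts[class][w]) / float64(m.totals[class])
			if p == 0 {
				p = floor // unseen word: same penalty in every class
			}
			score += math.Log(p)
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	m := newModel()
	m.learn("Ham", []string{"grupo", "trabajo", "please", "meeting", "report"})
	m.learn("Spam", []string{"sexy", "women", "neighborhood"})
	fmt.Println(m.classify([]string{"sexy", "women"})) // prints Spam
	fmt.Println(m.classify([]string{"zzzz", "qqqq"}))  // prints Ham
}
```

In the second call every word is unseen, so both classes receive the same penalty and only the prior differs. Ham has more learned words in this toy data, so Ham wins.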

Basics

## Building
$ go mod tidy && go build 

## Default options
$ subayes -h
Usage of subayes:
  -E    explain words scores
  -d string
        data filename (default "subayes.spam")
  -db string
         db path (default "db")
  -learnHam
        learn Ham subjects
  -learnSpam
        learn Spam subjects
  -m int
        word min length (default 4)
  -v    verbose


## Learning
$ rm -f db/Spam db/Ham ; mkdir -p db
$ ./subayes  -learnHam -d testdata/Ham -v
INFO classifier corpus :  [ Ham -> 0 items ]
INFO classifier corpus :  [ Ham -> 4623 items ]
$ ./subayes  -learnSpam -d testdata/esteban.txt -v
INFO classifier corpus :  [ Spam -> 0 items ]
INFO classifier corpus :  [ Spam -> 1096 items ]

## Testing 
$ echo "mensaje al grupo de trabajo please" | subayes
Ham: mensaje al grupo de trabajo please

$ echo "View sexy women in your neighborhood" | subayes
Spam: View sexy women in your neighborhood


## Evaluating word scores
$ echo "mensaje al grupo de trabajo please" | subayes -E    
[ mensaje = Spam ] : [Ham]{ 0.4000 } [Spam]{ 0.6000 } 
[ grupo = Ham ] : [Ham]{ 0.5096 } [Spam]{ 0.4904 } 
[ trabajo = Ham ] : [Ham]{ 0.6667 } [Spam]{ 0.3333 } 
[ please = Ham ] : [Ham]{ 0.6667 } [Spam]{ 0.3333 } 
Ham: mensaje al grupo de trabajo please

## Raw test from v0.1
$ ./subayes.exe < testdata/2023-05 |cut -d: -f1|sort|uniq -c
 176347 Ham
  57102 Spam

Meaning at least 24% of these subjects are Spam!

Common usage

Use the utf8submimedecode filter to decode UTF-8 encoded subject lines.

ex-pat contains patterns for lines to ignore (such as Spam, [PUB], or already detected users).

subjects.sed is a simple sed script that extracts subjects from log lines.

subayes creates two files in db/: Spam and Ham.

Each time you find a spammer, learn their subjects as Spam, then check the updated db against previously clean data to find and correct false positives.

# Detection from clamav logs

logs/partage$ rg -z clamav  sftp_logs/$LOGDATE/*clamav.log* \
| rg -vf ex-pat | sed -f subjects.sed  | utf8submimedecode \
| sort -u | subayes | rg ^Spam \
| tee  subayes.spam | mail -E -s "[subayes detection]" postmaster

# To see which words in a line are tagged as Spam,
# use the "-E" explain option (its output goes to stderr).

$ subayes -E < subayes.spam  

# Learning more Ham words:
# when you have false positives, edit subayes.spam so that it keeps only those lines, then relearn them as Ham.

logs/partage$ subayes -v -learnHam -d subayes.spam
# (-d is optional; subayes.spam is the default data file)

# Efficiency :

logs/partage$  subayes < /tmp/Hacked-account-Subjects \
| cut -d: -f1 | sort | uniq -c
5658 Ham
39016 Spam
# (meaning 87% detection, without false positives, on filtered subjects)

Next move

Use this db in a postfix milter that would defer mail with these subjects?
