Neural-network text classifier for SPAM-filtering witten in python3, using Milter API for MTA integration. This is a complete rewrite of https://github.com/gereoffy/deepspam1 using PyTorch and SPM tokenizer.
Release contains pre-trained model for detecting Hungarian & English language spam emails.
- deepspam1 tf2-keras model ported to pytorch
- new model using SentencePiece and embedding from pretrained XLM
- new metrics for model eval, optimized for spam filtering
- dataset found to be wrong, requires a full review and cleanup :(
- old mailer3 unable to handle dataset properly (utf8, bad html etc)
- mailer3 ported to python3 -> mailer4 initial version (same ui/keys)
- wide unicode display issues -> wcwidth/widechars added/implemented...
- old html2txt used in deepspam1 found to be wrong... new html parser developed!
- python's mime email parser found to be sloooow and sometimes broken -> implemented my own
- mailer4: integrated deepspam model evaluation, see Screenshot-mailer4-diffmode.png
- mailer4: added search, selection, deduplication, tagging features - tested on real data
- better charmap filtering (charfilter6.py & unicodes6x.map) to remove unicode shit...
- XML's embedding replaced by custom multi-language word2vec pretrained vectors & spm6 model
- hyperparameter search implemented (bs/lr), new model eval code to find best training result
- re-implemented the milter API from scratch using native asyncio -> deepspam4.py
- first stable/public release! (after testing in production for a week)
- visualization (asciiart/matplotlib) of the model weights (trainging & testing processes)
- email parser: refactor/rewrite hdrdecode & parse_from functions of old hdrdecode.py
- torch_emb.py: direct rewrite of deepspam1's model train code to pytorch
- torch_spm3.py: new model trainer, uses py class from model/ dir
- torch_eval.py: model eval & compare
- maildedup3.py: email deduplication and parsing, from mbox to txt
- mailer4.py: Python3 version of my old email reader, used primarily for spam-dataset preparation & model eval
- eml2str.py: all of my mime email parser and html2txt converter functions
- ttykeymap.py: python version of my old getch2.c, console i/o functions and text UI
- widechars.py: wide unicode utilities, based on https://github.com/jquast/wcwidth
- striprtf.py: rtf to txt converter, from https://github.com/joshy/striprtf