The French version can be found here.
The unsubscribe.sh script allows mass unsubscription from unwanted mailing lists. It is based on the List-Unsubscribe field defined by RFC 2369 (July 1998), which is generally used in French advertising emails. This field contains <mailto:> and/or <http:> (or <https:>) links. The script also detects non-standard X-List-Unsubscribe fields, the lowercase list-unsubscribe field, and http links without <>.
- This script has been tested with the Mozilla Thunderbird email software but should work with any software that stores emails in text files.
- This bash script is essentially based on the grep and wget commands, which you may need to install separately. It should work on any UNIX-like system, including MSYS2 on Windows. Since there are many variants of the grep command, try installing GNU grep if you encounter problems (the script has been tested with version 3.3). Finally, with another shell, some minor modifications should be enough: in particular, the -o pipefail option can be removed without any problem.
- Clone the GitHub repository or download and extract the zip into a directory.
- You can work on the "Spam" directory of your account or create a special directory in which you will move the emails you want to unsubscribe from.
- The e-mails must be physically present (downloaded) on your local hard drive (displaying the subject of the e-mail is not enough). In Thunderbird, if they have not been marked as read, just select all the emails and right click on "Read selected messages". Or in the folder properties go to the "Synchronization" tab and click on the "Download Now" button.
- For security reasons, it is advisable to delete any e-mails that may present a risk (phishing...) from the directory that will be used (in your e-mail client), even though using wget to connect to the web theoretically limits the risks compared to using a browser.
- Use the "compact folders" function to permanently remove from the file the e-mails already "deleted" (in fact, only removed from the index).
- Locate the path in the file system of the file or directory containing the e-mails to be processed. You can also work on a copy.
- Run the script by providing either a file directly:
$ ./unsubscribe.sh ~/.thunderbird/rfjzi2xb.default/Mail/pop.aliceadsl.fr/Junk
- or a directory, in which case the script scans all the files it contains, including those in subdirectories:
$ ./unsubscribe.sh ~/.myemailsoftware/Junk/
Tip: to avoid having to find the path to the Junk file each time, create a link to this file in the script directory.
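The tip above can be sketched as follows. The paths are placeholders (the Junk path is the one from the example above), and the JUNK_FILE and SCRIPT_DIR variables as well as the link name Junk are arbitrary choices made for this illustration, so adapt them to your own mail client.

```shell
# Placeholder locations (hypothetical): adapt both to your own setup.
junk_file="${JUNK_FILE:-$HOME/.myemailsoftware/Junk}"   # file or directory holding the spam
script_dir="${SCRIPT_DIR:-.}"                           # directory containing unsubscribe.sh

# Create (or refresh) a symbolic link named "Junk" next to the script.
ln -sfn "$junk_file" "$script_dir/Junk"

# The script can then be started with a short path:
#   ./unsubscribe.sh Junk
```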
The grep analysis of the files can take some time (several tens of seconds for a thousand spam messages). After the analysis, the script displays the progress of the unsubscriptions with a dot (".") per link, or a zero ("0") if the connection fails (for instance, a link that is several months old may no longer be valid).
The output of the wget command is appended to the unsubscribe.log file and the downloaded files are saved in the downloaded directory. These files let you identify unsubscriptions that have failed. The script leaves it up to you to clean them up if necessary.
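The progress display described above can be illustrated with the following sketch. It is not the script's actual code: fetch is a hypothetical stand-in that lets the example run offline, the links are made up, and the wget call shown in the comment is only one plausible way to append to a log and save files in a directory.

```shell
# fetch is a hypothetical stand-in for the network call, so the sketch runs offline.
# A real run could call something like:
#   wget --append-output=unsubscribe.log --directory-prefix=downloaded "$1"
fetch() {
  case "$1" in
    *expired*) return 1 ;;   # simulate a link that is no longer valid
    *) return 0 ;;
  esac
}

# Made-up links, one per line.
links='https://example.com/unsub?id=1
https://expired.example.net/unsub?id=2
https://example.org/unsub.html'

progress=""
failures=0
while IFS= read -r link; do
  if fetch "$link"; then
    progress="${progress}."          # one dot per successful connection
  else
    progress="${progress}0"          # a zero when the connection fails
    failures=$((failures + 1))
  fi
done <<EOF
$links
EOF

echo "$progress ($failures failed)"
```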
Fields containing only an e-mail address are then detected, and the e-mail addresses are simply collected in the emails.log file. It is then up to you to use these addresses. Be careful: a mass sending of unsubscribe e-mails could be misinterpreted by your provider, and you could be automatically filtered as a spammer. Finally, some e-mail addresses may be followed by a string such as ?subject=blablablabla, which is also up to you to interpret.
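If you want to process the collected entries yourself, the address and the optional ?subject= string can be separated with plain shell parameter expansion. This is only a sketch: the entry below is a made-up example, and the variable names are not taken from the script.

```shell
# Made-up entry of the kind that may appear among the collected addresses.
entry='unsub@example.com?subject=Unsubscribe%20me'

# Everything before the first "?" is the e-mail address.
address="${entry%%\?*}"

# Keep the subject only if the entry really contains "?subject=".
subject=""
case "$entry" in
  *'?subject='*) subject="${entry#*\?subject=}" ;;
esac

printf 'address=%s subject=%s\n' "$address" "$subject"
```

An entry without any "?" is left unchanged in address, and subject stays empty.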
Finally, the script displays statistics allowing you to estimate the success rate of the operation.
-h: displays help.
-n: dry run. The script displays the links found but wget is not called, so nothing is unsubscribed.
This script will fail with a small percentage of spam because:
- some emails contain a <mailto:> link but no <http:> link,
- some unsubscribe pages ask you to confirm by clicking a button,
- spam that comes from abroad does not always offer a List-Unsubscribe field, or the characters in the field are sometimes encoded in a way that prevents the script from finding the link.
Although the number of emails received should be divided by at least three at first, the process will have to be repeated regularly. Since your e-mail address is in the possession of spammers, you risk being included in new advertising campaigns. But you should see a clear improvement in the long term.
- rfc2369
- the-ultimate-guide-to-list-unsubscribe
- wget
- The syntax of this script has been checked by the shellcheck utility.
- Bernard Desgraupes, Introduction to regular expressions; with awk, Java, Perl, PHP, Tcl... (2nd edition), Paris: Vuibert, 2008, ISBN 978-2-7117-4867-9.
The capture of links is done by the following commands:
grep ${recursif} -zPo '[Ll]ist-[Uu]nsubscribe:\s+?(?:<mailto:[^>]+?>,\s*?)?<http[s]?://[^>]+?>' "${chemin}" | tr '\000' '\n' | grep -Po 'http[s]?://[^>]+'
- The first grep is responsible for detecting the List-Unsubscribe fields (sometimes also written in lowercase). The pattern is not anchored to the beginning of the line, which also allows detection of non-standard X-List-Unsubscribe fields.
- The -z option makes grep treat the input as records terminated by null bytes instead of line breaks. Since an e-mail file normally contains no null byte, each file is handled as one long line, which gets around the fact that grep normally searches line by line, whereas List-Unsubscribe fields usually span one to three lines. Each match is also output followed by a null byte.
- The -P option enables Perl-compatible regular expressions (PCRE), the most expressive type of regular expression handled by the grep command.
- The -o option keeps only the portion matching the pattern, instead of the entire line.
- \s designates a whitespace character, in particular space, tab, and line break, which are the three characters that can be encountered at that location. The + indicates that there is at least one such character. The ? makes the quantifier lazy (non-greedy): instead of the regular expression engine's default greed, as few characters as possible are consumed before the next part of the expression.
- (?: opens a non-capturing group: the parentheses are used here for grouping only, not for capturing. Closing it with )? means that the presence of a <mailto:> link at this location is optional (zero or one occurrence).
- [^>]+?> means looking for at least one character other than a greater-than sign (">") before arriving at a closing greater-than sign.
- If there is a <mailto:> link followed by a <http:> link, there will be a comma followed by whitespace between them: a single space if everything is on the same line, or a line break followed by a space or tab.
- The [s]? (zero or one s) captures both <http:> and <https:> links.
- The tr command replaces the null bytes that terminate each match with line breaks so that the final grep can work line by line (no -z for this one).
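To see the pipeline above at work, the following sketch applies it to a made-up message whose List-Unsubscribe field is folded over two lines. The addresses are hypothetical, the ${recursif} option is omitted, and GNU grep is assumed (for -z and -P).

```shell
# Made-up message: the field contains a <mailto:> link, then an <https:> link on a folded line.
sample=$(mktemp)
printf 'Subject: Sale\nList-Unsubscribe: <mailto:unsub@example.com>,\n <https://example.com/unsub?id=42>\nFrom: ads@example.com\n' > "$sample"

# Same three stages as above: multi-line match, null bytes to line breaks, link extraction.
links=$(grep -zPo '[Ll]ist-[Uu]nsubscribe:\s+?(?:<mailto:[^>]+?>,\s*?)?<http[s]?://[^>]+?>' "$sample" \
  | tr '\000' '\n' \
  | grep -Po 'http[s]?://[^>]+')

echo "$links"   # prints https://example.com/unsub?id=42
rm -f "$sample"
```

Without the tr stage, the second grep would receive the whole match, line break included, as a single null-terminated record.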
And for the http links without <>:
grep ${recursif} -zPo '[Ll]ist-[Uu]nsubscribe:\s+?http[s]?://[^>]+?\.htm[l]?' "${chemin}" | tr '\000' '\n' | grep -Po 'http[s]?://[^>]+'
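As a quick check of this second pattern, the sketch below runs it on a made-up header (the example.com address is hypothetical, and GNU grep is assumed). Note that, as written, the pattern only recognizes links that contain .htm or .html.

```shell
# Made-up message with an http link that is not enclosed in <>.
sample=$(mktemp)
printf 'Subject: Offer\nList-Unsubscribe: http://example.com/unsubscribe.html\nFrom: ads@example.com\n' > "$sample"

bare_links=$(grep -zPo '[Ll]ist-[Uu]nsubscribe:\s+?http[s]?://[^>]+?\.htm[l]?' "$sample" \
  | tr '\000' '\n' \
  | grep -Po 'http[s]?://[^>]+')

echo "$bare_links"   # prints http://example.com/unsubscribe.html
rm -f "$sample"
```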
The e-mail addresses are captured by the following commands:
grep -oPz '[Ll]ist-[Uu]nsubscribe:\s+?<mailto:[^>]+?>[^,]' "${path}" | tr '\000' '\n' | grep -oP '(?<=mailto:)[^>]+' > e-mails.log
- [^,]: if the <mailto:> link is not followed by a comma, there is no <http:> link.
- In the second grep, the pattern (?<=mailto:)[^>]+ matches the characters other than a greater-than sign (">") that follow mailto:, which is itself not captured (positive lookbehind).
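The effect of [^,] and of the lookbehind can be checked on a made-up file containing two fields: one with only a <mailto:> link, and one with a <mailto:> link followed by an <https:> link. The addresses are hypothetical, GNU grep is assumed, and the result is kept in a variable instead of being written to a log file.

```shell
# Two made-up messages: the first has only a mailto link, the second has both kinds.
sample=$(mktemp)
printf 'List-Unsubscribe: <mailto:unsub@example.com?subject=unsubscribe>\nSubject: A\n\n' > "$sample"
printf 'List-Unsubscribe: <mailto:other@example.org>, <https://example.org/u>\nSubject: B\n' >> "$sample"

# Only the field whose mailto link is not followed by a comma is kept.
addresses=$(grep -oPz '[Ll]ist-[Uu]nsubscribe:\s+?<mailto:[^>]+?>[^,]' "$sample" \
  | tr '\000' '\n' \
  | grep -oP '(?<=mailto:)[^>]+')

echo "$addresses"   # prints unsub@example.com?subject=unsubscribe
rm -f "$sample"
```

The second address is skipped because that message already offers an https link, which the link pipeline handles.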
Vincent MAGNIN, first commit: 2020-02-16
English translation made by trolologuy