This repository contains the code written to support the research for my thesis, *Finding Latent Features in Internet Censorship Data*. The thesis was further refined and subsequently published as *Detecting Network-based Internet Censorship via Latent Feature Representation Learning*; a preprint is available at arxiv.org.
The machine learning models are built with PyTorch, extended by PyTorch Lightning. Logging was set up to use Comet. If you wish to use a different logger, it can easily be swapped in your instance of this code.
The Censored Planet data needs to be transformed into datasets that can be used with our models. I built my base dataset by ingesting one large `CP_Quack-echo-YYYY-MM-DD-HH-MM-SS.tar` file at a time to accommodate the speed and stability of my computing environment. My data was taken from 5 days in the summer of 2021. I believe that the structure of their data has since changed, so you will likely need to refactor `cp_flatten.py` if you are using newer data.
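The ingestion step can be pictured as iterating the members of a single tar file and decoding one JSON record per line. The sketch below only demonstrates that traversal, assuming line-delimited JSON members with illustrative `Server` and `Blocked` fields; the real `cp_flatten.py` also validates and vectorizes each record:

```python
import io
import json
import os
import tarfile
import tempfile

def iter_quack_records(tar_path):
    """Yield one JSON record per line from every regular member of a tar file."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            for line in handle:
                yield json.loads(line)

# Demo: build a tiny single-member tar, then read it back.
tmp = tempfile.mkdtemp()
tar_path = os.path.join(tmp, "demo.tar")
payload = json.dumps({"Server": "1.2.3.4", "Blocked": False}).encode() + b"\n"
with tarfile.open(tar_path, "w") as tar:
    info = tarfile.TarInfo("scan.json")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
records = list(iter_quack_records(tar_path))
assert records[0]["Server"] == "1.2.3.4"
```

Streaming one member at a time like this keeps memory flat regardless of archive size, which matters when each tar file is large.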
```mermaid
flowchart TD
    A[/Quack tar file/] --> B(cp_flatten_processor.py)
    B --> C[/Pickled Dictionary<br>Stored at indexed path/]
    C -- iterate --> B
    B --> D[/Create or update<br>metadata.pyc/]
    D --> E(Single tar file processed)
```
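The "create or update `metadata.pyc`" step amounts to accumulating dataset-wide counts in a pickled dictionary as each tar file is processed. A minimal sketch, with illustrative keys (the real fields are defined by `cp_flatten_processor.py`):

```python
import os
import pickle
import tempfile

def update_metadata(dataset_dir, new_counts):
    """Create metadata.pyc if missing, then fold in counts from one tar file."""
    path = os.path.join(dataset_dir, "metadata.pyc")
    if os.path.exists(path):
        with open(path, "rb") as f:
            metadata = pickle.load(f)
    else:
        # Illustrative keys; the actual metadata fields live in the repository.
        metadata = {"length": 0, "censored": 0, "undetermined": 0, "uncensored": 0}
    for key, value in new_counts.items():
        metadata[key] = metadata.get(key, 0) + value
    with open(path, "wb") as f:
        pickle.dump(metadata, f)
    return metadata

# Demo: two tar files' worth of counts accumulate into one metadata file.
dataset_dir = tempfile.mkdtemp()
update_metadata(dataset_dir, {"length": 1000, "censored": 40})
totals = update_metadata(dataset_dir, {"length": 500, "censored": 10})
```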
The flattened and vectorized data is stored as pickled dictionaries using an indexed directory structure under the specified output directory:
```mermaid
flowchart TD
    A[Dataset dir] --- 0
    A --- 1
    A --- 2
    A --- B[...]
    A --- m
    A --- i[/metadata.pyc/]
    2 --- 2-0[0]
    2 --- 2-1[1]
    2 --- 2-2[2]
    2 --- 2-c[...]
    2 --- 2-99[99]
    2-2 --- 220[/202000.pyc/]
    2-2 --- 221[/202001.pyc/]
    2-2 --- 222[/202002.pyc/]
    2-2 --- 22c[/.../]
    2-2 --- 229[/202999.pyc/]
```
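Reading the tree above, file `202001.pyc` sits under `2/2/`, which suggests shards of 1,000 records grouped 100 leaf directories to a top-level directory. A hedged sketch of that index-to-path mapping (the authoritative scheme is in `cp_flatten_processor.py`):

```python
import os

def storage_path(root, index):
    """Map a record index to its shard path, e.g. 202001 -> root/2/2/202001.pyc.

    Assumes 1,000 files per leaf directory and 100 leaf directories per
    top-level directory, as the tree above suggests.
    """
    top = index // 100000           # 202001 -> 2
    second = (index // 1000) % 100  # 202001 -> 2
    return os.path.join(root, str(top), str(second), f"{index}.pyc")
```

Keeping each leaf directory to about a thousand files avoids the slow directory listings that a single flat directory of millions of pickles would cause.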
These dictionary files are used in the remainder of the project via `QuackIterableDataset`, found in `cp_dataset.py`. This iterable dataset is managed using `QuackTokenizedDataModule`.
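Conceptually, iterating the dataset is a walk over the indexed tree that unpickles each shard. The stand-in below captures just that walk; the actual `QuackIterableDataset` subclasses `torch.utils.data.IterableDataset` and consults `metadata.pyc` (the `index` field here is illustrative):

```python
import os
import pickle
import tempfile

class PickledDictIterable:
    """Walk the indexed directory tree, yielding each pickled dictionary."""

    def __init__(self, root):
        self.root = root

    def __iter__(self):
        for dirpath, _dirnames, filenames in os.walk(self.root):
            for name in sorted(filenames):
                # metadata.pyc describes the dataset; it is not a record.
                if name == "metadata.pyc" or not name.endswith(".pyc"):
                    continue
                with open(os.path.join(dirpath, name), "rb") as f:
                    yield pickle.load(f)

# Demo on a tiny synthetic tree.
root = tempfile.mkdtemp()
shard = os.path.join(root, "0", "0")
os.makedirs(shard)
for i in range(2):
    with open(os.path.join(shard, f"{i}.pyc"), "wb") as f:
        pickle.dump({"index": i}, f)
with open(os.path.join(root, "metadata.pyc"), "wb") as f:
    pickle.dump({"length": 2}, f)
items = list(PickledDictIterable(root))
```

An iterable (rather than map-style) dataset fits this layout well: records can stream from disk in shard order without loading an index of every path up front.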
For the image-based model, this data is accessed via `QuackTokenizedDataModule` and stored in two new datasets by `cp_image_reprocessor.py`, using a similar directory tree in which each leaf directory stores a PNG image file and a pickle file of the encoded pixels and metadata. The first image dataset is balanced between censored and uncensored for training the replacement classifier layer in DenseNet. The second set contains all the undetermined records.
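Both here and in the embedding pipeline below, "balanced" means downsampling the labeled records to equal censored and uncensored counts while setting the undetermined records aside. A minimal sketch, assuming an illustrative `censored` field coded 1 (censored), -1 (uncensored), or 0 (undetermined):

```python
import random

def balanced_split(records, seed=42):
    """Return (balanced labeled records, undetermined records)."""
    censored = [r for r in records if r["censored"] == 1]
    uncensored = [r for r in records if r["censored"] == -1]
    undetermined = [r for r in records if r["censored"] == 0]
    # Downsample the majority class so both labels appear equally often.
    n = min(len(censored), len(uncensored))
    rng = random.Random(seed)
    balanced = rng.sample(censored, n) + rng.sample(uncensored, n)
    rng.shuffle(balanced)
    return balanced, undetermined

# Demo: 3 censored, 2 uncensored, 4 undetermined records.
records = (
    [{"censored": 1, "ip": f"c{i}"} for i in range(3)]
    + [{"censored": -1, "ip": f"u{i}"} for i in range(2)]
    + [{"censored": 0, "ip": f"q{i}"} for i in range(4)]
)
labeled, undetermined = balanced_split(records)
```

Balancing the training set this way keeps the classifier from simply learning the majority label, which matters because censored responses are a small fraction of the scans.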
The flattened and tokenized data is used to train the autoencoder:
```mermaid
flowchart TD
    A[QuackIterableDataset] --> B[QuackTokenizedDataModule]
    B --> C(ae_processor.py)
    C --iterate--> B
    C --> D[trained QuackAutoEncoder]
```
The trained autoencoder model is captured and used as an additional input to `ae_processor.py` to process the data into two sets of embeddings. One set is labeled and balanced between censored and uncensored for training the classifier. The second set consists of embeddings of the undetermined records.
```mermaid
flowchart TD
    J[/trained QuackAutoEncoder/] --> M
    K[QuackIterableDataset] --> L[QuackTokenizedDataModule]
    L --> M
    M(ae_processor.py) --> N[AutoencoderWriter]
    N --> O[/.pyc file in indexed directory/]
```
These two datasets of embeddings are managed with `QuackLatentDataModule`.
Classification as censored or uncensored is the core task of this work. There are two classification processes built in this repository. `latent_processor.py` both trains a `QuackLatentClassifier` using a set of labeled embeddings and uses the trained `QuackLatentClassifier` to classify undetermined embeddings as either censored or uncensored. `dn_processor.py` both trains a `QuackDenseNet` using a labeled set of image data and then uses the trained `QuackDenseNet` to classify undetermined image data as either censored or uncensored.
Our data was processed on the CUNY HPCC, which uses SLURM to manage jobs. Figuring out how to configure for SLURM was a challenge. An additional challenge was that PyTorch no longer supported the older GPUs we had available, so we needed to train in parallel on CPU. I eventually solved parallel processing on that architecture by using the Ray parallel plugin. These job scripts also contain setup for this plugin. I've left them here because I had trouble finding examples. Your computing environment is almost certainly different, and that will cause further changes in your instance of this code.
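As a rough illustration only, a CPU-parallel training job on such a cluster might be submitted with a script shaped like the one below; the resource values, paths, and the `--num_workers` flag are placeholders, not the repository's actual job scripts:

```bash
#!/bin/bash
#SBATCH --job-name=quack-ae
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=48:00:00

# Activate the project environment (path is a placeholder).
source "$HOME/venv/bin/activate"

# Hand the SLURM CPU allocation to the processor script, which configures
# the Ray parallel plugin for CPU-only training.
python ae_processor.py --num_workers "$SLURM_CPUS_PER_TASK"
```

The key point is that the CPU count requested from SLURM must match what the Ray plugin is told to use, or workers will oversubscribe the allocation.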
This documentation is presented in markdown that was generated from the docstrings within each Python module. It may be found in the `docs` directory here in the repository.