- Setting up server, Conda, and GPUs (based on UIUC NCSA's Delta server): https://docs.google.com/document/d/1U5KpvcJr5ousA-zq9EcdzArJlSgpgM4wdYXXYV6tCLg/edit?tab=t.0
- To run the PNC autoencoder (no classification): `python autoencoder_train.py --model=PNC`
- To run the PNC autoencoder (with classification integrated): `python autoencoder_train.py --model=PNC_with_classification`
- To run the LRAE_VC autoencoder (no classification): `python autoencoder_train.py --model=LRAE_VC`
- To run conv_lstm_ae (whose current AE baseline is PNC16) with NO dropped-out features: `python autoencoder_train_vid_sequence.py --model conv_lstm_ae`
- To run conv_lstm_ae (whose current AE baseline is PNC16) with up to x/16 zeroed-out features dropped (e.g. x = 12 below): `python autoencoder_train_vid_sequence.py --model conv_lstm_ae --model_path conv_lstm_ae_final_weights.pth --epochs 25 --drops 12`
- To run the simple + assumed-to-be-"pretrained" classifier:
  - Run the receiver first (example below; change args as necessary): `python receiver_decode.py --model_path="PNC_final_w_random_drops.pth" --host=127.0.0.1 --port=8080` and wait until it says "Listening on ..."
  - Then run the sender (example below; change args as necessary): `python sender_encode.py --input_dir="UCF_224x224x3_PNC_FrameCorr_input_imgs/" --model_path="PNC_final_w_random_drops.pth" --host=127.0.0.1 --port=8080`
- Implement the second, dual neural network for PREDICTING missing features in latent encodings
- Implement and incorporate object classification into the autoencoder NN. How to do this? --> (https://docs.google.com/document/d/1svHaRZ1yiAsARJDC_MInBo5Ln_tRjIg14dhBlCV_UsI/edit?usp=sharing)
- Create a rate-distortion curve
- Implement Tambur + FEC/ECC
- Implement quantization + entropy coding (if time available)
- Implement the modified, regularized loss from here: https://interdigitalinc.github.io/CompressAI/zoo.html (if time is available; I doubt I'm going to do this)
- Results sheet: https://docs.google.com/spreadsheets/d/1NVdFgHwTFBAl3Qp2PYW8EZE4UFLQ0xObCSjxnE2KDeo/edit?usp=sharing
- Just do `ssh gpua058` in another terminal to access the node from another terminal.
- `nvidia-smi` for GPU stats.
- `du -sh path_to_file_or_folder` to show the disk usage of a folder or file.
- `ls -1 /path/to/directory | wc -l` to count the number of files in a directory.
- **YOU MUST** do `srun --pty bash`, `ssh gpuaXXX`, or in general be inside the GPU VM to export the env vars MASTER_ADDR and MASTER_PORT for PyTorch's DDP to work! Run `export MASTER_ADDR=$(hostname)` and `export MASTER_PORT=12355`, and then run with `python your_script.py`!!
- Make sure to ONLY do `python your_script.py` when you're in the GPU VM via `srun --pty bash`.
- OTHERWISE, do `srun python your_script.py` if you're OUTSIDE of the GPU VM.
- If you're using PyTorch's DDP: `srun python -m torch.distributed.launch your_script.py` (a sketch of how DDP consumes those env vars is below).
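For reference, here is a minimal single-process sketch (illustrative values, not the repo's launch setup) of why MASTER_ADDR/MASTER_PORT must be exported: DDP's default "env://" rendezvous reads them so all ranks can find rank 0.

```python
import os
import torch
import torch.distributed as dist

# Same effect as the shell exports above; values are just examples.
os.environ.setdefault("MASTER_ADDR", "localhost")   # i.e. $(hostname) on the GPU node
os.environ.setdefault("MASTER_PORT", "12355")

# nccl backend for GPU nodes; gloo works as a CPU-only fallback.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend, rank=0, world_size=1)  # single-process example

# After init, you would wrap the model, e.g.:
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])

dist.destroy_process_group()
```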
- To run the bidirectional models (or not), just find this line: `model = ConvLSTM_AE(total_channels=32, hidden_channels=32, ae_model_name="PNC32", bidirectional=...)` and set `bidirectional` to True or False. No extra work required!
What Is Redundancy? It means that the same critical information is stored in more than one place. In the context of an autoencoder’s latent space, redundancy means that even if some of the features (or channels) are lost or dropped, the remaining features still contain enough information to allow the decoder to reconstruct the original input accurately.
Why Might a Model Learn Redundancy with Tail Dropout but Not as Effectively with Random Interspersed Dropouts?
Tail Dropout Strategy:
- Ordered Information: When you always drop features at the tail (the last few channels), the network quickly learns that the first few channels are almost always available. It then places the most critical, base-level information into these early channels.
- Incentive to Be Redundant: Since the tail is frequently dropped, the model is forced to duplicate or shift important information into the base channels. This built-in redundancy ensures that even if the enhancement layers (the tail) are missing, the core information needed for reconstruction is still preserved.
Random Interspersed Dropout Strategy:
- Uniform Randomness: With random interspersed dropout, any channel, regardless of its position, can be dropped during training. There is no consistent pattern.
- No Clear “Safe Zone”: Because the dropout is unpredictable across all channels, the network cannot rely on a subset of channels (like the early ones) to always be present. This makes it harder for the network to learn an ordered, redundant structure where some channels are reserved as a robust base.
- Distributed Representation: The network ends up learning to spread information more evenly across all channels rather than concentrating critical details into a protected subset. While this does enforce some redundancy (since every channel must potentially compensate for a missing one), it doesn’t create a clear hierarchy of “base” versus “enhancement” features. This can make the network more vulnerable when a crucial channel is dropped, as there's no predictable backup for the information.
In Summary
- Redundancy is about duplicating key information so that loss of some parts doesn’t cripple performance.
- Tail dropout encourages the network to concentrate essential information into the early channels because those channels are reliably available. This promotes redundancy in a progressive, ordered manner.
- Random interspersed dropout applies uniformly across all channels, forcing the network to spread information evenly rather than creating a reliable “base layer.” As a result, it may not foster the same type of redundancy where some channels are reliably preserved.
The choice between the two strategies depends on your design goals. If you want a progressive representation where some features are consistently available (mimicking a base layer in progressive transmission), tail dropout is more effective. If you aim to simulate completely random loss, interspersed dropout is closer to reality—but it might not lead to as robust an ordering of information.
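As a concrete illustration (hypothetical helper names; 16 latent channels to match PNC16), the two strategies differ only in which channel indices get zeroed:

```python
import torch

def tail_dropout(latents: torch.Tensor, num_drops: int) -> torch.Tensor:
    """Zero out the LAST num_drops channels (the 'enhancement' tail)."""
    out = latents.clone()
    out[:, latents.shape[1] - num_drops:] = 0.0   # always the same tail channels
    return out

def random_interspersed_dropout(latents: torch.Tensor, num_drops: int) -> torch.Tensor:
    """Zero out num_drops channels chosen uniformly at random."""
    out = latents.clone()
    idx = torch.randperm(latents.shape[1])[:num_drops]  # ANY channels can go
    out[:, idx] = 0.0
    return out

latents = torch.rand(4, 16, 28, 28)                  # (batch, channels, H, W)
tail = tail_dropout(latents, num_drops=12)           # channels 4..15 zeroed
rand = random_interspersed_dropout(latents, num_drops=12)
```

With tail dropout, channels 0-3 survive every time, so the encoder can treat them as a protected base layer; with the random variant no such safe subset exists.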
Our autoencoder wasn't trained to be integrated with the LSTM imputation (AKA feature filler), so the weights for encoding/compressing + decoding/reconstruction did not "fit" well with the weights corresponding to the LSTM imputation model. The AE and LSTM components were trained separately and just smashed together without performing some extra "post"-training to ensure they "fit" and integrate smoothly.
Why did autoencoder_post_drop_train/eval2.py (which attempts to alleviate the above-mentioned problem) fail?
Though autoencoder_post_drop_train/eval2.py does perform some post-training (following normal AE and LSTM training) to try to get the AE and LSTM components to integrate smoothly, it still failed pretty spectacularly, for (what I suspect are) the following reasons:
1. The pipeline involved loading the AE (e.g. PNC16) model --> effectively "freezing" the encoder (via model.eval() and with torch.no_grad()) and detaching it from the AE model --> the encoder encodes the latents and stores/caches them in a global dictionary (defaultdict) --> the decoder uses that global dict of latents to reconstruct the sequences of video frames --> backpropagate gradients + update weights OF JUST THE DECODER (NOT THE ENCODER)! The last step was mostly a big problem, although this pipeline is still way too "fragmented" despite the attempt at "post"-training.
2. I was training everything from scratch, including the dropouts, imputations, etc. after each epoch. It's much better to pretrain without incorporating any wacky constraints + conditions, and then progressively add dropouts, imputations, etc. in the post-pretraining stage.
EDIT (related to 2.): I just realized there's a bug in my code, particularly if I set training_from_scratch=True. When I train from scratch, the encoder's initial weights will obviously be completely off, but then I never update the encoder's initial weights due to freezing/detaching the encoder LOL (via ae_model.eval() and torch.no_grad()).
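A toy reproduction of that failure mode (stand-in conv layers, not the repo's PNC16 classes): because the latents are produced under torch.no_grad(), backward() only ever reaches the decoder, so a from-scratch encoder never moves off its random initialization.

```python
import torch
import torch.nn as nn

encoder = nn.Conv2d(3, 16, 3, padding=1)   # stand-in for the PNC16 encoder
decoder = nn.Conv2d(16, 3, 3, padding=1)   # stand-in for the decoder
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)  # decoder-only optimizer

frames = torch.rand(4, 3, 224, 224)        # fake batch of frames

encoder.eval()
with torch.no_grad():                      # latents are detached from the graph
    latents = encoder(frames)              # cached into the "global dict" in the real code

recon = decoder(latents)
loss = nn.functional.mse_loss(recon, frames)
loss.backward()                            # gradients flow ONLY into the decoder
optimizer.step()                           # encoder weights never move off their init
```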
Though this was of course written before the 2nd version, there was an oversight when I wrote this: namely, this implementation passed in the FULL (ground-truth) combined video features. This obviously doesn't make sense, because in a real scenario you won't have access to the full, combined video features, especially during network congestion + packet loss.
After realizing this, I decided to implement my pipeline + forward pass as encoder -> lstm -> decoder.
- The LSTM flattened dimensions to 1D, making it harder for the model to work with and train from. My solution: use a 2D ConvLSTM (see the sketch below).
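To make that concrete, here is a toy sketch of the encoder -> ConvLSTM -> decoder forward pass. The cell below is a minimal stand-in, not the repo's ConvLSTM_AE; all names and shapes are illustrative. The key point is that the latents stay 2D feature maps throughout, instead of being flattened to 1D vectors.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: gates are computed by a single Conv2d."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.gates = nn.Conv2d(channels + hidden, 4 * hidden, 3, padding=1)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

encoder = nn.Conv2d(3, 16, 3, stride=2, padding=1)      # per-frame compression
decoder = nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1)
cell = ConvLSTMCell(16, 16)

frames = torch.rand(2, 5, 3, 64, 64)                    # (B, T, C, H, W)
B, T = frames.shape[:2]
h = c = torch.zeros(B, 16, 32, 32)
recons = []
for t in range(T):
    z = encoder(frames[:, t])                           # latent stays (B, 16, 32, 32)
    h, c = cell(z, h, c)                                # h carries temporal context
    recons.append(decoder(h))
recon = torch.stack(recons, dim=1)                      # (B, T, 3, 64, 64)
```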
Pretraining first (without any extra stuff like dropouts + imputations) and then training with the fancy stuff (e.g. dropouts) is MUCH better than trying to train everything, including dropouts after every epoch, at once. This turned out to be crucial for success.
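A rough, self-contained sketch of that two-stage recipe (toy AE, fake data, hypothetical names; the real training lives in the scripts listed at the top):

```python
import torch
import torch.nn as nn

class ToyAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, stride=2, padding=1)
        self.decoder = nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1)

model = ToyAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = [torch.rand(4, 3, 64, 64) for _ in range(8)]   # fake "dataset"

def train_one_epoch(drop_channels: int):
    for frames in loader:
        latents = model.encoder(frames)
        if drop_channels > 0:                            # zero random latent channels
            idx = torch.randperm(latents.shape[1])[:drop_channels]
            latents[:, idx] = 0.0
        loss = nn.functional.mse_loss(model.decoder(latents), frames)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

for epoch in range(30):                                  # stage 1: plain pretraining
    train_one_epoch(drop_channels=0)
for epoch in range(25):                                  # stage 2: ramp up the drops
    train_one_epoch(drop_channels=min(12, epoch + 1))
```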
No, not exactly. The ConvLSTM processes frames sequentially: each frame's hidden state is updated using the previous hidden state, so it carries past context. However, once the sequence is processed, we often reshape the output (merging the batch and time dimensions) to apply a convolutional mapping to each time step individually. That mapping operates on each frame separately, but each frame's representation already includes the temporal history learned by the LSTM.
The LSTM treats each batch element (i.e., each video or sequence) independently. The batch dimension is simply a collection of separate sequences processed in parallel. When you merge the batch and time dimensions later, you're not mixing frames from different videos; you're just flattening the tensor to apply the same convolutional operation to every time step's output across all batches. Each frame's representation already contains its own temporal context from its respective sequence.
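A small sketch of that reshape (illustrative shapes and names): frames from different videos are never mixed, the tensor is just flattened so one Conv2d can be applied to every time step at once, then restored.

```python
import torch
import torch.nn as nn

B, T, C, H, W = 2, 5, 16, 32, 32
lstm_out = torch.rand(B, T, C, H, W)        # per-frame hidden states; each already
                                            # carries its sequence's temporal context
head = nn.Conv2d(C, C, 3, padding=1)

merged = lstm_out.reshape(B * T, C, H, W)   # merge batch and time dims
mapped = head(merged)                       # same conv applied to every time step
restored = mapped.reshape(B, T, C, H, W)    # back to (batch, time, ...)
```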