This repo contains my attempt to make an autoencoder for compressing frames from a VR game (specifically ChilloutVR).
- evidence_images: Comparisons between algorithms and a script to generate them.
- logs: TensorBoard-compatible logs of some training runs.
- model_checkpoints: Saved weights for models.
  - Model<letter>: The model.
    - <model-name>.data: Part of the checkpoint.
    - <model-name>.index: Also part of the checkpoint.
- model: The model's set-up.
  - Model<letter>: The model.
- model_to_load: Put a checkpoint here to load it.
- processed: (not in the repo) The dataset is loaded from here.
- results: (not in the repo) Generated result images go here.
- autoencode.py: The main file; trains and validates the model.
- dataset.zip: A subset of the dataset at a smaller resolution.
- image_preprocess.py: Halves the resolution of images; used to make the dataset.
- notes.md: Some observations.
Admittedly, most of the following is from my paper for my Principles of Machine Learning course.
Virtual Reality (VR) is a well-known technology that is still actively being developed. Processing constraints are a particular concern: VR applications need to render two high-resolution images from slightly different perspectives to create depth, at a frame rate that does not feel "choppy," since choppiness can cause motion sickness. At the same time, physical constraints require the user either to tether their headset to their computer or to put a mobile computer on the headset itself. Some users, however, find a compromise in wireless streaming applications like Virtual Desktop or Oculus Air Link. These let the user have a tether-less experience while still having a full-fledged computer run the VR application. A recurring issue is transferring the high-resolution (and therefore large) application frames over a sometimes suboptimal wireless network. Compression is used to reduce the file size so that less data has to be transferred and each frame transfers faster, assuming compression and decompression do not take longer than the time saved in transfer. Image compression comes in two common forms, lossy and lossless; typically, lossy algorithms produce smaller files at the cost of some artifacting.
Machine learning (ML) approaches to image compression are uncommon and yet surprisingly effective. Their benefit is that they can learn ways to compress images that may not be intuitive to a human. Additionally, they can be trained on a specific environment and specific data, allowing them to outperform approaches that are vastly more general. The model used here is a Convolutional Neural Network (CNN), a type of neural network that works well on image data.
The goal of this research is to create an ML model that compresses images specifically for wireless VR streaming. This requires a model that is simultaneously visually accurate, well compressed, and fast to encode and decode. For this I used a CNN set up as an autoencoder. Autoencoders are neural networks (NNs) that shrink the data flowing through them toward the middle and expand it back to its original size at the output, forcing the model to generalize. This shape is well suited to compression, since the same model learns both to shrink the data and to expand it back; the model then only needs to focus on reproducing the original data at the output.
I recorded two hours of ChilloutVR play, capturing SteamVR's view of both eyes to properly emulate the user's view. Each frame was saved as a 1920x1080, 3-channel image in PNG format to avoid lossy compression. The biggest problem with this data was having far too much of it: recording at 5 frames per second still produced 40 GB, and that was after editing out frames that might cause issues, namely those containing my desktop overlay. I then halved the resolution to 960x540, mainly to make the model faster to train.
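The repo's image_preprocess.py handles that halving step. A minimal sketch of what it might look like (the folder names here are assumptions, not the script's actual paths):

```python
# Hypothetical sketch of the halving step done by image_preprocess.py;
# folder names are assumptions.
from pathlib import Path

from PIL import Image

SRC = Path("raw_frames")  # assumed folder of 1920x1080 PNG captures
DST = Path("processed")   # folder the training script loads the dataset from
DST.mkdir(exist_ok=True)

for frame in SRC.glob("*.png"):
    img = Image.open(frame)
    # Halve both dimensions: 1920x1080 -> 960x540.
    half = img.resize((img.width // 2, img.height // 2))
    half.save(DST / frame.name)  # stay in PNG so no lossy compression sneaks in
```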
I made several models, each one tuned to better balance compression, time, and quality. The most important statistic is encode/decode time, as latency is one of the biggest causes of motion sickness and general discomfort in wireless setups. Quality is secondary: images that are recognizable enough matter more than recreating detail exactly. The overall picture is more important than small details in most cases, though text is an important exception since it conveys much of an application's information to the user. Last is size, since any reduction is useful and the user can be asked to select whichever trade-off suits them.
Models were built off of a tutorial from the Keras developers. The first model compressed images to a tiny size (4% of the original) but was significantly worse in image accuracy and quality. The second model was set up not to compress as far, but struggled with text because it had few filters to work with. The third was given more filters and is considered the best model. The fourth used a different loss function but ultimately took longer to train with no discernible benefit. All models took roughly the same time to encode/decode (about 10 ms round trip).
Figure 1: Best Model Architecture
The best model is set up as seen in Figure 1. The encoder uses convolutional layers with progressively fewer filters and progressively smaller kernels. Each filter is a matrix that slides over the input image, creating a transformed image with the same dimensions. Between some layers are max pooling layers, which take the maximum value of a sliding window and shrink the image. The decoder mirrors the encoder, using convolutional transpose layers and upsampling layers to undo the pooling and convolutional layers. However, it does not undo the encoder's final max pooling layer, and it ends with a final convolutional transpose layer (3 filters, one per channel; 5x5 kernel; sigmoid activation).
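As a rough illustration of that shape, a Keras sketch might look like the following. This is not the exact Figure 1 model: the filter counts, kernel sizes, and the way the upsampling is arranged are assumptions chosen only to reproduce a 540x960x3 input and a 135x240x8 bottleneck (the sizes reported in the results below).

```python
# Illustrative Keras sketch of the autoencoder shape; layer parameters are
# assumptions, not the exact Figure 1 architecture.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(540, 960, 3))

# Encoder: progressively fewer filters and smaller kernels, with pooling.
x = layers.Conv2D(16, (9, 9), activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D((2, 2))(x)                       # 270x480
x = layers.Conv2D(8, (5, 5), activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D((2, 2))(x)                 # 135x240x8 latent

# Decoder: transpose convolutions and upsampling mirror the encoder,
# ending in a 3-filter, 5x5, sigmoid transpose layer as described above.
x = layers.Conv2DTranspose(8, (5, 5), activation="relu", padding="same")(encoded)
x = layers.UpSampling2D((2, 2))(x)                       # 270x480
x = layers.Conv2DTranspose(16, (9, 9), activation="relu", padding="same")(x)
x = layers.UpSampling2D((2, 2))(x)                       # 540x960
outputs = layers.Conv2DTranspose(3, (5, 5), activation="sigmoid", padding="same")(x)

autoencoder = keras.Model(inputs, outputs, name="autoencoder_sketch")
autoencoder.compile(optimizer="adam", loss="mse")
```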
Models were set up using TensorFlow and Keras. They were saved at regular intervals during training with TensorFlow's checkpointing system, and logs were kept with the TensorBoard callback so I could inspect training progress graphically. Matplotlib was used to create plots that save and compare images from the dataset before and after an encode/decode pass.
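A minimal sketch of that training setup, assuming hypothetical paths, epoch count, and dataset objects (train_ds, val_ds); the actual values live in autoencode.py:

```python
# Minimal sketch of the checkpoint + TensorBoard setup; paths, epoch count,
# and the dataset objects (train_ds, val_ds) are assumptions.
from tensorflow import keras

callbacks = [
    # Save weights at regular intervals during training. With
    # save_weights_only=True this writes the .data / .index file pairs
    # listed under model_checkpoints above.
    keras.callbacks.ModelCheckpoint(
        filepath="model_checkpoints/ModelX/ModelX",
        save_weights_only=True,
        save_freq="epoch",
    ),
    # Write logs that TensorBoard can plot.
    keras.callbacks.TensorBoard(log_dir="logs"),
]

autoencoder.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=callbacks)
```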
The following results are for the best model, described in Figure 1. The compression ratio describes how much the image is compressed as a ratio of the original image's size; a ratio of 0.25 means the compressed representation is 1/4 the size of the original. The best model compresses a 12441.6 kb image (a 540x960x3 tensor of 32-bit floats) down to 2073.6 kb (a 135x240x8 tensor of 32-bit floats), a compression ratio of 0.1666, or 1/6 of the original size.
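That ratio falls straight out of the tensor shapes, which makes it easy to sanity-check:

```python
# Sanity check of the quoted compression ratio: latent elements / input elements.
input_elems = 540 * 960 * 3   # original frame tensor
latent_elems = 135 * 240 * 8  # encoded representation
print(latent_elems / input_elems)  # 0.1666..., i.e. 1/6
```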
The model used MSE to calculate loss, as it was the best loss function I was aware of at the time. During training, the MSE went from 0.0026 initially to 0.0009 at the end, so the trained model's error is roughly a third (0.0009 / 0.0026 ≈ 0.35) of what you get from just scaling the image down and back up, which is effectively what an untrained model does. The validation MSE was 0.0006; while validation loss was lower than training loss, it varied wildly from epoch to epoch.
The model takes 12.1 ms on average for a full inference pass, making encode and decode roughly 6.05 ms each. For comparison, Oren Rippel and Lubomir Bourdev report in their paper, Real-Time Adaptive Image Compression, that JPEG takes 18.6 ms to encode and 13.0 ms to decode, and that WebP takes 67.0 ms to encode and 83.7 ms to decode. An application running at 60 frames per second needs to generate a frame every 16.7 ms, so JPEG and WebP add significant latency per frame.
Below is a comparison of output from JPEG, WebP, and the model. All three outputs are compressed to the same file size (2074 kb).
The model is not perfect; its major issues are compression ratio and image quality. It does not provide the quality that JPEG and WebP can provide at smaller file sizes. Potential solutions include restructuring the layers of the model, implementing some non-convolutional layers, or using a generative adversarial approach. The model's compression ratio is also too dependent on the max pooling layers; a fix for this is to add some non-convolutional layers where varying the number of neurons changes the compression ratio. As well, some models should be made with smaller compression ratios to provide better compression.
The model also needs a better metric to train with. It could be trained with Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM), or the Learned Perceptual Image Patch Similarity metric (LPIPS). Each of these does a better job of measuring the degradation of image quality. Training with them could produce a visually better model and could improve training speed.
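As a hedged sketch of how MSE could be swapped for an SSIM-based loss in Keras (LPIPS would need an extra library and is not shown), with function names of my own choosing:

```python
# Sketch of an SSIM loss and a PSNR metric using TensorFlow's built-ins;
# the function names are assumptions, not existing code in this repo.
import tensorflow as tf

def ssim_loss(y_true, y_pred):
    # tf.image.ssim returns a similarity score (1.0 = identical images),
    # so subtract from 1 to get something to minimize.
    return 1.0 - tf.reduce_mean(tf.image.ssim(y_true, y_pred, max_val=1.0))

def psnr_metric(y_true, y_pred):
    return tf.reduce_mean(tf.image.psnr(y_true, y_pred, max_val=1.0))

autoencoder.compile(optimizer="adam", loss=ssim_loss, metrics=[psnr_metric])
```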
Admittedly, using times reported in another paper is not the best basis for comparison. The model's encode and decode times should be measured separately, and the encode and decode times of JPEG and WebP should be recorded at comparable compression ratios. This would provide a more accurate comparison between all of the algorithms.
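A rough sketch of such a benchmark, assuming the autoencoder has been split into separate encoder and decoder models and using Pillow for the JPEG/WebP round trips (the quality setting and run count are also assumptions):

```python
# Rough benchmark sketch; `encoder`/`decoder` (a split of the autoencoder),
# the quality setting, and the run count are all assumptions.
import io
import time

import numpy as np
from PIL import Image

def time_call(fn, runs=100):
    """Average wall-clock time of fn() in milliseconds."""
    fn()  # warm-up call (builds graphs, fills caches)
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs * 1000

frame = np.random.rand(1, 540, 960, 3).astype("float32")  # stand-in frame

# Neural encode and decode, timed separately.
latent = encoder.predict(frame, verbose=0)
print("model encode:", time_call(lambda: encoder.predict(frame, verbose=0)), "ms")
print("model decode:", time_call(lambda: decoder.predict(latent, verbose=0)), "ms")

# JPEG/WebP encode and decode via Pillow at a roughly comparable file size.
img = Image.fromarray((frame[0] * 255).astype("uint8"))

def classic_encode(fmt, quality=80):
    buf = io.BytesIO()
    img.save(buf, fmt, quality=quality)
    return buf

def classic_decode(buf):
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB"))

for fmt in ("JPEG", "WEBP"):
    buf = classic_encode(fmt)
    print(fmt, "encode:", time_call(lambda: classic_encode(fmt)), "ms")
    print(fmt, "decode:", time_call(lambda: classic_decode(buf)), "ms")
```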
Once a better model has been created, it would be ideal to integrate it into an application. This would allow measuring user preference and in-application performance.
For users of wireless VR solutions, an AI compression algorithm would be useful. I attempted to create such a model using an autoencoder-based CNN. While similar approaches exist, none are application-specific or specific to wireless VR streaming. My model was fast, but not as good at compression or as visually accurate as JPEG or WebP at comparable compression ratios. The model should be improved in several ways before it is put into any application.