```bash
# Clone the repository
git clone https://github.com/Agnar22/FindWaldo.git

# Navigate into the project folder
cd FindWaldo

# Install the required libraries
python3.10 -m pip install -r requirements.txt

# If everything went well, you should now be able to run the code
python3.10 Main.py
```
The ultimate goal of this project was to build an AI that is able to mark Waldo in the images.
The naive approach to this problem is supervised learning with conv-nets: you label a bunch of "finding Waldo" images, split the data into a training and a test set, etc. The problem with this approach is that neural networks work best with large amounts of data, and there is only a limited number of "finding Waldo" images. Additionally, the time required to label all those images would be substantial. Thus this does not look like a feasible solution.
Despite these problems, the approach taken in this repo is quite similar to the supervised learning approach described above. However, the data labeling and the construction of the neural network were carried out differently, and exactly how this was done is what I am going to explain next.
Data labeling
First and foremost, Waldo was located and removed from 59 "finding Waldo" images. Then, by carefully extracting his head, you can easily generate a lot of fake "finding Waldo" images.
Figure 1: Left) Classic Waldo with the stripy shirt and blue jeans. Right) Waldo with extra gear
The idea is that there is always one recurring element in the "finding Waldo" images: they all contain Waldo's head! This might seem obvious, but the crucial part is that the rest of him is not always shown in the images, so it is not useful to mark his entire body. Marking his entire body could also make the neural network overfit, because in the images where you see his entire body, he is often not wearing the same clothes, as shown in Figure 1.
To avoid overfitting, his head was randomly tilted, scaled, and placed on a background. By leaving half of the images without his head, we now have a method of generating a large dataset of labeled "finding Waldo" images without too much trouble.
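The generation step described above can be sketched in a few lines with Pillow. This is a minimal illustration, not the repo's actual code: the function name, the 12-24 pixel scale range, and the ±20° tilt range are assumptions; the head crop is assumed to be an RGBA image so its alpha channel can be used as a paste mask.

```python
# Sketch of generating a labeled 64x64 sample: cut a random patch from a
# cleared background image and, for positive samples, paste a randomly
# tilted and scaled Waldo head onto it. (Parameters are illustrative.)
import random
from PIL import Image

def make_sample(background: Image.Image, head: Image.Image, with_waldo: bool):
    """Return a 64x64 patch and its label (1 = contains Waldo's head)."""
    x = random.randint(0, background.width - 64)
    y = random.randint(0, background.height - 64)
    patch = background.crop((x, y, x + 64, y + 64))
    if with_waldo:
        size = random.randint(12, 24)                # random scale (assumed range)
        angle = random.uniform(-20, 20)              # random tilt (assumed range)
        h = head.resize((size, size)).rotate(angle, expand=True)
        px = random.randint(0, 64 - h.width)
        py = random.randint(0, 64 - h.height)
        patch.paste(h, (px, py), h)                  # alpha channel as paste mask
    return patch, int(with_waldo)
```

Generating equal numbers of samples with `with_waldo=True` and `with_waldo=False` yields the balanced labeled dataset described above.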
Figure 2: depiction of a typical convnet architecture
The Neural Network
The problem with a standard convolutional neural network (Figure 2), in this context, is that the input must be of a fixed size. This poses a problem for us because our "finding Waldo" images are of different sizes. To solve this, the fully connected layers were converted to convolutional layers. Say whaaat!? This is all explained in this article from Stanford. The advantage of this is that the network now accepts all images with dimensions of 64x64 pixels or larger. This allows us to train on 64x64 images and then scale up to larger images later on, when we want the agent to really find Waldo. Thus the task of finding Waldo has now been reduced to a binary classification task for images. Whew.
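The core of the trick can be verified in a few lines of NumPy: a dense layer applied to a flattened HxWxC feature map computes exactly the same thing as an HxW convolution kernel per output unit, evaluated at a single position. This is a stand-alone demonstration with made-up sizes (a 4x4x3 feature map and 8 output units), not the network from this repo:

```python
# Demonstrate that "fully connected" equals "convolution with a kernel the
# size of the whole feature map": same weights, same result.
import numpy as np

rng = np.random.default_rng(0)

feat = rng.random((4, 4, 3))            # a 4x4x3 feature map
W = rng.random((4 * 4 * 3, 8))          # dense layer: 48 inputs -> 8 units

# Dense layer on the flattened feature map.
dense_out = feat.reshape(-1) @ W        # shape (8,)

# The same weights, viewed as eight 4x4x3 convolution kernels applied
# at the single valid position of a 4x4 input.
kernels = W.T.reshape(8, 4, 4, 3)
conv_out = np.array([(feat * k).sum() for k in kernels])

assert np.allclose(dense_out, conv_out)
```

Because the converted layer is now a convolution, it simply slides over larger inputs, which is what lets the trained 64x64 classifier produce a prediction map over a full "finding Waldo" image.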
When measuring training and testing results it is important to have a clear boundary between what is considered training and testing data. The raw images were divided into two groups: one group was used to generate the training data (extracting Waldo heads and pasting them on random 64x64 cutouts of the original images, as described under Data labeling), while the other was used to demonstrate the accuracy of the agent. This separation is vital, as a leakage of the same Waldo heads from training to testing data would give a false sense of accuracy.
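The key point is that the split happens at the level of whole raw images, before any heads are extracted or patches are cut. A minimal sketch of such a split (the 80/20 ratio and function name are assumptions, not taken from the repo):

```python
# Split whole raw images into two disjoint groups so that no Waldo head
# extracted from a training image can ever appear in a test image.
import random

def split_images(paths, train_frac=0.8, seed=0):
    """Return (train_paths, test_paths), a disjoint image-level split."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = list(paths)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

Patch generation then runs only over `train_paths`, which is what rules out the head-leakage problem described above.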
Here are the predictions on some of the raw images that were not used to generate training data:
Figure 3: the agent is able to find Waldo in this image
Figure 4: here it actually finds Walda; Waldo is down to the left.
Figure 5: several persons are marked here: Waldo, Walda, and some of the kids.
This is not a bug, it's a feature!
Figure 6: an image with lots of Waldo look-alikes and the corresponding heatmap from the agent.
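Turning a heatmap like the one in Figure 6 into a mark on the image comes down to mapping the hottest heatmap cell back to pixel coordinates. A hypothetical sketch (the stride of 4 assumes two 2x2 pooling layers in the network; the repo's actual architecture may differ):

```python
# Map the highest-scoring heatmap cell back to a 64x64 pixel box in the
# original image. The stride depends on the network's pooling layers.
import numpy as np

def locate_waldo(heatmap: np.ndarray, stride: int = 4, window: int = 64):
    """Return (left, top, right, bottom) of the best-scoring 64x64 window."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x, y = ix * stride, iy * stride
    return x, y, x + window, y + window
```

Drawing this box on the raw image gives marked predictions like those in Figures 3-5; keeping every cell above a threshold instead of only the maximum gives the multi-person markings seen in Figure 5.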
As you can see, the agent is able to find Waldo, or at least something that resembles him, so I would call it a success.
- To get a real grip on convolutional neural networks, I recommend this Medium article.
- I also recommend reading the article from Stanford that I referred to earlier in this README.
This project is licensed under the MIT License.