# Mario Kart DS Reinforcement Learning Program Summary

This project builds a DRL agent that learns to play the Figure-8 Circuit course of Mario Kart DS (chosen partly out of childhood nostalgia, and partly because its gentle turns and intersecting bridges make the agent use both screens).

It uses a DDQN network, gets image input by taking screenshots of the approximate area of the DS emulator (run fullscreen), and is rewarded based on a Lua script that reads certain RAM addresses (speed, angle, and checkpoint) and determines speed and direction from them.
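As a rough illustration of the screenshot step (the actual capture region and preprocessing are defined in the repository's Python code; the bounding box and frame size below are placeholders, not the project's values):

```python
# Minimal sketch of grabbing the emulator region, assuming Pillow and NumPy.
# EMU_BBOX is a placeholder; adjust it to wherever your fullscreen DeSmuME window sits.
from PIL import ImageGrab
import numpy as np

EMU_BBOX = (0, 0, 768, 1152)  # (left, top, right, bottom) covering both DS screens

def grab_frame(size=(96, 144)):
    """Capture the emulator area, downscale, and return a normalized grayscale array."""
    img = ImageGrab.grab(bbox=EMU_BBOX).convert("L").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0
```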

It performs fairly well, reaching the final lap around 25% of the time by the 800th episode (out of 2000).

Training has not yet been fully completed, but judging from the scores around the 800th episode, it looks like it will perform well.

Jump to:

- Prerequisite programs
- How to make it work
- How it works
- Sample Results
- How I made it

## Required Programs and Links to Get Them

## How to Get It Working

(This guide covers getting it to work on a Windows computer. The process on Mac or Linux is most likely the same, but it could differ slightly.)

  1. Put DeSmuME and the Lua script compiler in the same directory (I put both of them in a folder on my Desktop).
  2. Launch DeSmuME, go to File (in the top left, like most programs), choose “Open ROM”, and select the zipped folder containing your Mario Kart DS ROM. Select Single-Player Mode on the game’s title screen.
  3. Back in DeSmuME, go to Tools, then Lua Scripting, and a window should pop up. Click “Browse” and select the Lua script file you downloaded from this repository (the script file can live in any directory; only the compiler’s location matters). If it is not working, consider building Lua 5.1 on your PC (lua-users.org/wiki/BuildingLuaInWindowsForNewbies).
  4. Start the Python code in PyCharm (what I used) or any other Python IDE that suits you. Make sure all libraries the code uses are installed, then wait for it to display the summary of both Q-networks (see the sketch just after these steps).
  5. Go back to Mario Kart and select Vs. mode. You can pick any character you want (I went with R.O.B. because he is a heavy robot with a high speed stat, mostly because he’s a robot; he has to be unlocked by beating Special Cup on Mirror Mode yourself, so if that’s not your style, you can pick from the 8 other characters, including Toad, Yoshi, Luigi, and Donkey Kong). On the next screen after character selection, set it to 150cc (to make training quicker) and easy difficulty (to make training more forgiving), then press “Okay”.
  6. Once you get the list of courses, pick the Mushroom icon, select Figure-8 Circuit, then press “Okay”. From this point on you don’t need to do anything; you can just watch it train! Depending on your computer’s resources, you might also want to run the PreventTimeOut program in the repository. I picked Figure-8 Circuit because it has no sharp turns and some fun bridges where the track intersects itself, which tests whether the network could truly learn any Mario Kart track (each track would just be trained separately with a different Lua script file for its angle detection).
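For reference, the “summary of both the Q-networks” in step 4 is the online and target networks printing their layer summaries before the race starts. The sketch below is illustrative only; it assumes TensorFlow/Keras, and the layer sizes and input shape are placeholders rather than the repository’s actual hyperparameters.

```python
# Illustrative only: the real architecture lives in the repository's Python code.
# Assumes TensorFlow/Keras; input shape and layer sizes are placeholders.
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH = 6, 96, 144   # stacked screenshots fed to the network
NUM_ACTIONS = 5                          # discrete keyboard actions, as described below

def build_q_network():
    return models.Sequential([
        layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, 1)),
        layers.Conv3D(16, 3, padding="same", activation="relu"),
        layers.MaxPool3D(2),
        layers.Conv3D(32, 3, padding="same", activation="relu"),
        layers.MaxPool3D(2),
        layers.Flatten(),
        layers.Dense(16, activation="relu"),
        layers.Dense(NUM_ACTIONS, activation="linear"),  # one Q-value per action
    ])

online_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(online_net.get_weights())

online_net.summary()   # these two printouts are what step 4 waits for
target_net.summary()
```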

Back to Top

## How it Works

(Also detailed in the code files; this is a compilation of all the explanations I give. More in-depth explanations for each program can be found in their respective code files.)

The network is a convolutional neural network that takes in 6 frames of screenshots covering roughly the area of the DS frame in DeSmuME (both the top and bottom screens; it originally used just the bottom screen, but was reworked to use both out of concern that it would not know what to do where parts of the track intersect on the minimap). From those frames it chooses one of 5 actions (turning left, turning right, going straight, and the former three with the addition of using an item), all performed through keyboard presses. It uses a Conv3D architecture (Conv2D was too computationally intensive and did not learn much) and is rewarded based on direction and speed. When I first made the program, I also gave it the ability to drift, but I scrapped that very early because it made the bot too chaotic. I did add it back into the final code as an option you can enable if you want to see what happens.

The Lua script is the main rewarder of the bot. It reads the game’s memory (something the Python code cannot do) to get speed and direction (defined relative to the checkpoint most recently passed) and draws a box with particular G and B values (RGB color) in the top-left of the bottom screen. The Python program reads the RGB values of that box and turns them into the reward: the higher the speed, the higher the absolute value of the reward, while direction determines the sign (wrong direction gives a negative reward, right direction a positive one).

The network follows a DDQN training algorithm, with a few extra things I had to put in. It only remembers the last 10 races, because training would slow down massively otherwise (my GPU is bad), and it forgets the last two sets of frames of each race, which fall between the moment the race actually finishes and the moment the program registers that it has finished. It also forgets fairly quickly any races where it thinks it is racing even though it isn’t on the race screen; even so, enough time spent “racing” outside an actual race can still seriously hinder or completely prevent the program from learning the right thing. Mario Kart DS’s DNF system was very helpful here, since it tells the bot the race is over without the bot actually having to finish, so it doesn’t spend an eternity in one race. The network only starts training once the race finishes; when it finishes training on that batch for the episode, it moves on to the next race on its own through a series of keyboard presses.
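To make the reward hand-off concrete, here is a hedged sketch of the Python side reading that indicator box. The actual channel encoding and box coordinates are defined in the Lua script; the mapping below (green for speed magnitude, blue for direction) is an assumption for illustration only.

```python
# Conceptual sketch, not the repository's actual decoding. Assumes the captured
# bottom-screen frame is an H x W x 3 RGB array and that the Lua script's box sits
# in the top-left corner; the channel meanings here are illustrative assumptions.
import numpy as np

def decode_reward(frame_rgb, box=(slice(0, 8), slice(0, 8))):
    """Turn the Lua-drawn indicator box into a scalar reward."""
    patch = frame_rgb[box[0], box[1], :].astype(np.float32)
    g = patch[..., 1].mean()   # assumed: speed encoded in the green channel
    b = patch[..., 2].mean()   # assumed: direction encoded in the blue channel

    speed_magnitude = g / 255.0                   # faster -> larger |reward|
    direction_sign = 1.0 if b > 127 else -1.0     # right way -> positive, wrong way -> negative
    return direction_sign * speed_magnitude
```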

Back to Top

## Expected Results

The training video shown is from episode 804, where the bot reached the final lap. This reflects its progress: in terms of track progress, by around the 600th–700th episode (out of 2000) your bot should be consistently clearing the first lap before it gets DNFed. It takes a while to learn, but don’t fret! It does eventually learn.

Some example results at the 200th and 800th episodes (out of 2000):

Back to Top

## Design Process

The DDQN Mario Kart network went through multiple major iterations in both network architecture and reward function, so I felt it right to highlight them here:

The first iteration of the Figure-8 Circuit network was a two-dimensional DQN convolutional network that predicted the reward from a single frame alone. As expected, this didn’t work, but I used it mainly to test that everything else in my code was working and that the network would actually converge to something, which it did (just turning the same direction the whole time). The reward was based solely on place, which I ended up ditching later on. Once I knew everything else worked as intended, I started delving deeper into improving the network and the reward function.

The second iteration was a Conv2D DDQN that took a series of 4 frames as input instead of a single frame. Again, this didn’t work, and at this point I was considering reading the game’s memory for direction checking to improve the reward function, which was still based solely on place (8 − place) and wasn’t giving the network enough data to train properly (in this iteration I also deleted all data from stretches where it was stuck in 8th place and didn’t do anything, which I hadn’t done in the first iteration). The first Dense layer after the Flatten was very small to save on resources (something I changed when transitioning to the 3D convolutional network). As before, it converged to turning one direction for the entire race, with no sign of improvement along the way.

At this point, I knew I needed some way of detecting whether the bot was going the right direction. I narrowed the options down to reading the game’s memory or building a separate convolutional network to detect whether the bot was going the right way. I chose to read the game’s memory, since I thought it would be computationally cheaper and it was definitely the easier option to implement.

The third iteration was a Conv2D DDQN that took a series of 4 frames, with the reward based on both place and passing checkpoints. The big problem with both of these rewards was that they were sparse: checkpoint rewards only occurred when a checkpoint was passed (positive for passing it the right way, negative for the wrong way, making it an indicator of whether the bot was going in the wrong direction). Minor changes during this iteration included adjusting what the bot remembered after each race and experimenting with the batch size to save resources while still letting it “learn”. It still converged the same way as the previous two iterations.

The fourth iteration was a Conv3D DDQN that took a series of 6 frames, with the reward based on place, checkpoint, and speed (the individual rewards summed into one cumulative reward). I changed the frame count to 6 because I discovered that with a Conv3D network I could also use MaxPool3D, and the resources this freed up let me increase the batch size and frame count and enlarge the Dense layer after the Flatten (from 4 to 16 units). During this iteration I also changed the reward function by excluding individual components (I had scratch files in my PyCharm project for excluding direction, excluding speed, and excluding place). Excluding place seemed to work best, which made sense, since place was the most confusing reward. However, it still eventually converged to turning one direction the entire time, although there were some signs of slight improvement in between.

The fifth and final iteration (so far) is a Conv3D DDQN that takes a series of 5 frames and rewards based on speed and direction. This reward differs from the previous speed-and-direction reward in that the direction reward is dynamic: it is based on the kart’s angle relative to the angle at which you would pass the most recent checkpoint going straight through it (see the Lua code for a better explanation; this is measured from the forward direction, and there is never a large angle change between two consecutive checkpoints, since checkpoints become more frequent the more curved the road is). Additionally, the direction and speed rewards are combined so that direction determines the sign of the reward and speed determines its magnitude. This worked wonders compared to all the other iterations, producing the scores in the documentation and the videos in the training video folder. The remaining changes to the code were for efficiency and for the ability to save weights midway through training, not to the network or the reward function itself.
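As a rough illustration of the fifth iteration’s reward idea (the real computation happens in the Lua script that reads RAM; the 90-degree threshold and the normalization below are placeholder choices, not the project’s tuned values):

```python
# Hedged sketch of the sign-from-direction, magnitude-from-speed reward,
# written in Python for readability even though the project computes it in Lua.
def reward_from_ram(speed, kart_angle_deg, checkpoint_angle_deg, max_speed=1.0):
    """Direction sets the sign of the reward, speed sets its magnitude."""
    # Smallest signed difference between the kart's heading and the checkpoint's forward angle.
    diff = (kart_angle_deg - checkpoint_angle_deg + 180.0) % 360.0 - 180.0

    going_right_way = abs(diff) < 90.0     # within a quarter turn of "forward" (assumed threshold)
    magnitude = abs(speed) / max_speed     # higher speed -> larger |reward|
    return magnitude if going_right_way else -magnitude
```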

Back to Top