Q-Learning Parallels to Behavior: The Quest for an Accurate State Representation

Abstract:

This paper is guided by the question “To what extent does Q-learning imitate reinforcement learning in rats?” Using computational modeling validated against an externally sourced rat data set, with cross-validated parameter tuning via Bayesian optimization, it highlights the importance of an accurate state representation when training a model. The task on which the Q-learning model is trained and evaluated is a double-T maze with reversing goals, created by the Blair Lab at UCLA. The experimental process follows three steps: building a digital environment of the maze, training a Q-learning agent in it, and optimizing the agent's hyperparameters with Optuna's Bayesian optimizer. The resulting observations underline the importance of choosing an optimal state representation to improve the interpretation of the learning mechanisms underlying biological experiments.

Rat Data Collection:

Rat data was supplied by the UCLA Department of Psychology. The data used in this study was stored as a CSV (comma-separated values) file with columns for session, trial, state, turn (action), start arm, and reward phase. The experiment itself is a double-T maze task (Fig 1A). The maze is shaped as a plus enclosed by a square perimeter, with possible reward locations at each corner that protrude slightly and diagonally from the perimeter. Door configurations (Appendix 1) ensure the rat must make either a left or right turn at every intersection. The rat starts in either the north or south start arm (Fig 1B, C), facing the door farther from the center so that its feet do not catch in the door. From there, it must find its way to the goal corner (the northeast corner), where it receives sweetened condensed milk as a reward. The doors then reconfigure to guide the rat to the next start location, which was selected pseudo-randomly (the same arm was used no more than 3 consecutive times).

Sessions ended after 40 minutes or once the rat had completed 32 trials, whichever came first. One trial consists of a start, reaching the goal, and returning to the next start arm, after which the next trial begins. Rats were considered proficient after getting 70 percent of the first two turns correct in two consecutive sessions. A further experiment was conducted on the rats afterward, which this paper is unconcerned with except for the reversal session, which occurred after three intervening sessions; for example, if the rat was deemed proficient in session 8, then session 12 was a reversal session. A reversal session began with 16 normal trials, and on the 17th trial the sweetened condensed milk was mechanically relocated to the southwest corner.
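As a rough illustration of this data format, the sketch below loads the trial log with pandas; the file name and exact column headers are assumptions, since only the column contents are described above.

```python
import pandas as pd

# Hypothetical file name and column names; the real CSV headers may differ.
trials = pd.read_csv("rat_trials.csv")

# Expected columns: session, trial, state, turn (action), start arm, reward phase.
print(trials.columns.tolist())

# Sanity check against the 32-trial session cap described above.
print(trials.groupby("session")["trial"].max().head())
```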

[Figure 1: the double-T maze (A) and the north and south start configurations (B, C)]

Digital Environment Modeling:

This experiment was modeled in Python. The final environment used in this model is a Markov Decision Process; an earlier version consisted only of turn decisions at intersections, without about-face or forward movement. Looking for biological inspiration, the pretraining from the biological UCLA study was also modeled, but it did not work: pre-seeding Q-values in such a sparse environment made the model extremely sensitive to hyperparameters, and therefore not optimizable. The concluding environment was one in which each action taken in a state maps to a successor state, so choosing an action determines where the agent ends up. The Pygame library was used for a visual representation, which made debugging, manipulation, and tracking easier. States were labeled according to Figure 1A, and actions were mapped accordingly. The rat cannot enter the very center of the maze or the midpoints along its perimeter, which limits the number of actions the rat has to learn per state. Actions were labeled forward, left, right, and about-face, corresponding to the list [0, 1, 2, 3] in the program.
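A minimal sketch of that transition-table style environment is shown below; the state labels and the specific entries are illustrative placeholders, since the real mapping follows the labels in Figure 1A.

```python
# Action encoding described above.
FORWARD, LEFT, RIGHT, ABOUT_FACE = 0, 1, 2, 3

# (state, action) -> next state; labels and entries are illustrative only.
TRANSITIONS = {
    ("N_start", FORWARD): "N_intersection",
    ("N_intersection", LEFT): "NW_corner",
    ("N_intersection", RIGHT): "NE_corner",   # goal corner
    ("N_intersection", ABOUT_FACE): "N_start",
    # ... remaining (state, action) pairs follow the same pattern
}

def step(state, action):
    """Return the successor state, or None if the maze doors block the move."""
    return TRANSITIONS.get((state, action))
```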

A reward was given at the goal corner, and every other state was given a punishment reward of -1. As in real life, the rat was penalized simply for moving, to encourage efficiency. This reward structure, however, already reveals a large part of the argument for an accurate state representation.
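A sketch of the corresponding reward function and a standard tabular Q-learning update follows; the goal reward magnitude and the default alpha and gamma values are assumptions (these are among the hyperparameters later tuned with Optuna), and the state label is illustrative.

```python
from collections import defaultdict

GOAL_STATE = "NE_corner"   # goal corner; label is illustrative
GOAL_REWARD = 1.0          # magnitude not specified above; assumed here

def reward(next_state):
    # Every move costs -1 unless it reaches the goal corner.
    return GOAL_REWARD if next_state == GOAL_STATE else -1.0

Q = defaultdict(float)     # Q[(state, action)], initialized to 0

def q_update(state, action, next_state, actions, alpha=0.1, gamma=0.9):
    """Tabular Q-learning update toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward(next_state) + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```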

Consider, for example, a 45-degree turn. While a rat is free to make one, place cells would generally encode an intersection as a choice point with three options: moving forward, turning left, and turning right. This study does not claim to have found the best state representation; in hindsight, further state masking would have been preferable. Instead, it stresses the importance of an accurate state representation in this context.
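One way to read the state masking mentioned above is as restricting the actions available in each state to the biologically plausible choice-point options; the sketch below illustrates that reading, again with illustrative state labels.

```python
# Same action encoding as above.
FORWARD, LEFT, RIGHT, ABOUT_FACE = 0, 1, 2, 3

# Per-state action masks: at an intersection only forward/left/right are offered,
# mirroring how place cells encode choice points. Labels are illustrative.
ACTION_MASK = {
    "N_intersection": [FORWARD, LEFT, RIGHT],
    "N_start": [FORWARD],
    # ... remaining states restricted the same way
}

def available_actions(state):
    # Fall back to all four actions if a state has no explicit mask.
    return ACTION_MASK.get(state, [FORWARD, LEFT, RIGHT, ABOUT_FACE])
```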

Optimization:

Rather than measuring task correctness, the Bayesian optimizer, run with Optuna, maximized the cumulative percentage of actions taken in a given state that were identical to the rat's. For example, if the rat made an about-face on the first turn and the model did as well, and then the rat chose left while the model chose right on the second turn, the score for that trial so far would be 50 percent.
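A minimal sketch of that scoring rule, using the action encoding above; the function name is a placeholder.

```python
def match_score(rat_actions, model_actions):
    """Cumulative percentage of model actions identical to the rat's,
    compared turn by turn."""
    matches = sum(r == m for r, m in zip(rat_actions, model_actions))
    return 100.0 * matches / len(rat_actions)

# Example from the text: rat [about-face, left] vs. model [about-face, right] -> 50.0
print(match_score([3, 1], [3, 2]))
```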

Ten rats were used for training data and two for validation. Acquisition and reversal trials were used in one optimization run, and acquisition trials alone in another.
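The sketch below shows how such a study might be set up with Optuna, whose default TPE sampler performs the Bayesian optimization; the hyperparameter names, search ranges, trial count, and the train_and_score stand-in are all assumptions.

```python
import optuna

def train_and_score(alpha, gamma, epsilon):
    # Stand-in for the real loop: train the Q-learning agent with these
    # hyperparameters on the ten training rats' trials and return the
    # cumulative action-match percentage. Returns a dummy value here.
    return 0.0

def objective(trial):
    # Hyperparameter names and ranges are illustrative assumptions.
    alpha = trial.suggest_float("alpha", 0.01, 1.0)
    gamma = trial.suggest_float("gamma", 0.5, 0.999)
    epsilon = trial.suggest_float("epsilon", 0.01, 0.5)
    return train_and_score(alpha, gamma, epsilon)

# Maximize the action-match percentage; the two held-out rats would then be
# used to validate the best parameters.
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```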

