A Python implementation of Q-Learning algorithm to train an AI agent that plays Tic-Tac-Toe. This project demonstrates the application of reinforcement learning techniques to a classic game environment.
Author: Nick Vaughn (@EngineerNV)
- Project Overview
- Objectives
- Tech Stack
- How It Works
- Project Structure
- Installation & Setup
- Usage
- Performance Metrics
- Q-Learning Algorithm
- Future Improvements
This repository showcases a reinforcement learning implementation where an AI agent learns to play Tic-Tac-Toe by playing repeated games against a random opponent. The project serves as an educational tutorial demonstrating how to apply Q-Learning algorithms to games with multiple states that cannot be easily cataloged with simple conditional logic.
The AI learns optimal gameplay strategies by exploring different board states and actions, gradually improving its performance through trial and error over thousands of training episodes.
- Demonstrate Q-Learning Implementation: Apply reinforcement learning theory to a practical game scenario
- State Space Management: Handle complex game states using dictionary-based Q-table storage
- Educational Resource: Provide a clear, understandable example of RL techniques for learning purposes
- Performance Optimization: Train an AI agent that can consistently win or tie against random opponents
- Language: Python 3.x
- Core Libraries:
  - random - For exploration during training and random opponent moves
  - json - For persisting learned Q-values between sessions
  - sys - For command-line argument parsing
- Algorithm: Q-Learning (Reinforcement Learning)
- Data Structure: Dictionary-based Q-table for state-action value storage
No external dependencies required - uses only the Python standard library!
The Tic-Tac-Toe board is represented as a 3x3 grid with the following conventions:
- 0 = Empty cell
- 1 = Player O (Random opponent)
- 2 = Player X (AI agent)
Board positions are indexed 0-8 in linear format:
0 | 1 | 2
-----------
3 | 4 | 5
-----------
6 | 7 | 8
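For instance, the (row, col) pair entered in play mode maps onto this linear index with row * 3 + col (a minimal illustration; the helper name is hypothetical, not taken from the project code):

```python
# Map a (row, col) choice on the 3x3 board to the linear index 0-8.
def to_index(row, col):
    return row * 3 + col

print(to_index(1, 1))  # 4 -> the center cell
```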
- State Representation: Each unique board configuration is converted to a string key (e.g., "012000000"); see the sketch after this list
- Action Selection: For each state, the agent can take actions corresponding to available empty cells
- Exploration vs Exploitation: During training, there's a 25% chance of making random moves (exploration) vs choosing the best known move (exploitation)
- Learning Updates: After each move, Q-values are updated based on rewards and future state predictions
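A minimal sketch of the first two ideas, assuming the board is stored as a flat list of nine integers (the function names here are illustrative, not necessarily those used in ticTac.py):

```python
# Board as a flat list of 9 ints: 0 = empty, 1 = O (opponent), 2 = X (AI).
board = [0, 1, 2, 0, 0, 0, 0, 0, 0]

def board_to_state(board):
    """Encode the board as a string key, e.g. '012000000'."""
    return "".join(str(cell) for cell in board)

def available_actions(board):
    """Legal actions are the indices of the empty cells."""
    return [i for i, cell in enumerate(board) if cell == 0]

print(board_to_state(board))     # '012000000'
print(available_actions(board))  # [0, 3, 4, 5, 6, 7, 8]
```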
For each training episode (sketched in Python after this list):
1. Clear the board
2. Random opponent makes first move
3. While game not over:
- AI observes current state
- AI selects action (75% Q-learning, 25% random exploration)
- Opponent makes random move
- Update Q-table based on reward and next state
4. Repeat for thousands of episodes
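The loop above might look roughly like the following sketch, where game_over, board_to_state, available_actions, reward, and update_q are hypothetical helpers standing in for the project's actual functions:

```python
import random

EXPLORATION_RATE = 0.25  # 25% random moves during training

def train_episode(q_table, alpha, gamma):
    """One training game against a random opponent (illustrative sketch)."""
    board = [0] * 9                               # 1. clear the board
    board[random.randrange(9)] = 1                # 2. random opponent moves first
    while not game_over(board):                   # 3. play until the game ends
        state = board_to_state(board)
        actions = available_actions(board)
        if random.random() < EXPLORATION_RATE:    # explore: random legal move
            action = random.choice(actions)
        else:                                     # exploit: best known move so far
            action = max(actions, key=lambda a: q_table.get(state, [0.0] * 9)[a])
        board[action] = 2                         # AI plays X
        if not game_over(board):                  # opponent replies at random
            board[random.choice(available_actions(board))] = 1
        # update the Q-table from the reward and the resulting state
        update_q(q_table, state, action, reward(board), board_to_state(board),
                 alpha, gamma)
```

Step 4 then simply repeats train_episode for the requested number of episodes.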
TicTacToe_RL/
│
├── main.py              # Main entry point - CLI interface for training/testing/playing
├── ticTac.py            # Tic-Tac-Toe game logic and mechanics
├── qLearning.py         # Q-Learning algorithm implementation
├── table.json           # Saved Q-table (trained model weights)
├── winStatsTicTac.PNG   # Performance visualization
├── .gitignore           # Git ignore patterns
└── README.md            # This file
Command-line interface providing three modes:
- Play Mode: Human vs trained AI
- Train Mode: Train the AI agent
- Test Mode: Evaluate AI performance over multiple games
Game engine containing:
- Board state management
- Win condition checking
- Move validation
- Reward calculation for reinforcement learning
- Board-to-state conversion utilities
Q-Learning agent implementation:
- Q-table initialization and management
- Learning algorithm (Q-value updates)
- Action selection (exploration/exploitation)
- Model persistence (save/load Q-table)
- Training and testing loops
Serialized Q-table containing learned state-action values from training. This file stores the "knowledge" the AI has acquired about which moves are good in different board states.
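Because the Q-table is a plain dictionary, persisting it only needs the standard json module; the schema assumed below (each state string mapped to a list of nine action values) is an illustration, not necessarily the exact format of the shipped file:

```python
import json

# Save the learned Q-table: {state string: list of 9 action values} (assumed schema).
with open("table.json", "w") as f:
    json.dump(q_table, f)

# Load it back before playing or testing.
with open("table.json") as f:
    q_table = json.load(f)
```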
- Python 3.x (tested with Python 3.12.3)
- Clone the repository:
  git clone https://github.com/EngineerNV/TicTacToe_RL.git
  cd TicTacToe_RL
- No additional dependencies needed! The project uses only the Python standard library.
Test your skills against the trained AI:
python3 main.py -p table.json
You play as 'O' and go first. Enter row (0-2) and column (0-2) when prompted.
Example gameplay:
Choose Row 0-2: 1
Choose Col 0-2: 1
Train the AI from scratch or improve existing weights:
python3 main.py -train table.json <episodes> <alpha> <gamma>
Parameters:
- table.json: Filename to save trained weights
- episodes: Number of training games (e.g., 10000)
- alpha: Learning rate (0-1), recommended: 0.6
- gamma: Discount factor (0-1), recommended: 0.8
Example:
python3 main.py -train table.json 10000 0.6 0.8
Training Tips:
- Start with 10,000 episodes for basic competency
- Use 50,000+ episodes for optimal performance
- Higher gamma (closer to 1.0) makes the AI value future rewards more
- Higher alpha (closer to 1.0) makes learning faster but potentially less stable
Evaluate the trained AI over multiple games:
python3 main.py -test table.json <num_games>
Example:
python3 main.py -test table.json 100
Output:
Tic Tac Stats ---> 100 games
Win percent: 56.0%
Tie percent: 44.0%
Lose percent: 0.0%
The trained AI (using the included table.json) demonstrates strong performance against random opponents:
- Win Rate: ~56%
- Tie Rate: ~44%
- Loss Rate: ~0%
- Undefeated: 100% (every game ends in a win or a tie)
These metrics show that the AI has learned to:
- Never lose against random play
- Win more than half the time when possible
- Force ties in situations where winning isn't achievable
Q-Learning is a model-free reinforcement learning algorithm that learns the value of actions in different states. The "Q" stands for quality, representing how good an action is in a given state.
Q(s, a) = Q(s, a) × (1 - α) + α × [R(s', a) + γ × max(Q(s', a'))]
Where:
- Q(s, a) = Quality value of action a in state s
- α (alpha) = Learning rate (0.6 in this implementation)
- γ (gamma) = Discount factor (0.8 in this implementation)
- R(s', a) = Reward received after taking action a
- s' = Next state after action a
- max(Q(s', a')) = Maximum Q-value over all possible actions in the next state
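Written as Python, the update might look like this minimal sketch (the signature is illustrative, not necessarily the one used in qLearning.py):

```python
def update_q(q_table, state, action, reward, next_state, alpha=0.6, gamma=0.8):
    """Apply the Q-Learning update rule shown above."""
    values = q_table.setdefault(state, [0.0] * 9)       # 9 action values per state
    future = max(q_table.get(next_state, [0.0] * 9))    # max Q over next-state actions
    values[action] = values[action] * (1 - alpha) + alpha * (reward + gamma * future)
```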
The reward system incentivizes winning and penalizes losing:
- Win: +10 points
- Tie: +5 points
- Loss: -40 points
- Non-terminal state: 0 points
This asymmetric reward structure heavily penalizes losses, encouraging conservative play.
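As a sketch, the rewards could be looked up from a simple mapping (the outcome labels here are hypothetical, not the project's exact representation):

```python
# Reward for the AI at each step, keyed by game outcome (illustrative labels).
REWARDS = {"win": 10, "tie": 5, "loss": -40, "ongoing": 0}

def reward_for(outcome):
    return REWARDS[outcome]
```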
- State Space: Each unique board configuration is a state (maximum 3^9 = 19,683 possible states)
- Action Space: 9 possible actions per state (one for each board cell)
- Exploration Rate: 25% chance of random action during training
- Q-Table Storage: Dictionary mapping state strings to action value arrays
Potential enhancements to expand this project:
- Advanced Opponents
  - Train against minimax algorithm
  - Self-play training (AI vs AI)
  - Human gameplay data integration
- Algorithm Enhancements
  - Implement Deep Q-Learning (DQN) with neural networks
  - Add experience replay for better learning
  - Implement double Q-learning to reduce overestimation
- Features & Analysis
  - Visualization of Q-values during gameplay
  - Board state heatmaps showing learned preferences
  - Training progress graphs
  - Interactive web interface
- Generalization
  - Extend to other board games (Connect Four, Checkers)
  - Variable board sizes
  - Multiple AI difficulty levels
- Performance Optimization
  - Parallel training using multiprocessing
  - Optimized state representation
  - Pruning impossible/redundant states
This project is open source and available for educational purposes.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.
Nick Vaughn - @EngineerNV
Project Link: https://github.com/EngineerNV/TicTacToe_RL
- Inspired by classic reinforcement learning tutorials
- Q-Learning algorithm based on Watkins & Dayan (1992)
- Created as an educational resource for RL enthusiasts
This project demonstrates practical application of reinforcement learning principles in a simple, understandable context. Perfect for portfolios, learning, and teaching RL concepts!