This repository accompanies a blog post that I wrote.
I recreated the AlphaGo / AlphaZero learning paradigm in a tiny Tic-Tac-Toe environment with no human data and no labels. It's powered by raw self-play, Monte Carlo Tree Search, and a neural network learning from scratch how to play the game.
This project was inspired by the beauty of agents learning through trial and error (and because I had never built a solid RL project before).
- Built a self-play RL agent that learns to play Tic-Tac-Toe from scratch
- Combined Monte Carlo Tree Search with a policy + value network
- Trained it in a loop of self-play → replay buffer → network updates (sketched right after this list)
- Evaluated it against random agents, minimax bots, and humans
- Built a CLI game so you can play against it yourself
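Here's a minimal sketch of that outer loop. Everything in it is illustrative: `play_one_game` and `train_step` are hypothetical stand-ins for the project's actual self-play and update code, and the buffer size, game counts, and batch size are placeholders, not the repo's real hyperparameters.

```python
import random
from collections import deque

def play_one_game():
    # Hypothetical stand-in: the real version would play one MCTS-guided
    # self-play game and return (state, policy_target, outcome) tuples.
    return []

def train_step(batch):
    # Hypothetical stand-in: one mini-batch gradient update on the
    # policy + value network.
    pass

buffer = deque(maxlen=10_000)   # replay buffer of training examples

for iteration in range(100):    # outer training loop
    for _ in range(25):         # 1) self-play: generate fresh games
        buffer.extend(play_one_game())
    if buffer:                  # 2) sample a mini-batch and update the net
        train_step(random.sample(list(buffer), min(64, len(buffer))))
```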
| Component | Description |
|---|---|
| MCTS | At every move, the agent runs Monte Carlo Tree Search guided by its neural net (see the selection-rule sketch below the table). |
| Policy + Value Network | Two PyTorch models output move probabilities and a board-value estimate (sketched below the table). |
| Self-Play Loop | The agent plays against itself using MCTS, collects training data, and improves over time. |
| No Human Supervision | Like AlphaZero, it learns only from its own games. |
| Training Signal | Targets come from final outcomes of self-play games and MCTS stats. |
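For concreteness, this is the child-selection rule (PUCT) that AlphaZero-style MCTS implementations typically use; a generic sketch, not necessarily the exact formula or constants in `mcts.py`:

```python
import math

def puct_score(total_value, visits, parent_visits, prior, c_puct=1.5):
    # Exploitation term: mean value of this child so far.
    q = total_value / visits if visits > 0 else 0.0
    # Exploration term: scaled by the network's prior for the move and
    # shrinking as the child is visited. c_puct=1.5 is an illustrative
    # constant, not taken from this repo.
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + u
```

Each simulation descends the tree by picking the child with the highest score, expands a leaf using the network's policy as priors, and backs the network's value estimate up the visited path.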
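And here's a minimal sketch of what a policy + value network over a 3×3 board can look like. The table above says the project uses two separate models; for brevity this sketch uses one shared trunk with two heads, so treat the architecture and layer sizes as assumptions:

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    # Illustrative only: a tiny MLP over the 9 board cells with a policy
    # head (one logit per square) and a value head (scalar in [-1, 1]).
    def __init__(self, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(9, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, 9)
        self.value_head = nn.Sequential(nn.Linear(hidden, 1), nn.Tanh())

    def forward(self, board):  # board: (batch, 9) tensor of {-1, 0, +1}
        h = self.trunk(board)
        return self.policy_head(h), self.value_head(h)

net = PolicyValueNet()
logits, value = net(torch.zeros(1, 9))  # smoke test on an empty board
```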
Here's the first game I played against the bot:

I won this one, but let's trace the steps the model took. I got lucky by going first and taking the strongest square (the center), but on every move after that, the agent knew to block whenever I was one move away from completing a line, and to mirror my move otherwise. This was really cool to see.
Here's the agent after I decided not to start in the middle:

The agent was smart enough to counter every move I made.
```
AlphaTicTacToe/
│
├── README.md
├── requirements.txt      # dependencies (torch, numpy, wandb, etc.)
├── .gitignore            # .pt files, logs, etc.
│
├── alphattt/             # core logic package
│   ├── __init__.py
│   ├── config.py         # hyperparameters and constants
│   ├── tictactoe.py      # environment (board, rules, win logic)
│   ├── mcts.py           # Monte Carlo Tree Search algorithm
│   ├── network.py        # policy + value PyTorch model
│   ├── replay_buffer.py  # replay buffer to store game trajectories for training
│   ├── self_play.py      # self-play loop and data collection
│   ├── trainer.py        # training logic using replay buffer
│   ├── evaluate.py       # evaluate model vs random/minimax/human
│   └── utils.py          # Elo rating, misc helpers
│
├── scripts/
│   ├── train.sh          # train from scratch
│   └── evaluate.sh       # evaluate trained agent
│
├── play_vs_agent.py      # launch CLI game
├── web_demo.py           # (coming soon) Gradio/Streamlit frontend
│
├── notebooks/
│   └── analysis.ipynb    # training curves, MCTS stats, etc.
│
├── wandb/                # W&B logs (auto-generated)
│
└── assets/
    ├── demo.gif
    └── architecture.png
```

Training setup:

- 1000s of self-play games
- MCTS with 25–100 simulations per move
- Replay buffer of game history
- Mini-batch training using PyTorch
- Tracked using wandb
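Assuming the standard AlphaZero recipe (which the "Training Signal" row above suggests), each update minimizes a value-regression term plus a policy cross-entropy term against the MCTS visit distribution. A sketch of that loss, where `mcts_policy` is the normalized visit counts π and `outcome` is the final result z:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, mcts_policy, outcome):
    # Value head: regress the predicted value toward the game outcome
    # z in {-1, 0, +1} from the current player's perspective.
    value_loss = F.mse_loss(value_pred.squeeze(-1), outcome)
    # Policy head: cross-entropy against pi, the normalized MCTS
    # visit counts collected during self-play.
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(mcts_policy * log_probs).sum(dim=-1).mean()
    return value_loss + policy_loss
```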
Installing it is pretty straightforward:
```bash
git clone https://github.com/akhilvreddy/AlphaTicTacToe.git
cd AlphaTicTacToe
pip install -r requirements.txt
```

You can play against the agent I've trained through the CLI. (Hint: you will probably draw or lose if you decide to play second.)

```bash
python play_vs_agent.py
```

Inspired by:

- AlphaGo paper
- AlphaGo Zero blog post
- Countless threads on RL that I've seen on X

