QSnake is a Snake game written in C++ and driven by the Q-learning algorithm. It is a fun and educational project that demonstrates how reinforcement learning, specifically Q-learning, can be applied to a classic game like Snake. The game's AI learns to play Snake by interacting with the environment, receiving rewards for its actions, and improving over time.
Requirements:

- C++11 or later
- make
- A terminal emulator

To install and run the game, follow these steps:

```shell
git clone https://github.com/dildeolupbiten/QSnake.git
cd QSnake
make
./QSnake
```
The game utilizes the Q-Learning algorithm, a type of reinforcement learning. The agent (the snake) takes actions in the environment (the grid), receives rewards (such as points or penalties), and updates its knowledge over time in a Q-table.
```cpp
void train_snake(
    Agent *agent,
    Snake *snake,
    const double learning_rate,
    const double discount_factor,
    const double epsilon,
    const int episodes,
    const int play,
    const double sleep
) {
    double avg_size = 0;
    int sleep_time = sleep ? (int)(1000.0 / sleep) : 0;
    std::cout << CLEAR;
    for (long episode = 0; episode < episodes; episode++) {
        snake->reset();
        if (episode >= episodes - play) { std::cout << CLEAR; }
        while (!snake->done) {
            choose_action(agent, snake, epsilon);
            int reward = get_reward(snake);
            size_t next_key = agent->get_key(snake);
            agent->init_key(next_key);
            update_q_value(
                agent,
                snake,
                next_key,
                reward,
                learning_rate,
                discount_factor
            );
            // During the last `play` episodes, render the grid and slow
            // the loop down so the game is watchable.
            if (play && episode >= episodes - play) {
                snake->print_grid();
                std::this_thread::sleep_for(
                    std::chrono::microseconds(sleep_time)
                );
            }
        }
        if (play && episode >= episodes - play) { continue; }
        // Running average of the snake's final length across episodes.
        avg_size += ((double)snake->body.size() - avg_size) / (episode + 1);
        std::cout << "Episode: " << episode << " Avg. Size: "
                  << avg_size << std::endl;
    }
}
```

There are 4 basic actions in the Snake game. During the exploration phase, the snake takes a random action; the exploration rate is set with the epsilon parameter. By default epsilon is 0.0001, which means that 0.01 % of actions are selected randomly and the remaining 99.99 % are chosen as the action with the highest Q-value.
| Actions | x | y |
|---|---|---|
| UP | -1 | 0 |
| RIGHT | 0 | 1 |
| DOWN | 1 | 0 |
| LEFT | 0 | -1 |
```cpp
void choose_action(Agent *agent, Snake *snake, const double epsilon) {
    if (uniform(0, 1) < epsilon) {
        // Explore: pick one of the 4 actions at random.
        snake->action = randrange(0, 4);
        snake->key = agent->get_key(snake);
        agent->init_key(snake->key);
    } else {
        // Exploit: pick the action with the highest Q-value.
        snake->key = agent->get_key(snake);
        agent->init_key(snake->key);
        snake->action = max_index(agent, snake, snake->key);
    }
}
```

In this program, the state that results from each of the snake's actions is represented with two features, although action-based states could also be represented with more or fewer features. An optimal state representation should capture information about the environment, the snake's body, and the consequences of the actions taken, and it should support the snake's learning to eat food without colliding with its body or the walls.
The direction state is a one-dimensional array representing the types of collisions for each action type.
The collision types could be encoded differently. The important thing here is to distinguish between collision types and assign a number to each collision type in terms of the consequences of the actions.
| Directions | No collision | Target collision | Body Collision | Wall Collision | Dangerous Action |
|---|---|---|---|---|---|
| UP | 0 | 2 | -2 | -2 | -1 |
| RIGHT | 0 | 2 | -2 | -2 | -1 |
| DOWN | 0 | 2 | -2 | -2 | -1 |
| LEFT | 0 | 2 | -2 | -2 | -1 |
```cpp
void Snake::set_directions() {
    for (int i = 0; i < 4; i++) {
        // The cell the head would occupy after taking action i.
        int x = body[0].x + ACTIONS[i].x;
        int y = body[0].y + ACTIONS[i].y;
        if (is_collision(x, y)) { directions[i] = -2; }
        else if (!is_safe_action(x, y)) { directions[i] = -1; }
        else if (x == target.x && y == target.y) { directions[i] = 2; }
        else { directions[i] = 0; }
    }
}
```

A collision-free action is still marked dangerous when the region reachable from the new head position is too small for the snake. This is checked with a flood fill that temporarily frees tail cells, since the tail will have moved on by the time the head arrives:

```cpp
int Snake::flood_fill(const int x, const int y, const int i, int visited[]) {
    if (is_collision(x, y) || visited[x * width + y]) { return 0; }
    visited[x * width + y] = 1;
    int size = 1;
    for (auto& action : ACTIONS) {
        bool has_tail = false;
        Position tail;
        if (i > 1) {
            // Temporarily clear the tail segment before recursing.
            has_tail = true;
            tail = body[i];
            grid[tail.x * width + tail.y] = 0;
        }
        size += flood_fill(x + action.x, y + action.y, i - 1, visited);
        if (has_tail) { grid[tail.x * width + tail.y] = 1; }
    }
    return size;
}

int Snake::is_safe_action(const int x, const int y) {
    // A vector avoids the variable-length array, which is not standard C++.
    std::vector<int> visited(height * width, 0);
    int size = body.size();
    return flood_fill(x, y, size - 1, visited.data()) > size;
}
```

The distance state is the normalized distance between the head of the snake and the target: each axis of the difference is reduced to its sign. If the difference is positive, it is represented with 1; if negative, with -1; if there is no difference, with 0.
| Δd | Δd < 0 | Δd = 0 | Δd > 0 |
|---|---|---|---|
| x | -1 | 0 | 1 |
| y | -1 | 0 | 1 |
```cpp
void Snake::set_distance() {
    distance = (Position){
        (body[0].x > target.x) - (body[0].x < target.x),
        (body[0].y > target.y) - (body[0].y < target.y)
    };
}
```

The state key is created by hashing 6 integer values: 4 from the direction state and 2 from the distance state.
Below you see a state representation of an action.
| State Key | UP | RIGHT | DOWN | LEFT | Δdx | Δdy |
|---|---|---|---|---|---|---|
| 3005401531224964638 | 0 | 2 | -2 | 0 | 1 | -1 |
```cpp
size_t Agent::get_key(Snake *snake) {
    snake->set_distance();
    snake->set_directions();
    std::array<int, 6> state;
    for (int i = 0; i < 4; i++) { state[i] = snake->directions[i]; }
    state[4] = snake->distance.x;
    state[5] = snake->distance.y;
    // Combine the 6 values with a boost-style hash_combine.
    size_t key = 0;
    for (int i = 0; i < 6; i++) {
        key ^= std::hash<int>{}(
            state[i]
        ) + 0x9e3779b9 + (key << 6) + (key >> 2);
    }
    return key;
}
```

Rewards are determined arbitrarily and may be decreased or increased:
| No collision | Target collision | Body collision | Wall collision | Dangerous Action |
|---|---|---|---|---|
| 0 | 2 | -2 | -2 | -1 |
```cpp
int penalty_for_obstacle_collision(Snake *snake) {
    if (snake->directions[snake->action] == -2) {
        snake->done = 1; // body or wall collision ends the episode
        return -2;
    }
    return 0;
}

int reward_for_target_collision(Snake *snake) {
    snake->increase_body();
    if (snake->directions[snake->action] == 2) {
        snake->set_target();
        return 2;
    }
    // No food eaten: undo the speculative growth.
    snake->decrease_body();
    return 0;
}

int penalty_for_dangerous_action(Snake *snake) {
    if (snake->directions[snake->action] == -1) { return -1; }
    return 0;
}

int get_reward(Snake *snake) {
    int reward = 0;
    reward += penalty_for_obstacle_collision(snake);
    if (snake->done) { return reward; }
    reward += reward_for_target_collision(snake);
    reward += penalty_for_dangerous_action(snake);
    return reward;
}
```

Q-values are updated according to the Bellman equation.
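In symbols, with learning rate α, discount factor γ, reward r, current state-action pair (s, a), and next state s′, the update is:

```math
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)
```

This is term-for-term what `q_algorithm` computes.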
```cpp
double q_algorithm(
    const double current_q,
    const double max_next_q,
    const int reward,
    const double learning_rate,
    const double discount_factor
) {
    double next_q = reward + discount_factor * max_next_q;
    return (1 - learning_rate) * current_q + learning_rate * next_q;
}

void update_q_value(
    Agent *agent,
    Snake *snake,
    const size_t next_key,
    const int reward,
    const double learning_rate,
    const double discount_factor
) {
    double current_q = agent->table[snake->key][snake->action];
    int max_next_q_index = max_index(agent, snake, next_key);
    double max_next_q = agent->table[next_key][max_next_q_index];
    double new_q = q_algorithm(
        current_q,
        max_next_q,
        reward,
        learning_rate,
        discount_factor
    );
    agent->table[snake->key][snake->action] = new_q;
}
```

The agent.dat file stores the Q-table generated by the Q-learning algorithm. It keeps track of the agent's learning progress and allows the snake to continue learning or resume its training from the point where it left off.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.

