This repository contains my notes and explorations while learning from the book Multiagent Systems by Yoav Shoham and Kevin Leyton-Brown. Each folder corresponds to a chapter or topic from the book, with Jupyter notebooks working through the concepts, algorithms, and examples.
Distributed constraint satisfaction addresses how agents with independent constraints can collectively reach a solution that satisfies all requirements.
Imagine a network of cell towers, where each tower needs to use a different frequency to those nearby. This is basically a graph coloring problem, where the frequencies are the colors.
This notebook introduces domain pruning, where each node eliminates values from its domain that can't possibly work. It goes something like this:
X3 removes red...
X2 removes red...
X2 removes blue...
Filtering algorithm progressively reducing domains until a solution is found.
Filtering always terminates, but it is not a complete method: when several solutions exist it can stall with more than one value left in a domain, and it won't always detect that a problem has no solution.
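The pruning loop can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the function name `prune_domains` and the three-node triangle are assumptions made up for the example.

```python
def prune_domains(domains, edges):
    """Repeatedly drop colours that no neighbour assignment could support."""
    domains = {node: set(values) for node, values in domains.items()}
    neighbours = {node: set() for node in domains}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    changed = True
    while changed:
        changed = False
        for node in sorted(domains):
            for other in sorted(neighbours[node]):
                # A colour survives if the neighbour could still pick a different one.
                supported = {c for c in domains[node]
                             if any(c != d for d in domains[other])}
                if supported != domains[node]:
                    removed = domains[node] - supported
                    print(f"{node} removes {', '.join(sorted(removed))}")
                    domains[node] = supported
                    changed = True
    return domains


# Hypothetical triangle: X1 is fixed to red, X2 can be red or blue.
domains = {"X1": {"red"}, "X2": {"red", "blue"}, "X3": {"red", "blue", "green"}}
edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X3")]
result = prune_domains(domains, edges)
```

In this example every domain shrinks to a single colour, so pruning alone solves it. If every node started with the same two or more colours, nothing would be removed, which is the "stuck" case described above.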
This notebook also introduces the hyper-resolution algorithm, which uses a general rule from propositional logic. It works by having nodes generate new constraints at each step, called "nogoods", and communicate those to their neighbours.
While simple to state, hyper-resolution is also expensive, so heuristic alternatives are often used.
One centralised strategy for solving a constraint problem is to do a recursive search over values, backtracking when you find one that doesn't work. We can take that idea and try to make it work in a decentralised way by having each node simply choose a value that works with its current neighbours.
While easy to explain, this approach unfortunately sometimes fails to find a solution even when one exists, for example:
The algorithm fails in this case.
The asynchronous backtracking algorithm can be thought of as a greedy version of the hyper-resolution algorithm, meeting somewhere between the two previous ideas. Essentially, we want every node to be trying values, and then backtracking with a 'nogood' when it can't find one that works.
Asynchronous backtracking progressively finding a solution through message passing.
Distributed Optimization is about how agents with different sets of information can optimise a global objective function.
Dijkstra's algorithm builds paths outward from the origin by always selecting the unvisited node with the shortest distance. Asynchronous dynamic programming instead has nodes update their distance estimates in any order. It still converges to the optimal solution.
node s found a new shortest path to a with length 1
node s found a new shortest path to b with length 3
node a found a new shortest path to b with length 2
node a found a new shortest path to c with length 3
node b found a new shortest path to d with length 4
node c found a new shortest path to t with length 4
Asynchronous dynamic programming progressively finding shortest paths.
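A minimal sketch of the asynchronous update, assuming each node keeps an estimate `h` of its distance to the goal and repeatedly sets it to the best "edge cost plus neighbour estimate". The graph below is hypothetical (not the one in the notebook), and the random update order is only there to show that order does not matter.

```python
import random


def async_dp(graph, goal, sweeps=20, seed=0):
    """graph[node] maps each successor to the edge cost; returns distance-to-goal estimates."""
    rng = random.Random(seed)
    h = {node: 0 for node in graph}          # optimistic starting estimates
    others = [node for node in graph if node != goal]
    for _ in range(sweeps):
        rng.shuffle(others)                  # nodes update in an arbitrary order
        for node in others:
            h[node] = min(cost + h[nxt] for nxt, cost in graph[node].items())
    return h


# Hypothetical directed graph from s to t.
graph = {
    "s": {"a": 1, "b": 3},
    "a": {"b": 1, "c": 2},
    "b": {"d": 1},
    "c": {"t": 1},
    "d": {"t": 2},
    "t": {},
}
h = async_dp(graph, goal="t")
```

Changing `seed` changes the order in which nodes update but not the final estimates.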
The notebook also covers Learning Real-Time A* (LRTA*), where multiple agents explore the graph simultaneously, sharing distance estimates that never overestimate the true cost. When the same path is followed on two consecutive trials, that path is guaranteed to be a shortest path.
This notebook reviews Markov Decision Processes (MDPs), which extend path-finding to stochastic environments. The Bellman equation characterises the optimal value function, and the Bellman update is proven to be a contraction mapping, which is why value iteration converges. A go-kart racing example demonstrates the concepts:
Go-kart racing MDP policy with two lanes and stochastic transitions.
When multiple agents coordinate in an MDP, the state space becomes huge. This notebook explores variable elimination to reduce complexity by factorizing the value function when agents have sparse dependencies. The key insight is that if agent i's Q-function only depends on actions of agents i and i+1, we can pass messages forward to find optimal joint actions efficiently. An example shows agents choosing positions where each gets utility from their choice plus a bonus for aligning with the next agent:
Agents coordinating positions, variable elimination makes this fast to compute.
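As a rough sketch of the chain case described above: each agent is eliminated in turn by sending the next agent a message holding the best value achievable so far for each of that agent's actions. The names `chain_variable_elimination` and `total_utility`, the utilities, and the bonus are all made up for illustration; the brute-force maximum is included only as a sanity check.

```python
from itertools import product


def total_utility(own, bonus, joint):
    """Sum of each agent's own utility plus a bonus per aligned neighbouring pair."""
    base = sum(own[i][a] for i, a in enumerate(joint))
    aligned = sum(1 for a, b in zip(joint, joint[1:]) if a == b)
    return base + bonus * aligned


def chain_variable_elimination(own, bonus):
    """own[i][a] is agent i's utility for position a. Returns (best value, joint action)."""
    n, actions = len(own), range(len(own[0]))
    message = [0] * len(actions)   # nothing has been eliminated before agent 0
    best_reply = []                # best_reply[i][b]: agent i's best action if agent i+1 plays b
    for i in range(n - 1):
        new_message, reply = [], []
        for nxt in actions:
            scores = [own[i][a] + (bonus if a == nxt else 0) + message[a]
                      for a in actions]
            best = max(actions, key=scores.__getitem__)
            new_message.append(scores[best])
            reply.append(best)
        message = new_message
        best_reply.append(reply)

    # The last agent has no successor, so it only adds its own utility.
    totals = [own[n - 1][a] + message[a] for a in actions]
    last = max(actions, key=totals.__getitem__)
    joint = [last]
    for reply in reversed(best_reply):   # walk back to recover earlier choices
        joint.append(reply[joint[-1]])
    joint.reverse()
    return totals[last], joint


own = [[1, 0, 0], [0, 2, 0], [0, 0, 1], [0, 3, 0]]
value, joint = chain_variable_elimination(own, bonus=2)
brute_force = max(total_utility(own, 2, j) for j in product(range(3), repeat=4))
```

The elimination pass does work proportional to (agents × positions²), whereas the brute-force check enumerates positions^agents joint actions.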
A review of linear programming and the simplex algorithm. Visualizes 2D and 3D examples of moving between vertices of the feasible region.
Every linear program has a dual formulation. The fundamental theorem: if both have optimal solutions, their values are equal.
This notebook connects optimization to economics through the assignment problem, where agents must be matched to objects to maximize total value. The problem can be solved as a linear program, but remarkably also has a decentralized solution: competitive equilibrium prices exist where each agent selfishly picks their best option at those prices, yet the result is globally optimal. This leads naturally to auction mechanisms where agents bid for items, with prices rising until equilibrium is reached.
This notebook extends the assignment problem to scheduling, where agents need time slots to complete tasks before deadlines. Unlike simple assignment, the solutions require integer programming since agents may need multiple consecutive slots. The competitive equilibrium approach still applies: by setting appropriate prices for time slots, agents can independently choose their schedules in a way that produces globally optimal allocations. This demonstrates how price-based mechanisms can coordinate complex multiagent decisions.
This notebook explores social laws: rules that constrain agent behavior to improve global outcomes without requiring continuous coordination.
This section introduces self-interested agents, each with their own preferences.
This notebook establishes the mathematical framework of utility theory. The Von Neumann-Morgenstern utility theorem shows that if preferences satisfy certain axioms, they can be represented by a utility function over lotteries. The point is to show that utility is grounded in the maths of preferences.
A normal-form game consists of players, actions, and payoff functions. Games are represented as matrices with utilities for each player based on action profiles. Pure strategies are introduced as well as mixed strategies, where players randomize over actions.
The notebook introduces classic games like Prisoner's Dilemma: If both cooperate they get 5 each. If one cooperates and one defects, the defector gets 8 and the cooperator gets 0. If both defect they get 2 each. The jointly best outcome is (C,C) but the predicted outcome is (D,D). Notice below that regardless of what player 2 does, player 1 is better off defecting (and vice versa).
Heatmap showing utilities for different mixed strategy combinations.
This notebook covers Nash equilibrium, a strategy profile where no player can improve by unilaterally changing their strategy. For Prisoner's Dilemma, (D,D) is the unique Nash equilibrium, even though both players would be better off at (C,C).
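Both claims can be checked mechanically. The sketch below uses the Prisoner's Dilemma payoffs quoted earlier (5 each for mutual cooperation, 8 and 0 when one defects, 2 each for mutual defection); the helper names are hypothetical.

```python
# payoffs[(a1, a2)] = (utility to player 1, utility to player 2)
payoffs = {
    ("C", "C"): (5, 5),
    ("C", "D"): (0, 8),
    ("D", "C"): (8, 0),
    ("D", "D"): (2, 2),
}
actions = ["C", "D"]


def strictly_dominates(player, a, b):
    """True if action a beats action b for `player` against every opponent action."""
    for opp in actions:
        profile_a = (a, opp) if player == 0 else (opp, a)
        profile_b = (b, opp) if player == 0 else (opp, b)
        if payoffs[profile_a][player] <= payoffs[profile_b][player]:
            return False
    return True


def pure_nash_equilibria():
    """All pure profiles where neither player gains by deviating alone."""
    equilibria = []
    for a1 in actions:
        for a2 in actions:
            u1, u2 = payoffs[(a1, a2)]
            stable_1 = all(payoffs[(d, a2)][0] <= u1 for d in actions)
            stable_2 = all(payoffs[(a1, d)][1] <= u2 for d in actions)
            if stable_1 and stable_2:
                equilibria.append((a1, a2))
    return equilibria


print(strictly_dominates(0, "D", "C"), strictly_dominates(1, "D", "C"))
print(pure_nash_equilibria())
```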
This notebook explores further solution concepts: minimax strategies for zero-sum games, where each player minimizes their maximum possible loss, and correlated equilibrium, where a trusted third party recommends actions. It also discusses dominated strategies and iterated elimination of dominated strategies.
Unfortunately, computing Nash equilibria is computationally challenging - the problem is complete for a complexity class called PPAD, for which no polynomial-time algorithm is known, so worst-case running time is believed to grow exponentially. This section covers algorithms for finding equilibria.
This notebook starts with the easy case: two-player, zero-sum games, where computing a Nash equilibrium reduces to solving a linear program.
It then moves to general two-player games, where equilibrium computation is harder. The notebook introduces an LCP (linear complementarity problem) formulation and the Lemke-Howson algorithm as a pivoting method for finding equilibria.
One of the key geometric ideas is the “labelling” view. At equilibrium, each mixed strategy is labelled by its zero-probability actions and the opponent’s best-response actions, which means equilibrium occurs where the combined labels cover all actions. This leads to a path-following algorithm that traces edges of the best-response polytopes to find equilibria.
The notebook visualizes this labelling process:
Lemke-Howson algorithm: utility curves show best responses, labelling diagrams identify equilibrium points where all actions are covered.
This notebook takes a deeper dive into the Lemke-Howson algorithm, including an explanation of pivoting.
This notebook explores gradient-based methods, where equilibrium-finding is formulated as optimization.
One way to make the computation easier is to iteratively remove dominated strategies until no more exist. This notebook looks at domination by pure and mixed strategies in detail.
A correlated equilibrium involves a trusted third party that recommends joint action profiles to players. Unlike Nash equilibria where players independently randomize, here players can condition on correlated signals (like traffic lights). Computing these is just a linear program - the constraints ensure no player wants to deviate from the mediator's recommendations. However, the algorithm finds many different correlated equilibria, not necessarily the intuitive ones. While every Nash equilibrium is also a correlated equilibrium, the ease of computing correlated equilibria doesn't help find Nash equilibria, since Nash additionally requires the joint distribution to be a product of independent per-player distributions, which is not a linear constraint.
Normal-form games assume simultaneous moves. This section introduces extensive-form games for sequential decision-making using game trees.
Games where every player knows the complete history (like chess). The game is a tree with decision nodes, actions, and payoffs. Any extensive-form game can be converted to normal-form, but this can introduce problematic Nash equilibria involving non-credible threats. Subgame-perfect equilibrium is a refinement that requires Nash equilibrium in every subgame. Backward induction computes subgame-perfect equilibria efficiently by working backwards from the leaves:
Backward induction solving a game tree (green annotations show calculated values).
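Backward induction is short enough to sketch directly. The tree encoding below (a leaf is a payoff tuple, a decision node is a dict with `player` and `children`) and the example payoffs are assumptions for illustration, not the notebook's representation.

```python
def backward_induction(node):
    """Return (payoffs, path of actions) chosen by working back from the leaves."""
    if isinstance(node, tuple):          # leaf: payoffs for (player 0, player 1)
        return node, []
    mover = node["player"]
    best = None
    for action, child in node["children"].items():
        payoffs, path = backward_induction(child)
        # The player moving here keeps whichever child is best for them.
        if best is None or payoffs[mover] > best[0][mover]:
            best = (payoffs, [action] + path)
    return best


# Hypothetical two-stage game: player 0 picks L or R, then player 1 picks U or D.
tree = {
    "player": 0,
    "children": {
        "L": {"player": 1, "children": {"U": (3, 1), "D": (0, 0)}},
        "R": {"player": 1, "children": {"U": (1, 0), "D": (4, 2)}},
    },
}
payoffs, path = backward_induction(tree)
```

Each node is visited once, so the cost is linear in the size of the tree. Ties are broken in favour of the first action listed, which is an arbitrary choice in this sketch.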
In many games, players don't observe all previous moves (like poker). Information sets group nodes a player can't distinguish (shown as dashed blue lines):
Game with information sets representing imperfect information.
A key concept is behavioral strategies: instead of mixing over complete strategies (as in normal-form), players randomize independently at each information set. For games with perfect recall (where players remember their own past actions), behavioral strategies are equivalent in expressive power to mixed strategies. Without perfect recall, some mixed strategies cannot be represented as behavioral strategies and vice versa.
The sequence form provides efficient computation for imperfect-information games with perfect recall. A sequence is the ordered list of a player's own actions on the path to a node. A realization plan assigns probabilities to sequences such that probabilities sum correctly up the tree. Computing best responses becomes a linear program (or its dual), dramatically more efficient than enumerating the pure strategies of the equivalent normal-form game.
This notebook also covers sequential equilibria, which extend subgame-perfect equilibrium to imperfect-information games. The key challenge: when information sets group nodes together, players must form beliefs about which node they're at. Sequential equilibrium requires strategies to be optimal given these beliefs:
There are two Nash equilibria: LU and RD. But it's a bit odd! In truth, player 1 would realise that if they play R then player 2 will go D, which is better for player 1 than LU. The only credible equilibrium is then RD. We can see this if we compute the subgames. In this case the subgames are trivial as each player just makes the best decision they can:
When it comes to imperfect-information games this concept is extended into "sequential equilibria" and gets a bit more complicated. Let's assume that the second player doesn't know what the first did. Then we get something more like this:
We can no longer just push back the values, because what player 2 will do is dependent on what player 1 does. We have to deal with a forest of subgames. Notice that the subgames from C and from R have different optimal solutions! So we can't use the subgame-perfect equilibrium logic.
Many real-world interactions don't fit one-shot normal-form games or finite extensive-form games. This section explores richer representations: repeated games, stochastic games, Bayesian games with incomplete information, and compact representations.
When agents interact repeatedly, cooperation can be sustained through threats of future punishment. The folk theorem says, roughly, that any feasible payoff profile giving every player at least their minmax value can be sustained as an equilibrium of the infinitely repeated game. The notebook explores bounded rationality using finite automata to represent strategies, constraining computational complexity:
Classic Tit-for-Tat strategy as a finite automaton.
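A finite-automaton strategy needs only a start state, an output action per state, and a transition on the opponent's last action. The dict layout and the `play` helper below are a hypothetical encoding chosen for this sketch.

```python
TIT_FOR_TAT = {
    "start": "c",
    "output": {"c": "C", "d": "D"},
    # next state depends on (current state, opponent's last action)
    "next": {("c", "C"): "c", ("c", "D"): "d", ("d", "C"): "c", ("d", "D"): "d"},
}

ALWAYS_DEFECT = {
    "start": "d",
    "output": {"d": "D"},
    "next": {("d", "C"): "d", ("d", "D"): "d"},
}


def play(machine1, machine2, rounds):
    """Run two automata against each other and return the list of action pairs."""
    s1, s2 = machine1["start"], machine2["start"]
    history = []
    for _ in range(rounds):
        a1, a2 = machine1["output"][s1], machine2["output"][s2]
        history.append((a1, a2))
        # Each machine transitions on what the *other* player just did.
        s1 = machine1["next"][(s1, a2)]
        s2 = machine2["next"][(s2, a1)]
    return history


print(play(TIT_FOR_TAT, TIT_FOR_TAT, 3))
print(play(TIT_FOR_TAT, ALWAYS_DEFECT, 3))
```

Tit-for-Tat cooperates forever against itself, and against Always-Defect it is exploited only in the first round.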
The notebook demonstrates computing Nash equilibria for automata-restricted strategies in repeated games, showing that not all equilibria involve always defecting.
A stochastic game combines game theory with MDPs - multiple states with transitions after actions. This generalizes both normal-form games (one state) and MDPs (one player). Strategies can be behavioral, Markov, or stationary. The notebook uses value iteration adapted for two players, computing Q-functions and finding Nash equilibria at each state. However, this optimizes per-state rather than globally, potentially missing some equilibria. Computing equilibria is generally much harder than single-player MDPs.
Players have private information ("types"). For example, in auctions each bidder knows their own valuation but not others'. Three equivalent representations: information sets, extensive form with Nature making initial random choices, and epistemic types with a common prior. Expected utility can be computed ex post (knowing all types), ex interim (knowing your own type), or ex ante (knowing nothing). Mechanism design uses Bayesian games extensively.
Extensive form representation with Nature node choosing between four games (Matching Pennies, Prisoner's Dilemma, Coordination, Battle of Sexes). Information sets show what each player knows about the game being played.
Congestion games model shared resources like traffic or network routing, where costs depend on how many agents use each resource. Key property: every congestion game is a potential game, guaranteeing a pure-strategy Nash equilibrium. The notebook proves congestion games are potential games and demonstrates that myopic best response converges to equilibrium. Price of anarchy measures how much worse equilibrium is compared to the social optimum, and can grow unbounded in some networks.
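The link between the potential function and best-response convergence can be seen in a small sketch. It assumes the simplest setting, where each player picks exactly one resource, and uses the standard Rosenthal potential (for each resource, the sum of its cost at loads 1 up to the current load). The function name and the two cost functions are made up for illustration.

```python
def best_response_dynamics(n_players, costs, max_rounds=100):
    """costs[r](k) is what each user of resource r pays when k players use it."""
    choice = [0] * n_players                 # everyone starts on resource 0

    def load(r):
        return sum(1 for c in choice if c == r)

    def potential():
        return sum(costs[r](k)
                   for r in range(len(costs))
                   for k in range(1, load(r) + 1))

    potentials = [potential()]
    for _ in range(max_rounds):
        moved = False
        for i in range(n_players):
            current = costs[choice[i]](load(choice[i]))
            # Switching to r means r carries one more user than it does now.
            options = {r: costs[r](load(r) + 1)
                       for r in range(len(costs)) if r != choice[i]}
            best = min(options, key=options.get)
            if options[best] < current:
                choice[i] = best
                moved = True
                potentials.append(potential())
        if not moved:                        # nobody wants to deviate: equilibrium
            break
    return choice, potentials


# Hypothetical two-road example with four drivers.
costs = [lambda k: k, lambda k: k + 1]
choice, potentials = best_response_dynamics(4, costs)
```

Every improving move lowers the potential by exactly the amount the mover saves, so the recorded `potentials` list is strictly decreasing and the loop has to stop.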
For many players, full payoff matrices are intractable. This notebook covers polynomial-type games and efficient expected utility computation. Graphical games represent payoff dependencies as a graph where agents only depend on neighbors. Action-graph games compress actions rather than players. Multi-agent influence diagrams (MAIDs) combine graphical games with extensive forms. These representations enable polynomial-time algorithms for equilibria and expected utility when dependencies are sparse.
Multi-agent influence diagram showing two players (grey shades) deciding about a tree and patio. Rectangles are decisions, ovals are random events, diamonds are utilities. Dotted arrows indicate information flow; solid arrows indicate causal effects.
Learning in game theory involves repeated or stochastic games where agents adapt over time. The key: it's not just learning in isolation - agents are learning about other agents who are also learning. Players teach and learn with each other.
Learning in multiagent systems is fundamentally different from single-agent learning. The environment isn't stationary - when others change strategies, your optimal strategy changes. Additionally, intelligent agents aren't just learning but also teaching through their actions. Example: a player might play suboptimally in the short term to teach opponents they'll cooperate, enabling better long-term outcomes.
Fictitious play: each agent tracks the empirical frequency of opponents' past actions and best-responds to it. Agents maintain a count of how many times opponents played each action, normalize to get probabilities, then play a best response. Key properties: (1) a pure-strategy profile that is a strict Nash equilibrium is a steady state of fictitious play, and any pure-strategy steady state is a Nash equilibrium; (2) if empirical distributions converge, they converge to Nash equilibrium. Guaranteed convergence for: zero-sum games, games solvable by iterated elimination of strictly-dominated strategies, potential games, and 2×n games with generic payoffs. Rock-Paper-Scissors demonstrates convergence without a steady state: play never settles on fixed pure strategies, but empirical frequencies converge to uniform (1/3, 1/3, 1/3), as expected.
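A minimal two-player sketch of the counting-and-best-responding loop. The function name, the uniform prior count of 1, and the lowest-index tie-break are all assumptions of this example; different choices can change the early rounds of play.

```python
def fictitious_play(payoff1, payoff2, rounds, prior=1):
    """payoffK[a1][a2] is player K's payoff. Returns each player's action counts."""
    n1, n2 = len(payoff1), len(payoff1[0])
    counts1 = [prior] * n1     # how often player 1 has been seen playing each action
    counts2 = [prior] * n2
    for _ in range(rounds):
        # Unnormalised counts give the same best response as normalised frequencies.
        value1 = [sum(payoff1[a][b] * counts2[b] for b in range(n2)) for a in range(n1)]
        value2 = [sum(payoff2[a][b] * counts1[a] for a in range(n1)) for b in range(n2)]
        a1 = max(range(n1), key=value1.__getitem__)   # ties go to the lowest index
        a2 = max(range(n2), key=value2.__getitem__)
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1, counts2


# Prisoner's Dilemma (index 0 = cooperate, 1 = defect): play locks onto (D, D).
pd1 = [[5, 0], [8, 2]]
pd2 = [[5, 8], [0, 2]]
pd_counts = fictitious_play(pd1, pd2, rounds=20)

# Rock-Paper-Scissors: no steady state, so look at empirical frequencies instead.
rps1 = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rps2 = [[-u for u in row] for row in rps1]
rps_counts, _ = fictitious_play(rps1, rps2, rounds=10000)
rps_freq = [c / sum(rps_counts) for c in rps_counts]
```

In the Prisoner's Dilemma, defecting is the best response to any belief, so every round adds to the defect count. For Rock-Paper-Scissors, `rps_freq` can be inspected to see how close the empirical mix is to uniform after a given number of rounds.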
Rational learning extends fictitious play by allowing agents to model entire opponent strategies (like tit-for-tat or trigger strategies) rather than just action frequencies. Players maintain beliefs over a set of possible strategies, update via Bayesian inference after observing opponent actions, and choose best responses to their beliefs. Example: in repeated Prisoner's Dilemma, if an opponent cooperates for two rounds, a rational learner eliminates "always defect" and "trigger at t=0" from consideration, concentrating beliefs on longer-horizon strategies. Can include a mistake probability to prevent beliefs from zeroing out (allowing for strategy changes). Converges to ε-equilibrium but requires careful tuning of belief updates and best-response functions.
Reinforcement learning doesn't assume agents know transition probabilities or opponent payoffs. Q-learning updates action-values via Q(s,a) ← (1-α)Q(s,a) + α(r + γV(s')), with learning rate α declining as 1/t. The notebook demonstrates Q-learning on a simple 3-state MDP, showing convergence to optimal values without knowing transition matrices. For multiagent zero-sum games, extend Q to include opponent actions Q(s,a,o) and compute V via minimax: V(s) = max_a min_o Q(s,a,o).
Regret measures the difference between the reward you would have earned by always playing some fixed pure strategy and your actual average reward. A no-regret algorithm guarantees that, in the limit, regret is non-positive for every pure strategy. Regret matching: at each timestep, choose actions with probability proportional to their positive regret (how much better that action would have been). If all players use no-regret learning, empirical action frequencies converge to correlated equilibria.
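The two pieces of regret matching, accumulating regret and turning it into a mixed strategy, can be sketched as below. The function names are hypothetical, and falling back to a uniform strategy when no action has positive regret is one common convention rather than the only option.

```python
def update_regret(cumulative_regret, payoff_per_action, played):
    """Add, for each action, how much better it would have done than the one played.

    payoff_per_action[a] is the payoff action a would have earned this round
    against what the opponent actually did.
    """
    received = payoff_per_action[played]
    return [r + (u - received) for r, u in zip(cumulative_regret, payoff_per_action)]


def regret_matching_strategy(cumulative_regret):
    """Play each action with probability proportional to its positive regret."""
    positive = [max(r, 0) for r in cumulative_regret]
    total = sum(positive)
    if total == 0:
        return [1 / len(positive)] * len(positive)   # assumed fallback: uniform
    return [p / total for p in positive]


# Rock-Paper-Scissors round: we played Rock (index 0), the opponent played Paper,
# so Rock earned -1, Paper would have earned 0 and Scissors would have earned 1.
regret = update_regret([0, 0, 0], [-1, 0, 1], played=0)
strategy = regret_matching_strategy(regret)
```

After this single round the regrets are 0, 1 and 2, so the next strategy puts one third on Paper and two thirds on Scissors.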
When modeling many agents, use population-level dynamics with θ(a) being the fraction playing action a. The growth rate of strategy a is θ(a)(u(a) - ū), where u(a) is the payoff to a and ū is the population average. Strategies above average grow; those below shrink.
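The growth rule above can be simulated with a simple Euler step. This is a sketch under assumptions: the step size `dt`, the function name, and the use of the Prisoner's Dilemma payoffs from earlier as a symmetric population game are all choices made for illustration.

```python
def replicator_step(theta, payoff, dt=0.01):
    """One Euler step of d(theta[a])/dt = theta[a] * (u(a) - average payoff).

    theta[a] is the population share playing a; payoff[a][b] is what an
    a-player earns when matched with a b-player.
    """
    n = len(theta)
    u = [sum(payoff[a][b] * theta[b] for b in range(n)) for a in range(n)]
    average = sum(t * ua for t, ua in zip(theta, u))
    return [t + dt * t * (ua - average) for t, ua in zip(theta, u)]


# Index 0 = cooperate, 1 = defect; start with 90% cooperators.
payoff = [[5, 0], [8, 2]]
theta = [0.9, 0.1]
for _ in range(1000):
    theta = replicator_step(theta, payoff)
```

Defectors always earn more than the population average here, so their share grows until cooperators have all but vanished, while the shares keep summing to one.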
So far we've assumed agents observe actions but don't communicate directly. This section explores when and how communication matters in strategic settings.
Does communication before the game change things? For Prisoner's Dilemma: well, no. Regardless of what both players say, they will each still choose to defect. Cheap talk is communication that's costless and non-binding.
Game tree with pre-play cheap talk communication stage.
In games with multiple equilibria (Battle of the Sexes), cheap talk can help coordinate. You can see that doing the opposite of what was said is an equilibrium too... How confusing!
Signalling games model asymmetric information: Nature chooses a game type, Player 1 observes it and takes an action, Player 2 observes only the action (not the game type) and responds. Player 1's action serves as a signal about the hidden information. Unlike cheap talk, signals can be credible because they're costly or constrained.
Nature chooses between Prisoner's Dilemma and Stag Hunt. Player 1 sees the game and chooses C or D. Player 2 observes only Player 1's action (shown by information sets) and must respond without knowing which game is being played.
The challenge: Player 1's optimal action depends on what Player 2 will do, but Player 2's optimal response depends on inferring which game is being played from Player 1's action. Simple Bayesian reasoning fails because Player 1 anticipates Player 2's inference.
Two zero-sum games where Player 1 can signal the game type through action choice. If Player 1 always plays B, they average 2.5 regardless of Player 2's response, better than playing each game's Nash equilibrium (which would average 1.5).
Speech-act theory distinguishes three aspects of speech: locutionary (literal content), illocutionary (speaker intention), and perlocutionary (effect on listener). Grice's cooperative principles govern conversation: Quantity (provide needed information), Quality (be truthful), Relation (be relevant), and Manner (be clear). These principles explain implicature - how "Harry hasn't gone to prison yet" conveys more than its literal meaning.
Signaling game modeling ambiguous language. Nature determines whether a wooden chair or the meeting chair is coming. The speaker chooses clarity level (costly to be more specific). The listener infers meaning from ambiguous statements. Optimal strategy: be clear for the less likely event, use shorthand for the more likely one.
Tied up with the value of information.
Everything before has assumed clear utility functions. But in many cases you can't get utilities without them being gamed or misreported. Instead you can ask for preferences, i.e., choose A or B.
This notebook explores what properties we want from voting systems and why that's surprisingly difficult.
A sensible approach is to meet the Condorcet condition, where an option is selected if it beats every other in pairwise majority comparisons. But this doesn't always work, because there is not always a Condorcet winner. Plurality voting, Borda voting, pairwise elimination, and approval voting are explored.
Arrow's Impossibility Theorem is presented: with three or more outcomes, any social welfare function that produces a full ranking of outcomes must violate at least one of three basic fairness properties: Pareto efficiency (if everyone prefers A to B, rank A higher), independence of irrelevant alternatives (the ranking of A vs B shouldn't depend on C), or non-dictatorship (no single person determines everything). The proof shows that any decisive coalition, one that can force a ranking, can be split into smaller decisive coalitions recursively until you reach a single dictator.
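A small sketch of three of these rules makes the disagreement concrete. The ballots below are invented for illustration, each ballot is a full ranking from most to least preferred, and ties are not handled carefully.

```python
from collections import Counter


def plurality(ballots):
    """Winner is whoever is ranked first most often."""
    return Counter(ballot[0] for ballot in ballots).most_common(1)[0][0]


def borda(ballots):
    """Each ballot gives n-1 points to its top choice, n-2 to the next, and so on."""
    scores = Counter()
    for ballot in ballots:
        for rank, candidate in enumerate(ballot):
            scores[candidate] += len(ballot) - 1 - rank
    return max(scores, key=scores.get)


def condorcet_winner(ballots):
    """Candidate who wins a strict majority against every rival, or None."""
    def beats(x, y):
        prefer_x = sum(1 for b in ballots if b.index(x) < b.index(y))
        return 2 * prefer_x > len(ballots)

    for candidate in ballots[0]:
        if all(beats(candidate, other) for other in ballots[0] if other != candidate):
            return candidate
    return None


# Hypothetical electorate of nine voters.
ballots = [("A", "B", "C")] * 4 + [("B", "C", "A")] * 3 + [("C", "B", "A")] * 2
cycle = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

print(plurality(ballots), borda(ballots), condorcet_winner(ballots))
print(condorcet_winner(cycle))
```

With these ballots plurality elects A, even though B beats both rivals head to head and also wins under Borda. The three-voter `cycle` profile has no Condorcet winner at all.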
There are some positive results if you restrict the problem. If outcomes are just the agents themselves and each agent has binary preferences (approve/disapprove), then approval voting satisfies analogous fairness conditions. For ranking systems in this setting, you can build iterative algorithms where votes propagate through the preference graph, achieving weaker but still meaningful fairness properties.
Mechanism design is about getting agents to behave in desirable ways, for example to be truthful. While game theory analyzes existing games, mechanism design inverts the question: what game should we design to achieve desired outcomes?
A mechanism consists of messages agents can send and an outcome function. A social choice function specifies what outcome should occur for each preference profile. Can we design a mechanism that implements this?
The revelation principle: Any social choice function that some mechanism implements in dominant strategies can also be implemented by a direct mechanism where agents report their types truthfully. The direct mechanism simply plays each agent's dominant strategy from the original mechanism on their behalf, so there is no reason to lie.
Vickrey auction: a second-price sealed-bid auction in which the highest bidder wins and pays the second-highest bid. Truthful bidding is a dominant strategy!
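The dominance of truthful bidding can be spot-checked numerically. In this sketch the function names, the bidder labels, and the tie-break (earlier entries win ties) are assumptions made for illustration.

```python
def vickrey_auction(bids):
    """bids maps bidder name -> bid. Returns (winner, price the winner pays)."""
    ranked = sorted(bids, key=bids.get, reverse=True)
    winner = ranked[0]
    price = bids[ranked[1]] if len(ranked) > 1 else 0
    return winner, price


def my_utility(valuation, my_bid, other_bids):
    """Utility of the bidder called 'me': value minus price if they win, else 0."""
    bids = dict(other_bids, me=my_bid)
    winner, price = vickrey_auction(bids)
    return valuation - price if winner == "me" else 0


valuation = 10
for others in ({"a": 7, "b": 4}, {"a": 12, "b": 4}):
    truthful = my_utility(valuation, valuation, others)
    # No alternative bid on this grid does better than bidding the true value.
    best_alternative = max(my_utility(valuation, bid, others) for bid in range(0, 21))
    print(others, truthful, best_alternative)
```

The price a winner pays never depends on their own bid, so shading the bid can only lose an auction worth winning, and overbidding can only win an auction at a price above the bidder's value.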















