This is the second post in a series recapping the talks at the CMU summer school on algorithmic economics (here is the first). See the summer school site for abstracts, slides, and (eventually) videos.
Next I’ll talk about the two lectures by Avrim Blum, a CMU local, about the many connections between online learning and game theory. Avrim began with a masterful development of Littlestone and Warmuth’s randomized weighted majority algorithm . This is an algorithm for making decisions online (e.g., which route to take to work each day) that provides a “no-regret” guarantee: your time-averaged performance (e.g., average travel time) approaches that of the best fixed option in hindsight.
Suppose there are N so-called experts (e.g., newsletters on investing) who each give advice about a binary action (e.g., whether to buy or sell stocks tomorrow). Your job is to choose one of the two actions based on this expert advice. You incur cost 1 if you choose incorrectly, 0 otherwise. To develop some intuition, first make the unrealistic assumption that some expert always predicts correctly — how long before you can identify it? The trick is to always take the action recommended by the majority of the experts who have made no mistakes thus far. Then, every time you make a mistake, you eliminate at least half of the remaining experts from future consideration. Thus, you identify the omniscient expert after at most log N mistakes.
The majority trick seems like a good idea. How can we extend it to the case of no omniscient expert? One idea is to re-start the above algorithm as soon as every expert has made at least one mistake — then the number of mistakes that we make is at most log N times that of the best expert. To do better, we should be more mindful of how experts performed in the past. A simple but great idea is to associate a weight (read: credibility) with each expert, intitially 1, and multiply this weight by every time the expert makes a mistake (where is a parameter to be tuned later). In the randomized weighted majority algorithm, at every time step you choose an action with probability proportional to the sum of the current weights of the experts that advocate that action.
The analysis of the algorithm is jaw-droppingly slick. At each time step: if the probability of a mistake by the algorithm is p, then the expected value of the experts’ total weight drops by a factor. On the other hand, if there is an expert that makes merely m mistakes, then its weight of single-handedly provides a lower bound on the experts’ total weight. Using the total weight of the experts as an intermediary, we’ve related the two quantites we care about — the algorithm’s cost and the quality of the best expert. Rewriting (taking logs and using a deft Taylor expansion) shows that the expected number of mistakes made by the algorithm is at most . Choosing to balance the two terms gives the desired no-regret guarantee.
The algorithm and analysis are flexible and permit several extensions. For example, to handle general bounded costs (in [0,1], say), just use the weight multiplier when following the ith expert leads to a cost of . An analogous algorithm and analysis handles bounded rewards instead of bounded costs. A more impressive extension is to the “bandit setting”, meaning the only feedback you get is the reward of the chosen strategy (e.g., you only learn the travel time of the route that you chose, not those of unchosen routes). Here Avrim covered the elegant reduction of Auer, Cesa-Bianchi, Freund, and Schapire from the bandit setting to the full-feedback setting. To use a no-regret algorithm as a black box, one must fabricate a reward for every action (the input that the black box expects) from the reward of the action you took (which is all you learn). The first trick is to assign zero reward to all unchosen actions. The second trick is to make sure the expected reward vector is unbiased by dividing the reward of the chosen action by the probability of taking it — note this can increase rewards significantly. The third trick is to control the maximum fabricated reward size by mixing in a little bit of the uniform distribution when choosing actions. The bottom line is that there are no-regret algorithms for the bandit setting, as well; the only catch is that the additive loss blows up from logarithmic in N to polynomial in N.
Having made all of us experts in online learning, Avrim moved on to applications in game theory. Here’s an amazing one: the existence of a no-regret algorithm, like randomized weighted majority, directly implies von Neumann’s Minimax theorem! Contrast this with the original proof (via a fixed-point theorem) or later proofs that explicitly use linear programming duality. Recall the Minimax theorem says that, under optimal randomized play in a zero-sum two-player game (e.g., randomizing uniformly in rock-paper-scissors), it doesn’t matter if you announce your strategy before or after that of the other player — you get the same expected payoff either way. One inequality is easy — going last can only help you. The reverse inequality can be proved by letting the row and column players play the game repeatedly, each choosing a strategy at each time step using some no-regret algorithm (where strategies correspond to experts, the game payoffs to rewards). The empirical distributions of the row and column players’ chosen strategies approach a minimax pair (equivalently, a Nash equilibrum): essentially, competing with the best expert in hindsight translates to each player doing as well as a best response to the other player’s empirical distribution (i.e., as well as choosing its strategy after that of the other player).
In non-zero sum games, simultaneous regret-minimization does not generally yield a Nash equilibrium (do you see why?), but the empirical distribution of joint play approaches the set of “coarse correlated equilibria” (see also Eva’s talk), a relaxation of correlated equilibria. There are also simple online algorithms that guarantee convergence of joint play to the smaller set of correlated equilibria — one uses an appropriately more stringent regret notion (swap regret) and constructs online algorithms that have vanishing swap regret (e.g., via a black-box reduction to standard regret-minizmization).
Avrim wrapped up by surveying some of his recent work in algorithmic game theory, including item pricing (see here for a cute vertex pricing problem in graphs and here for general results that follow from “random power of 2″ pricing schemes) and ways of guiding game dynamics to good Nash equilibria (see here for positive results for fair cost-sharing in neworks).
Recall from Avrim’s talk the history of the minimax theorem for two-player zero-sum games. von Neumann first proved it using Brouwer’s fixed-point theorem, but later proofs via linear programming duality lead to polynomial-time algorithms for computing a Nash equilibrium in such games. The known proofs of Nash’s theorem — stating that every finite nonncooperative game has at least one (mixed-strategy) Nash equilibrium — rely on fixed-point theorems. Should we be looking for a new proof of Nash’s theorem, one that results in an efficient algorithm? If not — if computing Nash equilibria is no easier than computing Brouwer fixed points — how would we prove it?
Costis began with a discussion of Brouwer’s fixed-point theorem, which states that every continuous function on a convex compact set has at least one fixed point. He outlined the proof for simplicies, which follows from Sperner’s lemma (see also Ariel’s recent post on the whining philosophers problem). This proof reduces approximate fixed-point computation to a certain type of path-following in an exponential-sized graph. In 1990, Papadimitriou defined the complexity class PPAD, essentially as all problems solvable via such a path-following algorithm. Suitable fixed-point computation problems are PPAD-complete — as hard as every problem in PPAD. Computing a Nash equilibrium is a PPAD problem. A tour de force reduction from 2005 (see here and here) proves that computing a Nash equilibrium is PPAD-complete, even in two-player non-zero-sum games. In this sense, fixed-point theorems are fundamental to Nash’s existence result, not an artifact of his proof.
What does this computational intractability mean? Paraphrasing Scarf, it suggests that Nash equilibria cannot always be used for the evaluation of economic activity (since you can’t efficiently solve for your model’s predictions). For similar reasons, Costis argued that the Nash equilibrium concept fails as a universal prediction of human behavior.
So what now? An important research direction is to identify conditions on a game that guarantee tractable equilibrium computation. Costis finished on a positve note with his tractability results for a multi-player generalization of zero-sum games . The model is the following. There is an undirected network. Each node is a player. Each edge is annotated with payoffs for a two-player game, played between its endpoints. Each node picks a single strategy, and its total payoff is the sum of the payoffs from the games in which it participates (one per incident edge). The sum of the players’ total payoffs is assumed to be zero in every strategy profile. (A special case is when every edge’s game is itself zero-sum.) Amazingly, this global zero-sum condition is enough to recover equilibrium tractability: a Nash equilibrium can be computed in polynomial time, the set of equilibria is convex, and play by no-regret learning algorithms converges to the set of equilibria.
2. Costis looked surprised when I suggested that almost none of the summer school students would have seen him talk about the complexity of equilibria before — he gave zillions of talks on the topic during 2007–2009, but this is already a new generation of PhD students!
3. Three-player zero-sum games are as general as two-player non-zero-sum games (proof: add a dummy player), so Nash equilibrium computation is PPAD-hard. Recovering tractability thus requires additional restrictions on the game.