
## Collaborative Software Architecture?

The Netflix \$1M collaborative filtering competition has ended with an outcome that is doubly interesting.  First, there was a nice “photo-finish”, with two separate groups passing the prize-winning mark and the winner not yet announced.  More interesting is the fact that both contenders at this point are rather large coalitions, each composed of several groups that were previously competing separately.  In fact, once the first such coalition passed the prize-winning mark, the second coalition formed, combined forces, and reached first place within the 30 days mandated by the rules.

Many have written about the significance of the fact that it is not one “magic bullet” that lets you win such a competition but rather the combination of many “small” ideas and much work.  What I am intrigued by is how such a fruitful combination of separately written pieces of software happens.  In the case of the first group, who have been working together for a while, one might take the point of view that it is just the usual miracle seen in any large software effort (open source or not), where a sound software architecture with well-defined interfaces lets many people contribute to a project.  (Just like Linux or Google Search keep getting better through the combined efforts of many programmers.)  However, in the case of the second group, there seems to be something more interesting going on, in the sense that the pieces were joined ex-post, seemingly with a very simple combination architecture. How can this be done?  Can it be done for other challenges as well?  (A competition to beat Google search?) Can competitions be designed with collaboration built in? (Maybe the second Netflix competition?)

Let us look at an easy example for simple software collaboration: optimization given a set of constraints (such as Integer programming).  Here the input consists of a set of constraints on n variables, as well as an optimization goal defined on these variables.  The output should be an assignment of values to the variables that satisfies the constraints and with as high as possible a value for the optimization goal.  The nice thing about such problems is that collaboration is essentially built in by the problem definition.  Once clear input and output formats are defined, combining several optimizers is easy: just run all of them on the input instance, and output the best assignment returned by any of the component optimizers.  If the total measure of success of an algorithm is the average (say) value of the returned assignment over some distribution of problem instances, then the collaborative optimizer is certainly at least as good as any single component used, and if the different components used have distinct “strengths” then the collaborative optimizer is actually better than any single one.  This architecture naturally supports specialization: each optimizer focuses on some subset of instances and tries to do as good a job as possible on it, leaving other inputs for other optimization efforts.
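This “run everything, keep the best” combination can be made concrete in a few lines. The following is a minimal sketch (the function names and the toy knapsack instance are mine, purely for illustration): each component optimizer is a black box, and the combined optimizer simply returns the best feasible assignment that any component found.

```python
def collaborative_optimizer(optimizers, instance, objective, feasible):
    """Run every component optimizer on the instance and keep the best
    feasible assignment any of them returns."""
    best = None
    for opt in optimizers:
        a = opt(instance)
        if a is not None and feasible(instance, a):
            if best is None or objective(instance, a) > objective(instance, best):
                best = a
    return best

# Toy domain: 0/1 knapsack, instance = (values, weights, capacity).
def objective(inst, a):
    values, weights, cap = inst
    return sum(v for v, take in zip(values, a) if take)

def feasible(inst, a):
    values, weights, cap = inst
    return sum(w for w, take in zip(weights, a) if take) <= cap

def greedy(priority):
    """A greedy component optimizer that packs items in order of priority."""
    def opt(inst):
        values, weights, cap = inst
        order = sorted(range(len(values)),
                       key=lambda i: priority(values[i], weights[i]),
                       reverse=True)
        a, room = [False] * len(values), cap
        for i in order:
            if weights[i] <= room:
                a[i], room = True, room - weights[i]
        return a
    return opt

by_value = greedy(lambda v, w: v)          # specialist 1: highest value first
by_density = greedy(lambda v, w: v / w)    # specialist 2: best value/weight first

inst = ([10, 7, 6], [5, 3, 3], 6)
best = collaborative_optimizer([by_value, by_density], inst, objective, feasible)
```

On this toy instance the value-greedy specialist finds an assignment of value 10 while the density-greedy one finds 13, so the combined optimizer inherits the better of the two; by construction, adding a component can never hurt the combined score.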

Now back to Netflix-like machine learning challenges where the optimization goal is to produce a single predictor p that agrees as much as possible with the given target function t.  The natural ML tendency would be, I suppose, to combine the different p's using some kind of weighted voting or averaging, as indeed the second team seems to have done according to the Wired blog post:

> To combine competitors’ algorithms, The Ensemble didn’t have to cut and paste much code together. Instead, they simply ran hundreds of algorithms from their 30-plus members (updated) and combined their results into a single set, using a variation of weighted averaging that favored the more accurate algorithms.

Thus the collaboration architecture is no different from the one suggested by the problem statement, with the addition of a module that chooses the best weights for averaging.  When the aim is to minimize the sum-of-squares error, as was the case in the Netflix challenge, the best coefficients for the weighted average can be obtained efficiently by linear regression, which should be doable with these data set sizes (I think).
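For concreteness, here is a minimal sketch (pure Python, all names my own) of how such blending weights could be computed: given the component predictors’ outputs on a common set of ratings, solve the normal equations $(P^T P)w = P^T t$ for the least-squares weights.

```python
def blend_weights(predictions, target):
    """Least-squares blending: find weights w minimizing
    sum_j (sum_i w[i] * predictions[i][j] - target[j])**2
    by solving the normal equations (P^T P) w = P^T t."""
    k, n = len(predictions), len(target)
    A = [[sum(p[j] * q[j] for j in range(n)) for q in predictions]
         for p in predictions]
    b = [sum(p[j] * target[j] for j in range(n)) for p in predictions]
    # Gaussian elimination with partial pivoting on the k x k system.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * k
    for r in range(k - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w

# Two toy "predictors" whose combination 1*p1 + 2*p2 reproduces the target exactly.
p1, p2 = [1, 1, 1, 1], [1, 2, 3, 4]
target = [3, 5, 7, 9]
weights = blend_weights([p1, p2], target)
```

In the toy example the recovered weights are (up to rounding) 1 and 2, since the target is exactly that combination; on real data the weights would simply be the least-squares fit, with more accurate predictors receiving larger weight.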

I can see many interesting theoretical and practical research issues in generalizing the scope of collaboration:

• How can we easily allow more sophisticated combinations?  E.g., a component that specializes in figuring out “who to ask” among the other algorithms? Or branch-and-bound algorithms that use partial results from other optimizers to control branching?  In the Netflix case, one researcher found a way to take temporal dynamics into consideration; can this be effortlessly used as a subroutine in other components?  What kind of meta-data will be useful here?
• Can we get organic growth, with sub-algorithms constantly being added that utilize and then improve the previous state of affairs?  How do we best mimic natural selection?  How do we avoid evolutionary dead-ends?
• How do we incentivize partial algorithms?  How do we determine the contribution of a single component relative to others?  I can immediately think of two types of models here: one based on a cooperative-games formalism, where for each subset S of components we have the value v(S) that its combination can bring, and another taking a sequential approach, where each contribution gets credited according to the marginal improvement that it brings relative to the previous combination.  This is somewhat like an intellectual property model, where the first one to come up with an idea gets the credit for it but others can then build upon it, a model that presumably encourages rapid progress.
• What if we don’t have the resources (say, CPU time) to run all components?  How do we choose which to use?  Obviously we want to take their cost-performance trade-off into account.  Can we have components that specialize in such resource allocation?

## Motivation

The dogma of game theory (and economics) is that strategic situations will reach equilibrium, one way or another. How this will happen is often left unanswered, but from a computational point of view this is quite a central question. Indeed, a great amount of work has been done regarding the computational complexity of reaching an equilibrium of a game, reaching the conclusion that an infeasible amount of computation might be needed. In this post I want to discuss another aspect of the complexity of reaching equilibrium: the amount of information that needs to be transferred between the different players. The point of view here is that initially each player “knows” only his own utility function $u_i$, and then the players somehow interact until they find an equilibrium of the game $(u_1, \ldots, u_n)$ defined by their joint utilities. The analysis here abstracts away the different incentives of the players (the usual focus of attention in game theory) and is concerned only with the amount of information transfer required between the players. I.e., we treat this as a purely distributed-computation question, where the input $(u_1, \ldots, u_n)$ is distributed between the players, who all just want to jointly compute an equilibrium point. We assume perfect cooperation and prior coordination between the players, i.e. that all follow a predetermined protocol devised to compute the required output with the minimum amount of communication. The advantage of this model is that it provides lower bounds for all sorts of processes and dynamics that should eventually reach equilibria: if much communication is required for reaching equilibrium even under perfect cooperation, then certainly, whatever individual strategy each player follows, convergence to equilibrium cannot happen quickly.

The main result that we shall give in this post is that finding a pure Nash equilibrium (or determining that none exists) requires essentially communicating everything about the utility functions — no shortcuts are possible. This is the basic result that appears in two papers that studied the issue: Communication complexity as a lower bound for learning in games by Vincent Conitzer and Tuomas Sandholm that considered two-player games and How Long to Equilibrium? The Communication Complexity of Uncoupled Equilibrium Procedures by Sergiu Hart and Yishay Mansour whose focus is multi-player games. See also the presentation by Sergiu Hart. A future post will discuss the much more complicated case of finding a mixed Nash equilibrium, a question which is still mostly open, despite results given in the H&M paper.

## The Communication Complexity model

The basic communication complexity model considers $n$ players, where each player $i$ holds part of the input, which we assume is an $N$-bit string, $x^i \in \{0,1\}^N$. The usual focus is on the two-player case, $n=2$, in which case we simplify the notation to use $x$ and $y$ rather than $x^1$ and $x^2$. The joint goal of these players is to compute the value of $f(x^1, \ldots, x^n)$, where $f : (\{0,1\}^N)^n \rightarrow Z$ is some predetermined function. As $f(x^1, \ldots, x^n)$ depends on all the values $x^i$, the players will need to communicate with each other in order to do so (we will use “broadcasting” rather than player-to-player channels), and they will do so according to a predetermined protocol. Such a protocol specifies when each of them speaks and what he says, where the main point is that each player’s behavior must be a function of his own input as well as of what was broadcast so far. We will assume that all communication is in bits (any finite alphabet can be represented in bits). The communication complexity of $f$ is the minimum number of bits of communication that a protocol for $f$ uses in the worst case. This model was introduced in a seminal paper by Andy Yao; the standard reference is still the (just slightly dated) book Communication Complexity I wrote with Eyal Kushilevitz, and online references include the Wikipedia article and the chapter from the Arora-Barak computational complexity book.


## The Disjointness Function

Our interest will be in lower bounds on communication complexity. There are several techniques known for proving such lower bounds, the easiest of which is the “fooling set method”, which we will use on the “disjointness function” for two players. The disjointness function, introduced in Yao’s original paper, plays a key role in communication complexity, and indeed it will turn out to be useful to us too: once we have a lower bound for it, the lower bound for the function of interest to us, finding a pure Nash equilibrium, will follow by a simple reduction.

The disjointness function: Alice holds $x \in \{0,1\}^N$ and Bob holds $y \in \{0,1\}^N$. They need to determine whether there exists an index $i$ where they both have 1, $x_i=1=y_i$. If we view $x$ as specifying a subset $S \subseteq \{1...N\}$ and $y$ as specifying $T \subseteq \{1 ... N\}$, then we are asking whether $S \cap T = \emptyset$.

Lemma: Any deterministic protocol for solving the disjointness problem requires at least $N$ bits of communication in the worst case.

Before proving this lemma, notice that it is almost tight: Alice can always send her entire input to Bob ($N$ bits), who then has all the information needed to determine the answer and sends it back to Alice (1 more bit). It is known that an $\Omega(N)$ lower bound also applies to randomized protocols, but the proof for the randomized case is harder, while the proof for the deterministic case is easy.
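The trivial protocol is easy to make concrete. The following sketch (function name mine) counts the broadcast bits explicitly: Alice announces her whole input, and Bob, who now sees both inputs, announces the one-bit answer.

```python
def trivial_disjointness_protocol(x, y):
    """Alice broadcasts her N input bits; Bob, seeing the transcript and his
    own input y, broadcasts a single answer bit.
    Returns (disjoint?, number of bits communicated)."""
    transcript = list(x)                       # Alice's N broadcast bits
    answer_bit = 0 if any(a and b for a, b in zip(transcript, y)) else 1
    transcript.append(answer_bit)              # Bob's 1-bit answer
    return bool(answer_bit), len(transcript)
```

On inputs of length $N = 4$ this always uses exactly $N + 1 = 5$ bits, matching the upper bound discussed above.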

Proof: Let us look at the $2^N$ different input pairs of the form $\{(x,y) \mid \forall i,\: x_i+y_i=1\}$ (these are the “maximally disjoint” inputs). We will show that for no two of these input pairs can the protocol produce the same communication transcript (i.e. the exact same sequence of bits sent throughout the protocol). It follows that the protocol must have at least $2^N$ different possible transcripts, and since these are all binary sequences, at least one of them must be of length at least $N$ bits.

Now comes the main point: why can’t two maximally disjoint pairs $(x,y)$ and $(x',y')$ have the same transcript?  (Note that since $y$ is the bitwise complement of $x$, two distinct such pairs must have $x \ne x'$.)  Had we been in a situation where $f(x,y) \ne f(x',y')$, we would know that the transcripts must be different, since the protocol must output a different answer in the two cases; however, in our case all maximally disjoint input pairs give the same answer: “disjoint”. We use the structure of communication protocols (each player’s actions can depend only on his own input and on the communication so far) to show that if $(x,y)$ and $(x',y')$ have the same transcript then $(x,y')$ has the same transcript too (and so does $(x',y)$). Consider the players communicating on inputs $(x,y')$: Alice, who sees only $x$, cannot distinguish this from the case of $(x,y)$, and Bob, who sees only $y'$, cannot distinguish it from the case of $(x',y')$. Thus when Alice communicates she will not deviate from what she does on $(x,y)$, and Bob will not deviate from what he does on $(x',y')$. Hence neither of them can be the first to deviate from the joint transcript, so neither ever deviates, and the joint transcript is also followed on $(x,y')$ (and, symmetrically, on $(x',y)$). We now get our contradiction, since either $x$ and $y'$ are not disjoint or $x'$ and $y$ are not disjoint, and thus at least one of these pairs cannot share the same transcript as the disjoint pairs.   (The reason for non-disjointness: since $x \ne x'$ there must be an index $i$ with $x_i \ne x'_i$. If $x_i = 1$ and $x'_i=0$ then $y_i = 0$ and $y'_i=1$, in which case $x$ and $y'$ are not disjoint. Similarly, if $x'_i = 1$ and $x_i=0$ then $y'_i = 0$ and $y_i=1$, in which case $x'$ and $y$ are not disjoint.)

## Finding a Pure Nash Equilibrium

Back to our setting, where each player $i$ holds his own utility function $u_i : S_1 \times \cdots \times S_n \rightarrow \Re$, where the $S_i$‘s are the commonly known strategy sets, which we will assume have size $m$ each. Thus each player’s input consists of $m^n$ real numbers, which we will always assume are in some small integer range. Perhaps it is best to start with a non-trivial upper bound.  Let us consider the class of games that are (strictly) dominance solvable, i.e. those for which iterated elimination of strictly dominated strategies leaves a single strategy profile, which is obviously a pure Nash equilibrium.  A simple protocol to find this equilibrium is the following:

• Repeat $mn$ times:
  • For $i = 1, \ldots, n$:
    • If player $i$ has a strategy that is dominated after the elimination of all previously removed strategies, then announce it; else say “pass”.
• Output the (single) profile of non-eliminated strategies.
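The elimination step that drives this protocol can be sketched, for the two-player case, as follows (a minimal illustration with names of my own choosing; the actual protocol would only broadcast which strategy is eliminated in each round):

```python
def solve_by_dominance(uA, uB):
    """Iterated elimination of strictly dominated pure strategies in a
    two-player game (uA, uB are m x m payoff matrices for the row and
    column player).  For a dominance-solvable game, a single strategy
    survives for each player."""
    rows, cols = set(range(len(uA))), set(range(len(uA[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):   # row r strictly dominated by some other row r2?
            if any(all(uA[r2][c] > uA[r][c] for c in cols) for r2 in rows - {r}):
                rows.remove(r)
                changed = True
        for c in list(cols):   # column c strictly dominated by some column c2?
            if any(all(uB[r][c2] > uB[r][c] for r in rows) for c2 in cols - {c}):
                cols.remove(c)
                changed = True
    return rows, cols

# Prisoner's dilemma: strategy 0 = cooperate, 1 = defect.
uA = [[3, 0], [5, 1]]
uB = [[3, 5], [0, 1]]
surviving = solve_by_dominance(uA, uB)
```

For the prisoner’s-dilemma payoffs above, elimination leaves the single profile (defect, defect), which is indeed the game’s unique pure Nash equilibrium; note that only the identities of eliminated strategies, not the payoffs themselves, would need to be broadcast.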

What is remarkable about this protocol is that the number of bits communicated is polynomial in $n$ and $m$, which may be exponentially smaller than the total size of the input, which is $O(m^n)$ numbers for each player.  Can some protocol like this be designed for all games that have a pure Nash equilibrium?  The basic answer is no:

Theorem: Every deterministic (or randomized) protocol that recognizes whether the given game has a pure Nash equilibrium requires $\Omega(m^n)$ bits of communication.

Note that this theorem implies in particular that one cannot design a more efficient protocol that finds an equilibrium just for games that are guaranteed to have one, as adding a verification stage (where each player checks whether the strategy suggested to him is indeed a best reply to those suggested to the others) would convert it to a protocol that recognizes the existence of a pure Nash.

Proof: The proof is by a reduction from the disjointness problem.  We will show that a protocol for determining the existence of a pure Nash equilibrium for games with $n$ players each having $m$ strategies can be used to solve the disjointness problem on strings of length $N = \Omega(m^n)$.

Let us start with the case $n=2$ and solve disjointness for length $N=(m-2)^2$.  The idea is very simple: Alice will build a utility matrix $u_A$ by filling it with the values of her bit string $x$ in some arbitrary but fixed order, and Bob will build a matrix $u_B$ from $y$ using the same order.  Now, any cell of the two matrices where $x_i=y_i=1$ will turn out to be a Nash equilibrium, since both players get the highest utility possible in this matrix.  For this idea to formally work, we just need to make sure that no other cell is a Nash equilibrium.  This can be done in one of several ways (and different ones are used in the C&S and H&M papers), the simplest of which is adding two more rows and columns as follows:

With the addition of these rows and columns, each player’s best-reply always obtains a value of 1, and thus only a (1,1) entry may be a Nash equilibrium.
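The figure with the extra rows and columns does not survive in this copy of the post, so here is one concrete gadget with the stated property (my own reconstruction, not necessarily the one used in the post or the papers): the two extra rows give Alice payoff 1 in every inner column, the two extra columns give Bob payoff 1 in every inner row, and the 2 x 2 corner plays matching pennies so that the gadget itself creates no equilibrium.

```python
from itertools import product

def build_game(x, y, m):
    """Embed the (m-2)**2-bit strings x and y of Alice and Bob into m x m
    payoff matrices so that a pure Nash equilibrium exists iff some index
    has x_i = y_i = 1."""
    k = m - 2
    uA = [[0] * m for _ in range(m)]
    uB = [[0] * m for _ in range(m)]
    for i in range(k):
        for j in range(k):
            uA[i][j], uB[i][j] = x[i * k + j], y[i * k + j]
    for j in range(k):                  # extra rows: Alice gets 1 in every inner column
        uA[k][j] = uA[k + 1][j] = 1
    for i in range(k):                  # extra columns: Bob gets 1 in every inner row
        uB[i][k] = uB[i][k + 1] = 1
    uA[k + 1][k] = uA[k][k + 1] = 1     # matching-pennies corner: best replies cycle,
    uB[k][k] = uB[k + 1][k + 1] = 1     # so no equilibrium among the extra strategies
    return uA, uB

def has_pure_nash(uA, uB):
    """Brute-force check: a cell is a pure Nash equilibrium iff each player's
    payoff there is a best reply to the other's strategy."""
    m = len(uA)
    return any(uA[i][j] == max(uA[r][j] for r in range(m)) and
               uB[i][j] == max(uB[i][c] for c in range(m))
               for i, j in product(range(m), repeat=2))

# Exhaustive check for m = 4 (4-bit strings): pure Nash exists iff x and y intersect.
k = 2
for bits in product([0, 1], repeat=2 * k * k):
    x, y = list(bits[:k * k]), list(bits[k * k:])
    uA, uB = build_game(x, y, k + 2)
    assert has_pure_nash(uA, uB) == any(a and b for a, b in zip(x, y))
```

With this gadget every best reply earns exactly 1, matching the property stated above, so a Nash cell needs both payoffs equal to 1, which inside the inner block means exactly $x_i = y_i = 1$; the loop at the end verifies the equivalence exhaustively for the smallest interesting case.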

Now for general $n$.  We add to Alice and Bob $n-2$ players whose utilities are identically 0.  Thus any strategy will be a best reply for these new players, and a pure Nash equilibrium is determined solely by Alice and Bob.  The main change that the new players bring is that the utility functions of Alice and Bob are now $n$-dimensional matrices of total size $m^n$.  Thus Alice and Bob can fold their length-$N$ bit strings into this larger table (as before, keeping two strategies from their own strategy space to ensure no non-(1,1) equilibrium points), allowing them to answer disjointness on $N=(m-2)^2 \cdot m^{n-2}$-bit long vectors.

## Variants

There are several interesting variants on the basic question:

1. The lower bound uses heavily the fact that the utilities of different strategies may be equal and thus there are many possible best replies.  Would finding a Nash equilibrium be easier for games in “general position” — specifically if there was always a single best reply for each player?  For the case of $n=2$ players, C&S show that the complexity does go down to $\Theta(m \log m)$.  (Exercise: show that the randomized communication complexity in this case is $\Theta(m)$.)  As $n$ grows, the savings are not exponential though and H&M show a $2^{\Omega(n)}$ lower bound for constant $m$.  A small gap remains.
2. We have seen that finding an equilibrium of a dominance-solvable game is easy.  Another family that is guaranteed to possess a pure equilibrium is potential games.  However, H&M show that finding a Nash equilibrium in an ordinal potential game may require exponential communication.  I don’t know whether things are better for exact potential games.
3. The efficient protocol for dominance solvable games was certainly artificial.  However, an efficient natural protocol exists as well: if players just best reply in round robin fashion then they will quickly converge to a pure Nash equilibrium.  (This seems to be well known, but the only link that I know of is to one of my papers.)

1. A collection of AGT-related courses (more complete and up to date than the one on the sidebar of this blog) was compiled by Haris Aziz.
2. The “Privacy Enhancing Technologies award” is given to Cynthia Dwork and to Frank McSherry and Kunal Talwar for their work on differential privacy as well as its relation to mechanism design.  Congratulations.
3. The NYT had a piece on whether game theory can predict whether Iran will get the bomb.  The political scientist Bruce Bueno de Mesquita says so, also in this TED talk.  Skepticism is of course in order.  (Hat tip to Edna Ullmann-Margalit and to Gil Kalai.)
4. It seems that the everlasting discussion of which programming language to use for CS1 is heating up again.  This round seems to be about moving from Java to Python.  MIT’s switch from Scheme to Python last year is getting attention.

## HTML considered educational

My 9-year-old girl is now going to an HTML summer camp. This was her choice, rather than, say, horseback riding or “molecular cooking”, mostly due to the influence of a friend (a girl too, mind you). Needless to say, I was shocked by this choice: of all things related to computers, HTML syntax seems the most useless for humans to learn. And, yes, they are really learning to write these little HTML tags themselves, in Notepad.

I have to admit that I’ve changed my mind. This is the first exposure that she has had to a formal language: some strange syntactic set of rules that has an attached semantic meaning. It is empowering: you can write some symbols and it does something that you intended it to do. You get immediate feedback, and can fix things if they are wrong. It is the first time that she sees why you need to follow the rules — otherwise it just won’t work. This is in contrast to what usually happens in school where all the rules for organizing her notebooks or homework may be safely ignored unless the teacher “catches” them. (Example: The way she solves $1 \frac{1}{2} + 3 \frac{1}{3}$ is $1 \frac{1}{2} + 3 \frac{1}{3} = \frac{3}{6}+\frac{2}{6} = 4 \frac{5}{6}$, where the point is that it’s wasteful to copy the integer parts when you’re doing the common denominator.)

Now this friend of hers is really into Scratch, which is real programming, and my own daughter seems to be getting into it too. Even though I did accidentally mention that Scratch is programming, this didn’t seem to turn them off or appear “un-girly” to them (in fact my daughter’s first free association with programming was the Segway and free ice cream that she’s seen at Google). Starting to program may turn out to be another benefit of HTML summer camp.

## Math Blogosphere

Physics blogger John Baez is writing a piece for Notices of the AMS on “what do mathematicians need to know about blogging?” and asks on his blog:

> So, just to get the ball rolling, let me ask: what do you think mathematicians need to know about blogging?

Many of the comments are interesting.  For example Terry Tao says:

> I’m of course very enthusiastic about blogs as a medium for mathematical communication; it seems to fill in a niche between formal mathematical publications and informal seminars and conversations, as it combines the durable availability of the former with the interactivity of the latter.