As pointed out by Michael Mitzenmacher and Rakesh Vohra (who will soon carry the respectable but grammatically weird title “Penn Integrates Knowledge Professor”), the National Science Foundation SSS program is trying an intriguing new peer review method, which is based on a paper by Merrifield and Saari (Merrifield is an astronomer and Saari is a prominent social choice theorist). Vohra gives a concise summary of the proposed method:
1) There are N agents each of whom submits a proposal.
2) Each agent receives m < N proposals to review (not their own).
3) Each agent compiles a ranked list of the m she receives, placing them in the order in which she thinks the community would rank them, not her personal preferences.
4) The N ranked lists are combined to produce an optimized global list ranking all N applications.
5) Failure to carry out this refereeing duty by a set deadline leads to automatic rejection of one’s proposal.
6) Individual rankings are compared to the positions of the same m applications in the globally-optimized list. If both lists appear in approximately the same order, then the proposer is rewarded by having his proposal moved a modest number of places up the final list relative to those from proposers who have not matched the community’s view as well.
An interesting detail worth mentioning is that in step 4, the lists are combined using a well-known voting rule called Borda count: Each agent awards m-k points to the proposal ranked in the k’th position. The purpose of the controversial step 6 is to incentivize good reviewing; but even Merrifield and Saari worry that “this procedure will drive the allocation process toward conservative mediocrity, since referees will be dissuaded from supporting a long-shot application if they think it will put them out of line with the consensus.”
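To make step 4 concrete, here is a minimal Python sketch of Borda aggregation over the reviewers' batches; the data layout and function names are my own illustration, not part of the NSF pilot or the paper.

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Combine ranked lists: the proposal in position k of a list of length m
    receives m-k points; the global list orders proposals by total points."""
    scores = defaultdict(int)
    for ranked_list in rankings.values():
        m = len(ranked_list)
        for k, proposal in enumerate(ranked_list, start=1):
            scores[proposal] += m - k
    global_list = sorted(scores, key=scores.get, reverse=True)
    return global_list, dict(scores)

# Toy example with batches of size m = 3 (best first):
rankings = {
    "reviewer1": ["p2", "p5", "p7"],
    "reviewer2": ["p5", "p1", "p2"],
}
print(borda_aggregate(rankings))
```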
Overall I think the proposal is interesting, and certainly could be superior to the status quo. The main criticism that immediately comes to mind was also expressed by Vohra: The proposed method feels somewhat unprincipled, in that it gives no theoretical guarantees. But perhaps such a method that is actually implemented is better than impossibility results that lead nowhere — and it could be refined over time. Because criticizing is fun (and sometimes even helpful), here are two more points.
First, strategic manipulation — the bane of social choice. Under the current system, NSF panels are populated by people who did not submit proposals, so the incentive for manipulation is small. Under the proposed system it is not clear that you cannot increase the chances of your own proposal by placing good proposals at the bottom of your ranking. The problem is similar to classic social choice questions, but in a setting where the set of agents and the set of alternatives coincide; one would want the method to be impartial, in the sense that whether your proposal is funded is independent of your submitted ranking. Recent papers deal with very similar problems: Holzman and Moulin have an Econometrica 2013 paper where the goal is to impartially select a single alternative rather than ranking the alternatives; and my TARK’11 paper with Alon, Fischer, and Tennenholtz looks at the problem of impartially selecting a subset of alternatives (we actually discuss an application to peer review in Section 5). Our mechanism is simple: If you want to select k agents, randomly partition the agents into t subsets, and select the best k/t agents from each subset based only on votes from other subsets. I know that Holzman and Moulin have tried to design (axiomatically) good mechanisms for impartial ranking and ran into impossibilities, but is ranking actually necessary? NSF just needs to select a subset (of known size) of proposals to be funded. Would I recommend our mechanism, as is, to NSF? Umm, no. But the ideas and approach could be useful.
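For what it’s worth, here is a rough sketch of the partition mechanism just described, assuming each agent simply nominates (approves) some of the other agents; the function name, the tie-breaking, and the assumption that t divides k are mine.

```python
import random

def impartial_select(agents, votes, k, t, seed=0):
    """Randomly partition agents into t groups and select the top k/t agents
    of each group, counting only votes cast by agents outside that group."""
    rng = random.Random(seed)
    shuffled = list(agents)
    rng.shuffle(shuffled)
    groups = [shuffled[i::t] for i in range(t)]
    selected = []
    for group in groups:
        outside = [a for a in agents if a not in group]
        # An agent's score counts only approvals coming from outside her group.
        score = {a: sum(a in votes.get(v, set()) for v in outside) for a in group}
        selected += sorted(group, key=score.get, reverse=True)[: k // t]
    return selected

agents = ["a", "b", "c", "d", "e", "f"]
votes = {"a": {"b"}, "b": {"c"}, "c": {"b"}, "d": {"b", "e"}, "e": {"f"}, "f": {"e"}}
print(impartial_select(agents, votes, k=2, t=2))
```

Because an agent’s votes only affect agents in other groups, no agent can influence whether she herself is selected.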
Second, randomness. Of course, peer review is random, but usually the randomness is not so explicit. Under the current system, a panelist who happens to get a batch of bad proposals can give bad scores and reviews to all of them. Under the proposed system, an agent who is randomly assigned a batch of bad proposals has to rank one of them on top, thereby giving it m-1 points — the same number of points a proposal that is judged to be fantastic would get from beating a batch of good proposals! And there are no reviews, so essentially the number of points awarded by each reviewer is all that matters. Moreover, it is very difficult to rank a batch of bad proposals, so if you get an unlucky batch you would probably lose your potential bonus from step 6 of the proposed method. The obvious way to ameliorate these issues is to use a large value of m. The NSF program uses m=7, which on the one hand seems too small to prevent randomness from playing a pivotal role, and on the other hand seems very large as each proposal is read by seven people (instead of just four today). Increasing m would make the reviewing load unbearable.
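To illustrate how much batch luck can matter, here is a back-of-the-envelope simulation; the uniform quality model and the assumption that reviewers rank their batch by true quality are my own, purely for illustration.

```python
import random

def borda_points(quality, batch_qualities):
    """Points a proposal of the given quality earns from one reviewer who
    ranks the whole batch by quality: m-1 if it is best, 0 if it is worst."""
    batch = sorted(batch_qualities + [quality], reverse=True)
    return len(batch) - 1 - batch.index(quality)

random.seed(0)
m, trials = 7, 1000
mediocre = 0.5
weak = [[random.uniform(0.0, 0.4) for _ in range(m - 1)] for _ in range(trials)]
strong = [[random.uniform(0.6, 1.0) for _ in range(m - 1)] for _ in range(trials)]
print(sum(borda_points(mediocre, b) for b in weak) / trials)    # ~6: tops a weak batch
print(sum(borda_points(mediocre, b) for b in strong) / trials)  # ~0: bottom of a strong batch
```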
That said, kudos to the NSF for experimenting with new peer review methods. I expect that now the NSF will have to find a way to review a wave of proposals about how to review NSF proposals.
Update June 11 10:10am EST: Here is one possible solution to the strategic manipulation problem, which gives some kind of theoretical guarantee. The desired property is that a manipulator cannot improve the position of his own proposal in the global ranking by reporting a ranking of the m proposals in his batch that is different from the ranking of these m proposals based on others’ votes. Agents are penalized for bad reviewing according to the maximum displacement distance between their ranking and the others’ aggregate ranking, i.e., the maximum difference between the position of a proposal in one ranking and the position of the same proposal in the other ranking. If a manipulator pushes a proposal x down in his ranking, his own score in the global ranking (modified by the penalty) decreases by at least as many points as x’s Borda score in the global ranking, therefore the manipulator can only move down in the global ranking.
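A small sketch of this penalty, under my reading of the description above (variable names are illustrative):

```python
def max_displacement(submitted, reference):
    """Maximum difference between a proposal's position in the reviewer's
    submitted ranking and its position in the ranking of the same m proposals
    induced by the others' votes (both lists are best-first)."""
    ref_pos = {p: i for i, p in enumerate(reference)}
    return max(abs(i - ref_pos[p]) for i, p in enumerate(submitted))

submitted = ["p3", "p1", "p4", "p2"]
reference = ["p1", "p2", "p3", "p4"]
print(max_displacement(submitted, reference))  # 2: p3 and p2 are each off by two places
```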
Here’s what I wrote to someone who asked me about this:
It’s nice that they’re trying, but the incentive to agree with others and the lack of a discussion phase can get you into some bad equilibria. You have a negative incentive to stand up for something unusual if you think the rest of the community won’t like it. You have a negative incentive to downgrade a proposal for having already been done by others if you don’t think other reviewers are aware of this. You have no mechanism to make the other reviewers aware of this. And you could in principle even get into bizarre equilibria where everyone likes topic X but it’s common knowledge that X gets bad reviews, so everyone has an incentive to continue giving X bad reviews.
Of course the counterargument is that these incentives are not so big, but that leads to other trouble. Given funding rates, it seems you have an incentive to rank weak proposals above strong ones, since, assuming your proposal has any chance at all, you are competing only with the strong ones.
Finally, not everyone submitting a proposal is a qualified reviewer.
Agreed. Merrifield and Saari address your last point in their paper, although I don’t find their “response” convincing:
“A related concern is how more junior applicants such as students will be able to handle the refereeing requirement with little past experience. However, since the purpose here is to integrate the refereeing into the application process, there is a natural mechanism for mentoring such applicants: just as more senior applicants on a proposal would be expected to provide input into writing a convincing case, so they would be expected to help train their more junior colleague in the assessment part of the process.”
*Junior* is not what I am concerned about…
(That being said, I agree it’s good of them to try new things. But not giving any feedback or having any discussion phase seems very problematic.)
A few years ago I saw a paper submitted to some conference (it wasn’t accepted, I think) that wanted to incentivize consensus by first using the DeGroot model to simulate a discussion phase and find a consensus, and then score reviewers based on how close their scores were to the consensus.
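From memory, the rough idea was something like the following sketch: iterate DeGroot averaging (x ← Wx with a row-stochastic trust matrix W) to simulate a discussion converging on a consensus, then penalize each reviewer by her distance from it. The trust matrix and scores below are made up, and the details surely differ from the actual paper.

```python
import numpy as np

def degroot_consensus(W, scores, iterations=100):
    """Iterate DeGroot averaging until the reviewers' scores (approximately) agree."""
    x = np.array(scores, dtype=float)
    for _ in range(iterations):
        x = W @ x
    return x

W = np.array([[0.6, 0.2, 0.2],
              [0.3, 0.4, 0.3],
              [0.2, 0.2, 0.6]])   # row-stochastic "trust" weights
scores = [8.0, 5.0, 2.0]          # reviewers' initial scores for one paper
consensus = degroot_consensus(W, scores)
print(consensus, np.abs(np.array(scores) - consensus))  # consensus and per-reviewer penalties
```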
I can’t remember any more details. However it seems this is a never-ending topic of discussion. Maybe we should have a “petit challenge” for the algorithmic economics community to work on this problem. It is not as frivolous as it sounds.
In case this is of interest: the (pitifully small) amounts of funding doled out in NZ by the sort-of-equivalent of NSF are done this way: reject 80% based on a CV and 1-page abstract, then reject 50% of the remaining applications after asking them to put in a full application. As far as I can tell, the decision at the first step is done by Borda scoring plus some discussion in secret. The amazing thing is, I can’t seem to convince those in charge to release the total score to the applicant. Knowing how far away you are from the cutoff should be useful information, not contravene any privacy rules, and not pose an undue reporting burden. This only adds to my suspicion that the secret discussion part is not as above board as we would like. Personally, I would be happy to see more randomization used – as far as I can tell, only a few applications are clear accept or reject at the first stage, and a lot of energy is used trying to make fine distinctions based on very little information.
Another line of research related to this NSF mechanism is peer prediction (also referred to as information elicitation without verification); these mechanisms explicitly handle settings where a reviewer is aware that her opinion is a minority opinion. That is, in contrast to the NSF mechanism, peer prediction mechanisms do not simply reward agreement. For example, in a simple binary setting with only accept/reject as possible scores, all that is required is that opinions are positively correlated, so that if reviewer 1 believes that the paper should be accepted, then her belief that reviewer 2 reports acceptance is higher than if reviewer 1 believed that the paper should be rejected. (For non-binary settings, there are similar assumptions.)
A subclass of peer prediction mechanisms actually gives reviewers an explicit way to report that they believe to be in the minority. In these Bayesian Truth Serum mechanisms, reviewers make two reports: 1. the actual score of the paper (e.g. accept or reject) and 2. a prediction of the scores given by other reviewers about the same paper. In the binary setting with “accept” and “reject”, for example, this prediction report would be the answer to the question “What is your belief that another randomly-chosen reviewer of the same paper voted for acceptance?”
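For concreteness, here is a minimal sketch of Prelec-style BTS scoring in this binary setting; the epsilon smoothing is my own addition to avoid taking logs of zero, and the exact scoring rule may differ across BTS variants.

```python
import math

def bts_scores(reports, alpha=1.0, eps=1e-6):
    """Each report is (vote, prediction): vote is 1 for accept, 0 for reject;
    prediction is the reviewer's estimate of the fraction of accept votes."""
    votes = [v for v, _ in reports]
    preds = [p for _, p in reports]
    n = len(reports)
    x1 = min(1 - eps, max(eps, sum(votes) / n))  # empirical accept frequency
    x0 = 1 - x1
    # Geometric means of the predicted accept / reject frequencies.
    y1 = math.exp(sum(math.log(max(eps, p)) for p in preds) / n)
    y0 = math.exp(sum(math.log(max(eps, 1 - p)) for p in preds) / n)
    scores = []
    for vote, pred in reports:
        # Information score: surprisingly common answers score well.
        info = math.log(x1 / y1) if vote == 1 else math.log(x0 / y0)
        # Prediction score: rewards accurate forecasts of others' votes.
        prediction = x1 * math.log(max(eps, pred) / x1) + x0 * math.log(max(eps, 1 - pred) / x0)
        scores.append(info + alpha * prediction)
    return scores

# Three reviewers of one paper: (vote, predicted accept rate among the others)
print(bts_scores([(1, 0.7), (1, 0.6), (0, 0.3)]))
```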
Peer prediction also handles the randomness problem Ariel mentioned by using scores instead of rankings, so if a reviewer receives a bad batch of papers she simply reports lower scores on all of them. I think it would be very interesting to extend current peer prediction mechanisms to the NSF setting where the set of reviewers and the set of papers coincide.
Yes, this connection occurred to me, but the peer prediction methods that I had in mind (like the Bayesian truth serum and, I think, your own work as well) rely on a notion of score that one wants to maximize. It seems very difficult to translate this score into a reward that works in the “agents and alternatives coincide” setting, where we can only reward agents by changing their positions in the global ranking.
Hmm, but see the update at the end of the post.
Couldn’t reviewers’ “prize” be, not a function of agreement with other reviewers, but of the number of citations?
If the intention is to award good reviewing, then the quality of the research reviewed should be the measure.
A couple of quick comments:
One possibility I have thought about would be to introduce a comment phase before voting, in which each referee can circulate anonymized comments to the group looking at a particular proposal. This would give the opportunity to make sure that others have the same important information that you do, such as “I happen to know that X did this in 1977 — here’s the reference” to avoid some of the issues raised above. This would make the process rather more like a conventional review panel, but without the airfares, and with the workload distributed more equitably.
On the reward for predicting strong citations, aside from the time-lag issue, my experience has been that some of the best proposals are those for which I cannot predict the likely citations: if it pays off, it will be thousands; if it turns out to be a dead end, very few. Referees will therefore tend to err on the side of caution with such applications and opt for “money in the bank” applications that are bound to return a healthy crop of citations, thus eliminating the very proposals we should be funding because the answer is truly unknown in an interesting way.
Thanks for your input, Michael. The comment phase is a very good idea. As far as I know, NSF is not planning to implement it, though.
Dear Ariel
Yes, Penn Integrates Knowledge appears ungainly. The acronym, PIK, rolls off the tongue easily enough but brings to mind the image of a dental instrument.
To more serious matters. On proposal review, an important thing to keep in mind is the message space. One is not constrained to use a small message space like a summary score or single ranking. I believe that some of the concerns raised about the Merrifield and Saari proposal can be addressed through the choice of the message space.
rakesh
I agree with the Professor of Knowledge who will soon be integrated by Penn that a richer message space is desirable; natural language reviews would still seem to be the most natural. The question is how to use this extra information in determining rewards. One option is simply not to do so — just let reviewers report additional information that has no effect on the mechanism’s outcome. Can’t hurt — but a more effective approach may be to reward reviewers for sharing information that moves other reviewers’ scores (in the right direction). I’ve speculated a little bit about how one might do something like this in Section 5 of this paper: http://www.cs.duke.edu/~conitzer/predictionUAI09.pdf
More creative use of peer reviews has always had strong appeal for me. As an Air Force officer, active & reserve (1967-1995) I felt that our annual performance appraisal process was deeply flawed, coupled with the “up or out” requirement. Putting your career prospects in the hands of one senior officer worked OK in some cases but not so well in others, as all unit commanders were not created equal. My recommendation was to balance the commanding officer’s evaluation with a peer review process in which each officer, at a particular grade level, would evaluate his/her peers at that grade level within that squadron or staff element: “excluding yourself, who in this organization deserves promotion?” I thought that would result in a much more “fair and balanced” assessment of every officer.