Evaluation of scientific research is notoriously hard, almost by definition: success means doing something not done before, but if it was not done before, how can we evaluate it? We are now seeing a huge shift everywhere in how this is done. It used to be that we designated some person as the authority for such judgement and then asked him (only rarely was it “her”): the chair of the anthropology department (say) decided which anthropologist to hire, hopefully after getting authoritative “letters” from other anthropologists. Similarly, some committee of Full Professors used its authoritative judgement to decide which departments in the university should grow, which research projects should be funded, or which universities in a country to pour money into. Instead, we are increasingly evaluating everything by numbers: counting publications, journal rankings, impact factors, citations, h-indices, grant amounts, numeric assessment exercises, and so on, attempting to base more and more scientific evaluations on such measures.
Many academics are vehemently opposed to this numeric evaluation trend: they point correctly to various errors and shortcomings of the numeric measures; they point out that in order to reasonably evaluate work in anthropology (say) you must understand this type of work; they object to popularity contests determining scientific progress; and they fear that the mechanical-economic-political control that these bibliometric approaches bring into the process of scientific discovery will bias it away from the “true” path.
I agree with many of these criticisms of bibliometric evaluation; however, I have to admit that I see the authority-based system as even more problematic. The local authorities are people, and as such they have huge biases: obviously they share the common human ethnic, religious, and gender biases, which are absent from any bibliometric measure that was not viciously chosen. Even more troublesome are personal biases: we all tend to think that our friends and students are smarter than others whom we don’t yet personally know. Worst of all are scientific biases: the authorities tend to defend their scientific turf against new approaches, new fields, competing ideas, and so on. The combination of these natural biases encourages inbreeding and conformity and discourages change and innovation, the very things academia should encourage. This bites especially hard in two cases where it counts most: the first is when we happen to start with weak authorities (as the quadratic law of hiring says: 1st rate people hire 1st rate people, but 2nd rate people hire 4th rate people), and the second is areas that are in rapid flux. It is not that numeric systems are completely free of these problems, since ultimately they are based on human authorities too (those who accept papers for publication, cite them, or award grants), but as they are more global, more transparent, and involve a larger number of people, they are less prone to biases and may support faster adoption of change. Additionally, there is the issue of minimizing self-interest: professors cannot be trusted to evaluate themselves any more (or less) than any other segment of the population can. I cannot think of a single academic department whose self-evaluation will be “we stink; don’t give us any money”, even though I can think of many that would get this evaluation from any non-self-interested evaluation method.
Additionally, measuring scientific success in a way that can be understood and believed from the outside is simply unavoidable. The taxpayers are being called on to fund scientific research in ever-increasing amounts. They want to know why. Frankly, society will not continue funding it (i.e. us academics) if we do not make a convincing argument why supporting research is preferable to improving elementary schools, building roads, increasing the minimum wage, or reducing taxes. This question will be asked in one way or another by every politician who needs to allocate the money, and in every political system. Luckily, there are excellent answers demonstrating the human value of academic research, innovation, and critical thinking, with examples ranging from the Socratic method to the Internet. These answers seem to be quite effective, and in many places the political system has indeed been convinced that pouring more money into academic research is a good idea. But very few administrators will be convinced that a carte blanche is called for. In fact, the more academic research is deemed to be “useful” (whether practically or just in the sense of advancing humanity’s quest for truth and beauty), the more emphasis will be put on its measurement and “optimization”. The close ties between academic research and education, and the ever-increasing numbers of college and university students, only strengthen this further. The charm of bibliometric/numeric measurements in comparison with authority-based ones is that it is harder to corrupt them or politically bias them, as the latter very often are.
So, given that we are moving to more mechanical forms of evaluating academic excellence, how can we avoid most of the pitfalls? This question applies at all levels, from the evaluation of single candidates for appointment or tenure, to funding departments within an organization, to national policy. This is really a critical question in our day and age, where entire countries are overhauling the way that they evaluate their academic research. Famously and transparently this is the case for Great Britain, but it is also true in a large sense for China, as well as my own country, Israel, and many other nations (at various levels of explicitness). These changes may turn out to have profound implications for human progress. Whether good or bad we surely do not yet know, but we should definitely aim for the good.
So here are some suggestions on how to sanely use bibliometric/mechanical evaluations:
- Don’t be silly: Don’t use indications that obviously do not make sense. If in some field of knowledge publishing books or presenting at conferences is de facto a stronger indication of excellence than journal publications, then counting only the latter is simply silly.
- Measure quality, not quantity: Do not count papers, count citations. Do not count the number of faculty members, count the number of award-winning faculty members. Do not count Ph.D.s produced, count Ph.D.s who got good positions.
- Measure strategically: Count what is harder to artificially fake; scarcity and competition are some signal of quality; ease of measurement matters. Self-citations are not an indication of influence; self-published books are not an indication of excellence; excellence of a conference or journal is negatively correlated with its acceptance rate and positively with the number of attendees or readers.
- Measure globally: There is no reason why excellence of a computer scientist, philosopher, or almost any other researcher should be judged differently in different institutes or different countries. If you are from a small country and publish in your own country, that is usually not a sign of excellence. (It is usually not a good sign if French researchers publish in French or Chinese ones in Chinese.) Pick two random universities A and B with their different evaluation systems: in most cases, using B’s system to evaluate A gives more honest indications than using A’s system.
- Measure widely: There are many different ways to count anything, such as university rankings or citation counts. (E.g. websites that offer easy-to-access citation counts in CS include at least Google Scholar, ArnetMiner, CiteSeer, Microsoft Academic Search, as well as doubtless others.) They differ from each other in many parameters of what they count and even what they aim to capture. The average of several (reasonable) citation counts is usually preferable to any single one. As non-numeric committees always do, it is best to take into account as many indicators of excellence as possible (e.g. citation counts, publication-venue ranking, prizes, grants, esteemed academic responsibilities, …)
- Vary your measurements: Occasionally varying the technical details of what you measure makes it harder for the evaluated to strategically game the system, and allows the gradual optimization of the evaluation process.
- Judgement and responsibility remain with people: Any numerical measurement is only a proxy for excellence and not the thing itself. If you recognize excellence that is not captured by the numerical system that you are using, then it is still your responsibility to use your judgement. This of course should be rare, should come with a strong explanation, and may use up some of your professional credit, but the responsibility always remains with the people making the decision.
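
To make a couple of the suggestions above concrete, here is a minimal Python sketch of two of them: discounting self-citations ("measure strategically") and averaging the counts reported by several sources ("measure widely"), alongside the usual h-index definition. The function names and the idea of passing counts in as a plain dictionary are my own assumptions for illustration; none of the sites mentioned above is actually queried, and any numbers you feed in would have to be collected by hand or by whatever means each site allows.

```python
from statistics import mean


def h_index(citations_per_paper):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations_per_paper, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def net_citations(total_citations, self_citations):
    """Citation count with self-citations removed (never below zero)."""
    return max(total_citations - self_citations, 0)


def combined_citation_count(counts_by_source):
    """Average the citation counts reported by several sources.

    `counts_by_source` is a hypothetical mapping from a source label
    (whatever services you consulted) to the count it reported.
    """
    if not counts_by_source:
        raise ValueError("need at least one source")
    return mean(counts_by_source.values())
```

For example, `combined_citation_count({"source_a": 1200, "source_b": 950, "source_c": 1000})` averages three made-up counts into a single figure. A plain mean is only one reasonable choice: a median would be less sensitive to one source that over-counts, and in the spirit of varying your measurements the exact combination rule can be changed from time to time.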