Evaluation of scientific research is notoriously hard, almost by definition: success means doing something not done before, but if it was not done before, then how can we evaluate it? We are now seeing everywhere a huge shift in how this is done. It used to be that we designated some person as the authority for such judgement and then asked him (only rarely was it “her”): the chair of the anthropology department (say) decided which anthropologist to hire, hopefully after getting authoritative “letters” from other anthropologists. Similarly, some committee of Full Professors used their authority to decide which departments in the university should grow, which research projects should be funded, or which universities in a country to pour money into. Instead, we are increasingly evaluating everything by numbers: counting publications, journal rankings, impact factors, citations, H factors, grant amounts, numeric assessment exercises, and so on, attempting to base more and more scientific evaluations on such measures.
Many academics are viciously opposed to this numeric evaluation trend: they point correctly to various errors and shortcomings of the numeric measures; they point out that in order to reasonably evaluate work in anthropology (say) you must understand this type of work; they object to popularity contests determining scientific progress; and they fear that the mechanical-economic-political control that these bibliometric approaches bring into the process of scientific discovery will bias it away from the “true” path.
I agree with many of these criticisms of bibliometric evaluation; however, I have to admit that I see the authority-based system as even more problematic. The local authorities are people, and as such they have huge biases: obviously they share various common human ethnic/religious/gender biases — biases absent from non-viciously-chosen bibliometric measures. Even more troublesome are personal biases: we all tend to think that our friends and students are smarter than others whom we don’t yet personally know. Worst of all are scientific biases: the authorities tend to defend their scientific turf against new approaches, new fields, competing ideas, and so on. The combination of these natural biases encourages inbreeding and conformity, and discourages change and innovation — the very things academia should encourage. This bites especially hard in the two cases where it counts most: the first is if we happen to start with weak authorities (as the quadratic law of hiring says: 1st-rate people hire 1st-rate people, but 2nd-rate people hire 4th-rate people), and the second is areas that are in rapid flux. It is not that numeric systems are completely free from all these problems, since ultimately they too are based on human authorities (those who accept papers for publication, cite them, or award grants), but as they are more global, more transparent, and involve a larger number of people, they are less prone to biases and may support faster adoption of change. Additionally, there is the issue of minimizing self-interest: Professors cannot be trusted to evaluate themselves any more (or less) than any other segment of the population can. I cannot see a single academic department whose self-evaluation will be “we stink — don’t give us any money”, even though I can see many that will get this evaluation by any non-self-interested evaluation method used.
Additionally, measuring scientific success in a way that can be understood and believed from the outside is simply unavoidable. Taxpayers are being called on to fund scientific research in ever increasing amounts. They want to know why. Frankly, society will not continue funding it (i.e. us academics) if we do not make a convincing argument why supporting research is preferable to improving elementary schools, building roads, increasing the minimum wage, or reducing taxes. This question will be asked in one way or another by every politician who needs to allocate the money and in every political system. Luckily, there are excellent answers demonstrating the human value of academic research, innovation, and critical thinking, with examples ranging from the Socratic method to the Internet. These answers seem to be quite effective, and in many places the political system has indeed been convinced that pouring more money into academic research is a good idea. But very few administrators will be convinced that a carte blanche is called for. In fact, the more academic research is deemed to be “useful” (whether practically or just in the sense of advancing humanity’s quest for truth and beauty), the more emphasis will be put on its measurement and “optimization”. The close ties of academic research and education and the ever increasing numbers of college and university students only strengthen this further. The charm of bibliometric/numeric measurements in comparison with authority-based ones is that it is harder to corrupt them or politically bias them, as the latter very often are.
So, given that we are moving to more mechanical forms of evaluating academic excellence, how can we avoid most of the pitfalls? This question applies at all levels, from the evaluation of single candidates for appointment or tenure, to funding departments within an organization, to national policy. This is really a critical question in our day and age, when entire countries are overhauling the way that they evaluate their academic research. Famously and transparently this is the case for Great Britain, but it is also true in a large sense for China, as well as my own country, Israel, and many other nations (at various levels of explicitness). These changes may turn out to have profound implications for human progress; whether for good or bad we surely do not yet know, but we should definitely aim for the good.
So here are some suggestions on how to sanely use bibliometric/mechanical evaluations:
- Don’t be silly: Don’t use indications that obviously do not make sense. If in some field of knowledge books or conference presentations are de facto a stronger indication of excellence than journal publications, then counting only the latter is simply silly.
- Measure quality, not quantity: Do not count papers, count citations. Do not count number of faculty members, count number of award-winning faculty members. Do not count Ph.D.s produced, count Ph.D.s who got good positions.
- Measure strategically: Count what is harder to artificially fake; scarcity and competition are some signal of quality; ease of measurement matters. Self-citations are not an indication of influence; self-published books are not an indication of excellence; the excellence of a conference or journal is negatively correlated with its acceptance rate and positively with the number of attendees or readers.
- Measure globally: There is no reason why the excellence of a computer scientist, philosopher, or almost any other researcher should be judged differently in different institutes or different countries. If you are from a small country and publish in your own country — that is usually not a sign of excellence. (It is usually not a good sign if French researchers publish in French or Chinese researchers in Chinese.) Pick two random universities A and B with their different evaluation systems — in most cases, using B’s system to evaluate A gives more honest indications than using A’s system.
- Measure widely: There are many different ways to count anything, such as university rankings or citation counts. (E.g., web-based sites that offer easy-to-access citation counts in CS include at least Google Scholar, arnetminer, citeseer, Microsoft academic search, as well as doubtless others.) They differ from each other in many parameters of what they count and even what they aim to capture. The average of several (reasonable) citation counts is usually preferable to any single one (see the sketch after this list). As non-numeric committees always do, it is best to take into account as many indicators of excellence as possible (e.g. citation counts, publication-venue ranking, prizes, grants, esteemed academic responsibilities, …).
- Vary your measurements: Occasionally varying the technical details of what you measure makes it harder for the evaluated to strategically game the system, and allows the gradual optimization of the evaluation process.
- Judgement and responsibility remain with people: Any numerical measurement is only a proxy for excellence and not the thing itself. If you recognize excellence that is not captured by the numerical system that you are using, then it is still your responsibility to use your judgement. This of course should be rare, should come with a strong explanation, and may use up some of your professional credit, but the responsibility always remains with the people making the decision.
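To make the “measure widely” point about averaging concrete, here is a minimal sketch. The source names and numbers are hypothetical placeholders, not data taken from any real service, and a plain mean stands in for whatever more careful combination one would actually use:

```python
# A minimal sketch of the "measure widely" suggestion: average several
# (reasonable) citation counts rather than trusting any single one.
# The source names and numbers below are hypothetical placeholders.

def combined_citation_score(counts_by_source):
    """Average the citation counts reported by several sources.

    A plain mean is used for simplicity; a real evaluation would also
    want per-source normalization (databases differ wildly in coverage)
    and should sit alongside other indicators, not replace them.
    """
    values = [c for c in counts_by_source.values() if c is not None]
    return sum(values) / len(values) if values else 0.0

# Hypothetical example: one researcher, counts as reported by four sources.
counts = {
    "google_scholar": 1520,
    "citeseer": 980,
    "microsoft_academic": 1210,
    "arnetminer": None,  # source did not find the author; simply skipped
}
print(combined_citation_score(counts))  # -> 1236.67 (mean of the three known counts)
```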
While it is perhaps not of value for making invidious comparisons of an individual person’s work, I have found http://www.scirus.com/ of use in getting information about a particular person’s work.
The recently released and justifiably-maligned NRC rankings of CS departments were based on a variety of metrics, especially including bibliometrics. They were clearly screwed up (using only ISI citations, for example), but they point out the problems of “pseudo-authoritative” use of such metrics. How do you guard against this kind of thing? How do you judge the measures?
It seems that the NRC rankings fail my “don’t be silly” rule….
There is never a guarantee that people won’t do silly things. The good thing is that the NRC silliness was discovered and taken into account very quickly and will likely be fixed in the future. Stupid authorities may be just as silly but are hardly ever found out so quickly…
Noam,
the observation that acceptance rates are inversely correlated with quality would trash most of theoryCS relative to other communities within CS. The deeper point I guess is that calibrating metrics is tricky, and defining the community that one should calibrate against is also difficult as well as subject to gaming.
Very nice to see such a lucid post on such a “charged” topic. I think the main issue with objective criteria lies in the feedback loop. We can typically find criteria that are well correlated with what we want. But if we announce that we’re going to use them, people start “optimizing” and putting pressure on the correlation. (For example, in CS, we hear horror stories from subfields that have over-committed to one central conference, and in the process destroyed the dynamics that made that conference great.)
Perhaps there is some economic argument to be made about the value of a hidden optimization function: change the criteria regularly (e.g. every few years), promising to keep them reasonable.
I agree with Mihai. Perhaps one should recall Goodhart’s law, which states that “once a social or economic indicator or other surrogate measure is made a target for the purpose of conducting social or economic policy, then it will lose the information content that would qualify it to play such a role”.
While I agree with the criticisms raised by Paul, Suresh, Mihai, and Ugo, I do not see them as arguments in the ongoing debate for/against bibliometrics, as I view that argument as already decided. Thus I view these as issues to solve or at least to mitigate.
My feeling is that Paul and Suresh’s issues can be solved reasonably well with straightforward care, and that the major difficulty is indeed Goodhart’s “law” and variants thereof. But I don’t see this as a “law”: with care and ongoing effort “Goodhart’s problem” can be overcome. Indeed, most of my “suggestions” are supposed to help address this problem, making it harder to manipulate the numbers.
Mechanism Design may be applicable….
By “already decided” I assume that you mean that if places like NRC are doing it then the pattern is fixed already. I agree. There probably is a “straight-forward” solution to improving some of this but, given the politics, one-size-fits-all solutions rather than discipline-specific ones seem more likely to be adopted by such agencies.
I don’t see your “vary the metrics” bullet as very likely to be adopted either. Maybe related to Goodhart’s problem is the fact that, once established, metrics have a tendency to become entrenched, whether or not they even had value in the first place. If you keep changing them then there is no way to measure “improvement”.
My comment about judging the metrics was more a matter of how one decides how to use metrics. The very fact that we use the plural in the term indicates that these things are multidimensional. However, the tendency is to project things down to one dimension. As Malcolm Gladwell’s recent New Yorker article on US News college and car rankings reminds us, this is fraught with biases and can be pernicious.
> one-size-fits-all solutions rather than
> discipline-specific ones seem more likely to
> be adopted
> once established, metrics have a tendency to
> become entrenched, whether or not they even
> had value in the first place
Both are correct but fixable with reasonable discourse and effort. I am hoping that if academics complained about specific fixable issues such as these (and suggested how to fix them) rather than attacking the whole idea itself, then the fixable things would be fixed and a reasonable system would emerge after a while.
Do you have more specific suggestions on how to use bibliometric parameters? The “best” (so to speak…) I came across is eigenfactor (http://www.eigenfactor.org/), which uses a ranking algorithm similar to Google’s. However, it ranks journals, not papers. It would be nice to have something similar for papers, and see if something sensible comes out of it.
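For illustration, here is a minimal sketch of the generic PageRank-style iteration that such rankings are built on, applied to a toy paper citation graph. This is not eigenfactor’s actual algorithm, and the tiny graph is made up; it only shows what a paper-level analogue could look like in principle:

```python
# A toy sketch of a PageRank-style score over a paper citation graph.
# This is NOT eigenfactor's actual algorithm; the graph below is made up.

def citation_rank(cites, damping=0.85, iterations=50):
    """cites maps each paper to the list of papers it cites."""
    papers = list(cites)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for p, targets in cites.items():
            if not targets:           # papers that cite nothing spread their
                targets = papers      # weight uniformly (a common convention)
            share = damping * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: "A" is cited, directly or indirectly, by the others.
toy = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
for paper, score in sorted(citation_rank(toy).items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```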
Let’s approach this as a computational problem:
We have a hidden value, namely the “quality of the researcher”, a computational agent who is trying to determine this value, and an adversary who is trying to fool the computational agent.
To be more precise, the quality of the researcher is an amount of computational power Q available to the researcher, which can be apportioned any way the researcher wishes between the pursuit of academic excellence and the pursuit of inflated rankings.
The computational agent or ranker is trying to measure this value Q within a certain margin of error and using a sublinear (indeed logarithmic) amount of computational resources, where “linear time” is the amount of computational resources needed by the ranker to exhaustively evaluate the academic record of the researcher.
Lastly, since this is worst case analysis we assume that we have a malicious adversary, i.e. a fully corrupt researcher, who is trying to fake the measure. However, as indicated above the adversary is not all powerful. It has available a limited amount of computational power Q.
The goal of the bibliometric ranker is to return a value no higher than Q, regardless of the strategy of the adversary, with high probability. Observe that so long as the measure returned by the ranker is no higher than Q regardless of the adversarial strategy, then it doesn’t pay to cheat.
In these terms, what Mihai suggests is a randomized algorithm to foil the adversary.
Ugo’s observation translates to a claim that no deterministic sublinear strategy by the ranker will ever succeed.
Paul’s second comment about entrenched metrics suggests that in this model random bits are very expensive.
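For concreteness, here is an entirely toy simulation of this setup (the record encoding, the sample size, and the notion of a “genuine” item are all invented for illustration): the ranker inspects only a polylogarithmic random sample of the record and scales up, so padding the record with inflated items does not raise the estimate in expectation:

```python
import math
import random

# Toy model: the researcher's record is a list of items, each either genuine
# (worth 1 unit of quality) or inflated (worth 0 once actually checked).
# The ranker examines only a polylog-size random sample and scales up.
# All numbers and conventions here are made up purely for illustration.

def sampled_quality(record, rng=random):
    n = len(record)
    sample_size = min(n, max(1, math.ceil(math.log2(n + 1)) ** 2))
    sample = rng.sample(record, sample_size)
    genuine_fraction = sum(sample) / sample_size
    return genuine_fraction * n   # estimate of the total genuine quality Q

# Hypothetical adversary: pads 40 genuine items with 60 inflated ones.
record = [1] * 40 + [0] * 60
random.shuffle(record)
print(sampled_quality(record))    # close to 40 in expectation
```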
An important issue is ‘non-classical/non-official’ contributions. Writing blogs, commenting on blog posts, asking and answering questions on math/stack/whatever-overflow, putting source code on the web for download, participating in a polymath project, etc. can all be very valuable contributions which, as far as I know, are currently pretty much ignored in hiring/tenure/funding decisions. Any evaluation system (whether by authorities or by some mechanical metric) should take these into consideration; this seems nontrivial, as new forms of contribution keep appearing.
Regarding some details:
“Measure quality, not quantity: Do not count papers, count citations. Do not count number of faculty members, count number of award-winning faculty members.”
Well, quantity is also important, and some measures that attempt to factor out quantity are very problematic (e.g. average citations per paper is a bad parameter; a better one would be the average citation count of the top 10 papers).
“Measure globally: There is no reason why excellence of a computer scientist, philosopher, or almost any other researcher should be judged differently in different institutes or different countries.”
Putting aside academic research that is very local by nature and can still be excellent and very necessary, there are different legitimate, and sometimes good, fashions and trends in different places.
“obviously they share various common human ethnic/religious/gender biases — biases absent from non-viciously-chosen bibliometric measures.”
Not at all: these types of biases may be present in many of the bibliometric parameters, and you will not even have a way to factor them out.
“The charm of bibliometric/numeric measurements in comparison with authority-based ones is that it is harder to corrupt them or politically bias them, as the latter very often are.”
The large noise-sensitivity of some of the measures, plus random and arbitrary elements, might make the bibliometric method easier to corrupt and may lead to spontaneous corruptions of various types.
(This was my post; I have no idea why WordPress did not recognize me…)
One thing that I have to agree with you on (and even more strongly than you are implying) is the silliness of the citations/paper measure. I don’t see any reason why any measure of excellence should be decreasing with quantity. I don’t think that it should increase with quantity either (as paper counting makes it do), but making it decreasing is strange. (I can see scenarios where citations/person is reasonable, but even that should be done very carefully; e.g., it was one of the major factors in why the recent NRC rankings were a mockery.)
Except for the citations/paper point, it seems that you (as well as most other replies) are engaging in the old debate of for/against bibliometrics. My whole point is that we should leave that debate behind and focus on how to do bibliometrics as well as possible.
Dear Noam, two items were about the specific suggestions and two items were about the general argument. If you want to have a bibliometric system, you have to take into account biases that are built in and tend to be magnified by such a system.
It is not clear what would be the rationale for taxpayers or politicians or other people who support science to move away from a traditional method of evaluation to a bibliometric-based method that usually cannot be understood. So I doubt that this is the main reason for such a trend.
However, it is clear why this method is much better for us: it (partially) automates the part of our professional duties most of us like the least, and it does not decrease our ability to manipulate and bias when we really want to.
>what would be the rationale for taxpayers or
>politicians or other people who support science to
> move away from a traditional method of evaluation
> to bibliometric-based method?
Well, it is a matter of trust and transparency. Just think of yourself sitting in some university-level committee that needs to choose whom to hire among the top-rated candidates of departments A, B, and C (all areas in equal need of growing). Very often all three departments are exaggerating the quality of “their” candidate. Won’t you want some less-manipulable indicators of excellence? Not surprisingly, once a professor becomes president or provost he usually starts liking bibliometrics, at least within his university…
> it is clear why this method is much better for us
We hate objective evaluations simply because everyone hates being evaluated. We do not like that our own research is evaluated nor do we like our judgments to be evaluated. We are just normal people: it’s not that we deliberately want to manipulate anything — we just want to go with our biases without questioning them.
Do you know of any tool that would count the papers and citations even remotely correctly for a person with non-ASCII characters in his name?
I want to dispute almost everything you say.
The elephant in the room is the fact that some subjects simply aren’t as important as others. IMO the study of things like English literature (as studied in the English departments) is, like chess, an entertaining pastime that colleges might equally well choose to offer electives in for student entertainment; such subjects aren’t even valid academic disciplines, let alone useful ones. Oh yes, a few hundred years ago they were essential so college graduates had a common set of cultural references to draw on, but that cause is now better served by watching The Simpsons than reading Shakespeare.
More importantly there is no compelling sense in which such disciplines can be said to build atop prior knowledge. The same can also be said about areas like theology. It might be the most important thing in the world but there is no convincing sense that our knowledge grows year by year.
Thus I would place at the very top of the list the extent to which the discipline/area is able to build upon past work, as without that property there is no reason to expect greater benefits from the subject in 10 years than now. Such a clear statement of principle would also have the advantage of pushing valuable areas of study (parts of analytic philosophy) to cast off the meaningless (continental philosophy) or already-solved (all the people who figured out the surprise quiz paradox should stop publishing on it) parts of the discipline.
This might be information that can be partially gleaned from citation databases, but that doesn’t help us deal with the relative usefulness of genuinely progressing academic programs. For instance, while one could develop detailed theories and models of medieval war that genuinely build on prior research, it’s probably just not as important as, say, working out more efficient solar cells.
In the end, however, it is mathematics which simply resists any attempt to assign relative long-run importance to different equally elegant/extensible areas. The other sciences achieve their utility in relatively short time periods, letting us form reasonable hypotheses about what will and won’t be useful, but the immense time scales that can pass between the discovery and application of various mathematical theories have yet to provide us with sufficient data for good predictions.
Oh, as to why I dispute almost everything you say: it’s for the simple reason that it’s entirely self-referential.
The point is that if a critical mass of English profs get together and start publishing papers on whether various novels (The Great Gatsby) are actually references to homosexuality, it will pass every count-based type of test even if it’s totally useless.
This is okay because we have no need of a global index of research value. We need two different things. We need a way for researchers to find articles they want to read without paging through a bunch of junk and we need a way of allocating funding.
Finding papers will eventually be done by amazon/netflix style recommendation engines and it simply doesn’t matter that different people may not have the same articles flagged as important.
Handing out money is done by many institutions (ensuring that we won’t accidentally cut off something of value entirely), and these institutions almost invariably first allocate money among the various kinds of research and then let members of that specialty decide who deserves the money. There is no real need for citation counting anywhere in this process except when hiring committees get too lazy to skim the papers (and that is usually for low-level positions where quality can be inferred from advisor/recs). Citation counts can’t help because all they do is reflect the judgements of the very people making the hiring/funding decisions in the first place.
Citation counts are all about prestige for universities and very very rough estimates and should be mostly forgotten about.