Academics fooled by computer gibberish

[Image: 1890s collectors’ cards featuring events in ballooning history / Library of Congress]

Here is what happened when some researchers started using a bafflegab generator in earnest. Recently, a major science publisher had to withdraw more than 120 papers that were just gibberish (random, grammatically correct noise):

Over the past two years, computer scientist Cyril Labbé of Joseph Fourier University in Grenoble, France, has catalogued computer-generated papers that made it into more than 30 published conference proceedings between 2008 and 2013. Sixteen appeared in publications by Springer, which is headquartered in Heidelberg, Germany, and more than 100 were published by the Institute of Electrical and Electronics Engineers (IEEE), based in New York. Both publishers, which were privately informed by Labbé, say that they are now removing the papers.
Let me be clear about this. The problem is not that the papers sound like gibberish to a person unfamiliar with the field. They are gibberish.
Recall the bafflegab generators highlighted last Wednesday? A group of computer nerds produced the SCIgen fake paper generator, using a similar system, in order to conduct a test.
A physicist friend, Rob Sheldon, offers a lay-friendly explanation: This type of sentence generator is called a “mad-libs” template, that is, a template (often used in education) where the sentence structure itself is grammatical but the words are chosen by the user from a suitable list. In this case,
… blanks are left for “scientific_adjective”, “noun-for-process”, etc. Glossaries of 50 or 100 words are supplied for these adjectives and nouns, and then the paper is constructed by filling the blanks randomly. So the grammar is correct, even the logic is correct, it is just that the content is made up. The code for this generator was made by three grad students at MIT in 2005, and originally the blanks were all “computer-science” words. The references are constructed likewise.
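To make the mechanics concrete, here is a minimal sketch (in Python) of the fill-in-the-blanks idea Sheldon describes. The template and word lists below are invented for illustration; they are not SCIgen’s actual grammar or vocabulary.

```python
import random

# Toy word lists standing in for the "scientific_adjective", "noun-for-process",
# etc. glossaries Sheldon mentions. These are made-up examples, not SCIgen's.
ADJECTIVES = ["stochastic", "heterogeneous", "ubiquitous", "scalable", "empathic"]
NOUNS = ["methodology", "framework", "algorithm", "paradigm", "archetype"]
PROCESSES = ["refinement", "deployment", "simulation", "visualization"]

# One fixed template: the grammar is supplied by the template itself,
# and only the blanks vary.
TEMPLATE = ("We present a {adj1} {noun1} for the {proc} of the {adj2} {noun2}, "
            "and demonstrate that our {noun1} outperforms existing approaches.")

def fill_template():
    """Return one grammatically correct but meaningless sentence."""
    return TEMPLATE.format(
        adj1=random.choice(ADJECTIVES),
        noun1=random.choice(NOUNS),
        proc=random.choice(PROCESSES),
        adj2=random.choice(ADJECTIVES),
        noun2=random.choice(NOUNS),
    )

if __name__ == "__main__":
    for _ in range(3):
        print(fill_template())
```

Each run prints fresh, fluent-sounding nonsense; nothing in the code knows or cares what the words mean.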
The MIT grad students hoped to demonstrate that many computer science meetings existed to make money off registrations, and that the quality of the papers later submitted for publication was irrelevant. Then others started using the system. Simply getting a paper into an esteemed journal advances one’s career, and these mad-libs papers can help raise a publication count. As Sheldon notes,
What is scary is that many of these journals that accepted the papers – IEEE and Springer-Verlag – are respected and “peer reviewed” journals.
And
The Labbé paper showed that you could also fool the Google Scholar metrics with what is called a “quote farm”. Google tracks how many people quote you, and follows the chain of quotes a couple of references backward. So Labbé and his wife created 100 SCIgen papers, carefully putting the other 99 papers in the references of each. They didn’t need to publish them, because Google Scholar just pulls them off the internet. Thus a closed universe of self-quoting papers was created. When the Google metrics hit this, they ran around in circles, finding that “Dr Antkare” was being quoted by everyone!
Labbé also discovered that when they used the “More Like This One” feature on the search tools developed by Google or Nature or whoever, they could feed in SCIgen papers they had made and find numerous others in the literature! Over 120 papers were found this way, which they then dutifully reported to the publishers as machine generated.
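For readers who want to see why a naive citation count falls for this, here is a toy sketch (again in Python) of such a closed, self-quoting universe. The paper names and the simple counting below are illustrative assumptions; this is not Labbé’s actual data, and Google Scholar’s real ranking is more elaborate, but the effect is the same.

```python
# A toy "quote farm": 100 fake papers, each citing the other 99.
N_PAPERS = 100

# paper id -> list of papers it cites (everything in the farm except itself)
papers = {
    f"antkare_{i:03d}": [f"antkare_{j:03d}" for j in range(N_PAPERS) if j != i]
    for i in range(N_PAPERS)
}

# A crawler that simply counts incoming citations sees a heavily cited author,
# even though every citation originates inside the farm.
citation_count = {p: 0 for p in papers}
for refs in papers.values():
    for cited in refs:
        citation_count[cited] += 1

total = sum(citation_count.values())
print(f"Each paper is cited {citation_count['antkare_000']} times; "
      f"{total} citations in total, all from inside the farm.")
```

Running it reports 99 citations per paper and 9,900 overall, none of them coming from outside the closed set.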
One weakness of the SCIgen generator is its invariable sentence structure, but later generations of grad students can probably generate varied templates at random as well (a sketch of that idea appears after the quote below); that is just a question of time and motivation. Someone was also recently working on a nonsense-paper generator for high-energy particle physics (the snarXiv, a play on arXiv, a conventional repository for papers). Sheldon adds,
… the ease with which these papers can be created, combined with the apparent difficulty for reviewers to recognize them, doesn’t bode well for any of these fields. It looks like there are far more charlatans than we thought, or conversely, the economic benefits of securing tenure far outweigh the punishment for getting caught.
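Returning to the earlier point about varying the sentence structure itself: generators of this kind typically work from a context-free grammar rather than a single fixed template, so the structure can be randomized too. The sketch below uses a tiny invented grammar to show the idea; SCIgen’s real grammar is far larger.

```python
import random

# A tiny, invented context-free grammar. Uppercase symbols are expanded
# recursively; everything else is emitted as a plain word.
GRAMMAR = {
    "SENTENCE": [
        ["We", "VERB", "a", "ADJ", "NOUN", "."],
        ["Our", "NOUN", "VERB3", "a", "ADJ", "NOUN", ",", "which", "VERB3", "the", "NOUN", "."],
    ],
    "VERB": [["propose"], ["evaluate"], ["synthesize"]],
    "VERB3": [["enables"], ["refines"], ["obviates"]],
    "ADJ": [["pervasive"], ["metamorphic"], ["certifiable"]],
    "NOUN": [["framework"], ["heuristic"], ["testbed"]],
}

def expand(symbol):
    """Recursively expand a grammar symbol into a list of words."""
    if symbol not in GRAMMAR:           # terminal: return the word itself
        return [symbol]
    words = []
    for part in random.choice(GRAMMAR[symbol]):
        words.extend(expand(part))
    return words

def sentence():
    """Produce one sentence whose structure, not just its words, varies."""
    return " ".join(expand("SENTENCE")).replace(" ,", ",").replace(" .", ".")

if __name__ == "__main__":
    for _ in range(3):
        print(sentence())
```

Adding more rules, and rules that call other rules, is all it takes to make the output structurally varied rather than formulaic.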
Does this mean that computers will run the world soon? Certainly not. As soon as an intelligent person who understands the field starts to read the paper, it becomes instantly obvious that it makes no sense. The computer is generating grammatically correct sentences, not meaningful information. But the reality is that few ever read the papers, and that is how the “authors” get away with it.
Technically, one could pull this type of scam without a computer (using jars of words written on tickets, for example). But a computer is much faster, easier, and neater. Also, using a computer rules out the risk that anyone might accidentally create meaning. Risk? Yes, because a meaningful sentence can express a mistaken view of a factual question. Gibberish, by contrast, cannot be mistaken. In many ways, it is safer.
Seriously, when the idea that the universe is inherently meaningless migrates from the arts to the sciences, it produces not existentialist art and literature but … random noise posing as meaning. See also: Why would anyone want to understand information theory? For one thing, information theory helps us distinguish signal (meaning) from noise (in this context, what the computer is generating).
The day the computer does make the rules, you will feel really well advised. ;)

 

  Denyse O’Leary is a Canadian journalist, author, and blogger.
