Tuesday, 4 February 2014

The case for emancipating articles from their authors

In August 1795, the composer Joseph Haydn, on his way back from England, stopped at Passau, where he listened to a performance of one of his most famous works, the Seven Last Words of Christ. He had written this work in 1786, at the request of a Spanish canon, who wanted orchestral music to accompany meditation during church services. At Passau, however, the local Kapellmeister, Joseph Friebert, had added a choir, and made the work into a cantata. Haydn himself had already done a transcription for string quartet, and approved another transcription for piano -- but he had never thought about a cantata.

So, how did Haydn react, after his work was modified without his authorization?
Well, he found the result quite good, but thought he could do a bit better with the choral parts. So, he asked Friebert for his score, and started reworking it, modifying not only the choral parts, but also his own orchestral parts, and even adding a few bits -- including a new orchestral interlude, which was then acclaimed as one of his most perfect works. Meanwhile, the text of the oratorio, which was originally adapted by Friebert from contemporary poems, was further reworked by Gottfried van Swieten at Haydn's request, a reworking which sometimes involved directly copying pieces of the original poems.

Expecting unexpected contributions.

The story of Haydn's cantata illustrates the principle that you never know in advance who may be willing, or able, to make valuable contributions to a work. And of course this applies to scientific research. Pretty much any reader of an article will think of improvements, from formatting issues to the main ideas or connections with other works. Wikipedia and the Polymath project have shown what can be done when the public as a whole is allowed to contribute to a project, even when in the end only a few people make significant contributions to a specific piece of work.

However, in the case of scientific articles, the problem is that only the authors are allowed to contribute. A reader who thinks of an improvement has to contact the authors, and wait for them to do something. This is inefficient and time-consuming, when at all possible -- after publication in a scientific journal, articles are rarely modified. As a result, readers seldom bother to contact authors, and most of the improvements they can think of are never made. Having to go through authors is inefficient for journal reviewers too, and if the article is eventually rejected, the reviewers' work is entirely lost to the public.

Expertise is overrated.

Limiting authorship to, well, the authors, is especially wasteful, because many types of contributions could be made by non-experts.
A historian friend found a mistake in a formula in my thesis, and when secretaries used to type articles, they too sometimes found mistakes in formulas. A graduate student who struggled with an explanation can be in a better position to improve it than the authors themselves. Web robots might be able to find and correct mistakes by comparing many articles.

So, an important part of the work which goes into an article requires far fewer qualifications than the authors typically have. On the other hand, many academics are good at research and terrible at writing -- or more specifically, at writing in English. As long as they cannot efficiently obtain outside help, we will have to live with the low standards of clarity which currently prevail. And these low standards may contribute to the fact that so much of the published literature is wrong, useless, or worse.

Moreover, good articles still have readers long after they are published, and hindsight suggests improvements. The authors may not be willing, or able, to act as curators of their articles for an unspecified duration. And why would they bother? The public is the only possible curator, the one whose motivation is proportional to the usefulness of the work to be done.

What about wikification?

I have been assuming that scientific articles are worth being improved. But is this true? After all, we currently have an inflationary bubble where the numbers of articles and citations grow, while the substance of articles and their life expectancy shrink. However, the idea is not to improve all articles, but to let the public decide which ones deserve extra work.

Another objection to improving articles is that it might be more profitable to invest effort in wikifying scientific knowledge. However, improving existing texts is much easier than writing new ones for a different medium. And improving articles would be a very useful transitional step towards open collaboration and the wikifying of science. 

Technical issues.

Before modifying articles, we may want to comment and to annotate them. A useful infrastructure for gathering comments on articles is the Selected Papers Network. Hopefully the publicly available comments will one day include the reports from journals' reviewers.

To go further and modify articles, we need to overcome some legal hurdles. We do not live in Haydn's time any more, and inventions such as copyright automatically kick in when a text is made public. To make collaborative improvements legally possible, the authors need to explicitly use an appropriate Creative Commons license.

Then, how do we prevent edit wars, allow for modifications to be undone, track all contributions, and allow for several versions of a text to coexist, including an authors-approved canonical version? This technical problem was solved a long time ago, and one solution is called GitHub. GitHub is so far mainly used for collaborative computer code development, but the underlying version control system, Git, can work on any text file.

How it might work.

Assume the authors write an initial version A1. A kind soul finds some typographical errors, and writes version B where they are corrected. Independently, another reader thinks he can clarify some explanations, and writes version C. A third reader adds references to his own articles on the subject, in version D. Meanwhile, the authors themselves make improvements, and produce version A2. They notice the other versions, and merge their version A2 with versions B and C, while leaving out version D's modifications, which they don't like. This yields version A3. The writer of version D merges it with version A3, and obtains version E which incorporates all proposed changes. The resulting tree of versions looks as follows:

[Figure: tree of versions, with B, C, D and A2 branching from A1, A3 merging A2, B and C, and E merging A3 and D.]

It is not clear how the contributors who are not the original authors should be acknowledged. In any case, it is easy to know who did what.
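
To make the bookkeeping concrete, here is a minimal Python sketch of the version history in the scenario above. It is only a toy model, not Git itself: the Version class, the ancestors and contributors helpers, and the labels "reader 1" to "reader 3" are all hypothetical names introduced for illustration. Each version records who wrote it and which versions it was built from, and that is enough to work out whose changes ended up in any given version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """One version of the article (hypothetical model, not a Git object)."""
    name: str
    author: str
    parents: tuple = ()  # the versions this one was derived or merged from


def ancestors(version):
    """Return {name: Version} for the version and everything merged into it."""
    seen = {}
    stack = [version]
    while stack:
        current = stack.pop()
        if current.name not in seen:
            seen[current.name] = current
            stack.extend(current.parents)
    return seen


def contributors(version):
    """Return the sorted list of people whose work is part of this version."""
    return sorted({v.author for v in ancestors(version).values()})


# The scenario from the text. Reader labels are assumptions for illustration.
a1 = Version("A1", "authors")
b = Version("B", "reader 1", (a1,))    # typo fixes
c = Version("C", "reader 2", (a1,))    # clearer explanations
d = Version("D", "reader 3", (a1,))    # extra references
a2 = Version("A2", "authors", (a1,))   # the authors' own improvements
a3 = Version("A3", "authors", (a2, b, c))  # merge that leaves out D
e = Version("E", "reader 3", (a3, d))      # merge that includes everything

if __name__ == "__main__":
    for v in (a3, e):
        print(v.name, "includes", sorted(ancestors(v)), "by", contributors(v))
```

In this model, version A3 contains the work of the authors and of the first two readers but not version D, while version E contains every version. A real version control system tracks the same information at the level of individual lines of text.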

I plan to run an experiment in a few weeks with my next article, which will be a review article on conformal field theory. The plan is first to write the article in LaTeX and put it on arXiv, as is usually done in my field of research. The experimental part is to have the article on GitHub as well, and to put it in the public domain. It is not clear to me whether similar experiments have been done before, and whether the "GitHub / public domain" combination is the best choice of technical tools.

Acknowledgement.

I am grateful to Louise Ribault for drawing my attention to Andreas Friesenhagen's text on the Seven Last Words of Christ, which I summarized at the beginning of this post, and for stylistic suggestions.