Dear reader, I am aware that I have been a poor correspondent in recent weeks, but in truth I have been doing something I should have done long ago: gaining some experience of AI companies, talking to their potential customers and reading a book. Let's start at the end and work backwards.
The book that has eaten the last week of my life is Edward Wilson-Lee’s fine new publication, The Catalogue of Shipwrecked Books, which describes the eventful life of Christopher Columbus’ illegitimate son, Hernando, and his attempts to build a universal library of human knowledge. Hernando collected printed works, including pamphlets and short works, in an age when many scholars still regarded all print as meretricious rubbish. He built a catalogue of his collection, and then realised that he could not search it effectively unless he knew what was in the books, so started compiling summaries – epitomes – and then subject indexing, as well as inventing hieroglyphs to describe the physical properties. In other words, in the 1520s in Seville he built an elaborate metadata environment, but was eventually defeated by the avalanche of new books pouring out of the presses of Venice and Nuremberg and Paris. Wilson-Lee very properly draws many parallels with the early days of the Internet and the Web.
As I closed this wonderful book, my mind went back to an MIT Media Lab talk in 1985 given by Marvin Minsky. We need reminding how long the central ideas of AI have been with us. At the end of his talk, the Father of AI kindly took questions, and a tame librarian in the front row asked “Professor, if you were looking back from some inconceivably distant date, like, say, 2020, what would surprise you that you have in 2020 but which we do not have now?”. After a thoughtful moment, the great man replied “Well, I guess that I would praise your wonderful libraries, but still be surprised that none of the books spoke to each other”. At that he left the room, but from then on the idea of books interrogating books, updating each other and creating fresh metadata and then fresh knowledge in the process of interaction has been part of my own Turing test. So I find it easy to say that we do not have much AI in what we call the information industry. We have a meaningless PR AI, a sort of magic dust we sprinkle liberally (AI-enhanced, AI-driven, AI-enabled etc) but few things pass the “books speaking to books and realising things not known before” test.
And yet we can and we will. The key questions are, however: will current knowledge ownership permit this without a struggle, and will there be a dispute over the ownership of the results of these interactions? This battle is already shaping up in academic and commercial research, so it was dispiriting to find when talking to AI companies that it seems there is really no business model in place yet enabling co-operation. Partly this is a problem of perception. Owners and publishers see the AI players as technicians adding another tier of value under contract – and then going away again. The AI software developers see themselves as partners, developing an entirely new generation of knowledge engine. And neither of them will really get anywhere until we all begin to accept the implications of the fact that no one, not even Elsevier, has enough stuff in one place to make it work at scale. And while one can imagine real AI in broad niches – Life Sciences – the same still applies. And if we try it in narrow niches, how do we know that we have fully covered the crossovers into other disciplines which have been so illuminating for researchers in this generation? In our agriscience intelligent system, how much do we include on food packaging, or consumer market research, or plant diseases, or pricing data?
So what happens next? In the short term it is easy to envisage branded AI – Elsevier AI, Springer Nature AI? I am not sure where this gets us. In the medium term I certainly hope to see some data sharing efforts to invest in AI partnerships and licence data across the face of the industry. It is true that there are some neutral players – Clarivate Analytics, for example, and in some ways Digital Science – who stand outside the knowledge production cycle and have hugely valuable metadata collections. They could be a vital building block in joint ventures with AI players, but their coverage is still narrow, and in the course of the last month I even heard a publisher say “I don’t know why we let Clarivate use our data – we don’t get anything for it!”.
Of course, unless we share our data we are not going to get anywhere. And given the EU Parliament rejection of data metering and enhanced copyright protection last week, all these markets are wide open for massive external problem solving – who remembers Google Scholar? The solution is clear – we need a collaborative model for data licensing and joint ownership of AI initiatives. We have to ensure that data software entrepreneurs get a payback and that investment and data licensing show proper returns, just as Hernando rewarded the booksellers who collected his volumes all across Europe. In a networked world collaboration is often said to be the natural way of working. It is probably the only way that AI can be fully implemented by the scholarly communications world. Hernando died knowing his great scheme had failed. AI will succeed if it shows real benefits to research and those who fund it. As it succeeds it will find other ways of sourcing knowledge if those who commercially control access today are not able to find a way of leading the charge, rather than being dragged along in its wake.
Originally published on davidworlock.com on 13th July 2018