The hidden complexity of data visualization
I don’t mind admitting that I’m in two minds about data visualization. Because just at the moment, with this particular subject, I think that’s the logical place to be.
You see, on the one hand – or perhaps I should say, in mind number one (which in this case is a Brian-Cox-enthusing-about-the-universe sort of mind) – the possibilities are fantastically exciting. But with my more practical head on, I see a lot of complexity and uncertainty with which we have yet to engage in any realistic and detailed way. Beyond the generalities, it’s clear there is thinking to be done.
It’s not that I don’t believe there really are opportunities here for publishers; I certainly do. It’s more that we haven’t yet properly sized those opportunities in any but the broadest terms. So here’s a brief look at the opportunities and the issues that will hopefully get you as excited (and as circumspect) as I am about data visualisation.
Reproducibility and reuse
Two important principles of the scholarly system come into play when looking at the opportunities for adding value through data visualization: reproducibility and reuse.
The move to make research data publicly accessible is all about reproducibility. This central tenet of the scientific method says that if Scientist A gets a certain result in an experiment, then Scientist B, if she follows the same methods, will get the same result. Similarly, if Economist A analyses a certain data set and draws a certain inference from it, then Economist B using the same analytical tools ought to be able to follow the ‘working out’ that led to that inference being drawn (though they might not agree on interpretation).
It’s all about having a traceable, tractable link between data and the conclusions drawn from that data. In experimental terms, the same inputs should always produce the same outputs.
The obvious problem with visualizations of data is that the link between the data and the visualization is not always very tractable. Connecting the actual raw data to the visualization can be a bit of a missing link – a potential point of weakness in the scholarly ‘audit trail’ if you like. Researcher A might have just made up the diagram in her paper with Photoshop (other graphics programs are available): how does researcher B know it is actually a diagram of the data?
This is part of the rationale for services such as Figshare, which provide not only repositories for data but also tools for visualisation. Where visualisations of data can be derived automatically, by turning the dials in the right direction, conceptually speaking, then you have a completely tractable link between the source data and your visualization. Job done.
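To make the idea of a tractable link a little more concrete, here is a minimal sketch in Python. It is not how Figshare or any other service works; the function names and the sample figures are made up purely for illustration. The chart is generated by code from the data alone and carries a fingerprint of that data, so Researcher B can regenerate it and check that the two match.

```python
import hashlib
import json


def data_fingerprint(rows):
    """Return a SHA-256 hex digest of the rows in a canonical JSON form."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def render_bar_chart(rows, width=40):
    """Render a plain-text bar chart from the rows, with no manual steps.

    Assumes at least one row and a positive maximum value.
    """
    peak = max(value for _, value in rows)
    lines = []
    for label, value in rows:
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<10} {bar} {value}")
    # Stamp the chart with the fingerprint of the data it was drawn from.
    lines.append(f"source sha256: {data_fingerprint(rows)}")
    return "\n".join(lines)


def chart_matches_data(chart_text, rows):
    """Regenerate the chart from the data and compare it with the one given."""
    return chart_text == render_bar_chart(rows)


# Hypothetical data set, invented for this example.
rows = [["2011", 12.0], ["2012", 18.0], ["2013", 24.0]]
chart = render_bar_chart(rows)
print(chart)
```

If Researcher B calls chart_matches_data(chart, rows) with the deposited data and gets True, the diagram really is a diagram of that data. Change a single value and the check fails.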
Slightly more contentious, though no less prominent in the wording of funder mandates concerning research data, is the question of reusability. This is one of the key drivers for including data in Open Access mandates.
In the first place research data is expensive to produce, and funders are not keen on giving people money to reinvent the wheel – i.e. to recreate data sets they have already paid someone else to create. Moving from the negative to the positive side of the balance sheet, reuse of data also holds the promise of new insights and discoveries.
Science abounds with examples of discoveries that the original researcher had no idea what to do with at first: it required someone else from a different field, with a different perspective, or someone ten years into the future with more advanced knowledge, to put that idea to practical use – essentially, reusing the same data for another purpose.
Funders hope that more reuse of data will serve to maximize the long-term value derived from their investments in funding.
How visualisation can help here is in providing tools for researchers to play with the data; something that will encourage reuse. And with the right type of tools, again, there is a clear and tractable link back to the data, enabling new publications around these new insights and discoveries.
Hunt the driver
So in aligning with the aims of funder mandates around these two principles, providing the right kind of tools for data visualization seems like a good thing for publishers to do. There are business imperatives here.
But are they actually imperatives – or just nice-to-haves? After all: mandates around data are fairly explicit about what researchers are supposed to do: they have to make their data available and accessible, but none of them mention anything about providing visualization tools, do they?
Not explicitly, no. However, I would argue that any requirement to maximise public access to data contains an implicit driver for data visualization. And here there is a paradox in my argument – because to explain why I think the driver is there, I also have to bring up the worrying complexity I mentioned earlier.
Comparing apples and figs
The fundamental problem here is that not all data is/are created equal. Even leaving aside the humanities and social sciences for the moment and looking across the many discrete areas of specialization within STM, we see a fantastic diversity in data types (it’s certainly not all about what you can do with Excel spreadsheets).
Even before we get to the issue about what file types people use to store data, there is the issue of standards for recording data. The issue being that there aren’t any, outside of individual disciplines. And even within subject areas, there are problems.
Take climate change, for instance, an area where the data is of vital public interest. There are standard ways of measuring temperatures, air pressure, wind speeds and directions; and those standards have been developed and have evolved over a very long time. But despite the fact that it is a very established field and one in which people have been thinking very carefully about how to standardise measurements, you can still take sets of climate data from different places around the world at different points in history and find that they are not compatible with each other, because of various subtleties about how they are measured – meaning that you end up comparing apples with pears (or even figs). This would be a problem for any set of generic tools you tried to develop to visualise climate data.
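As a rough illustration of the simplest version of this problem, here is a short Python sketch. The station names and readings are invented, and the conversion table is an assumption made for the example. It only deals with differing units, which is the easy part; the subtler differences in how and when measurements were taken cannot be fixed by a lookup table.

```python
# Hypothetical conversion table: each function maps a reading to degrees Celsius.
TO_CELSIUS = {
    "C": lambda v: v,
    "F": lambda v: (v - 32.0) * 5.0 / 9.0,
    "K": lambda v: v - 273.15,
}


def normalise(readings):
    """Convert (station, value, unit) readings to Celsius, or fail loudly."""
    comparable = []
    for station, value, unit in readings:
        if unit not in TO_CELSIUS:
            # Refuse to guess: an unknown unit means the data cannot be compared.
            raise ValueError(f"{station}: unknown unit {unit!r}, cannot compare")
        comparable.append((station, round(TO_CELSIUS[unit](value), 2)))
    return comparable


# Invented readings from three imaginary stations.
readings = [
    ("Station A", 59.0, "F"),
    ("Station B", 15.0, "C"),
    ("Station C", 288.15, "K"),
]
comparable = normalise(readings)
print(comparable)
```

The point of the sketch is the ValueError: a generic visualisation tool that meets data it does not understand should stop rather than quietly plot apples next to figs.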
And climate, as I say, is a well-established subject area. As you move towards the cutting edge of science, the problem becomes more acute, because the data you are generating is itself so cutting edge, so unprecedented in its shape and sometimes in its profusion, that the usual means of recording data won’t cope, and new tools and standards have to be invented on the fly.
The Large Hadron Collider generates petabytes of data with every experiment, most of which is simply thrown away because they can’t even store it (obviously, they keep the stuff that looks interesting). The scientists at CERN have had to design their own file formats to deal with it, because they are facing a new problem: how do you capture all the things that are happening inside a particle accelerator?
We’re all in some way familiar with this phenomenon because it has a name – Big Data – and a high media profile; but in a sense, Big Data is just a subset of a much larger phenomenon, which is the fantastic profusion, diversity and novelty of the data produced now within almost every sphere of human activity touched by digital – including not only academic enquiry but also sport, finance, weather, social media, agriculture … you name it.
This data comes in all shapes and sizes. And once the stuff is generated, there is immediately the problem of how to record, store, retrieve, analyse and visualize it. And here’s bad news (if it really is news): there is no set of generic standards, formats, file types and tools that can handle it all. And because science is all about enquiry and novelty, it can’t wait around for the tools to arrive, for the standard to be set. So the problem is being solved on a subject-by-subject and even case-by-case basis. On the fly. Here’s a new type of data, let’s write a new piece of software to store it with. Pretty soon you have 400 different file types within a single narrow academic specialization.
The software is the data
What we’re seeing, on the cutting edges of science, is the intensification of an effect that is also present (though not much acknowledged) in the mainstream of scholarly enquiry. The Reinhart/Rogoff takedown in Economics showed that not only the data has to be present for reproducibility, but also the software with which it was analysed – in this case a poorly formatted Excel spreadsheet. In fact, I’d challenge readers to conceive of a data set that doesn’t require some sort of software ‘container’ or ‘wrapper’ to make it accessible.
Looked at from a certain point of view, that of the machines, a spreadsheet is no more nor less than a data visualization tool: who other than machines ever really looks at ‘raw’ data?
Which brings me to my point about funder mandates and data. The use of a data repository, it would seem, is a necessary condition for compliance – but is it a sufficient one?
There are around a thousand data repositories in the world, but very few of them are like Figshare in offering visualisation tools. In the main, what they do is take in your data file and look after it, but not much apart from that. They will probably catalogue it to show data and provenance, capture standard metadata, and from there the job is all about preservation and governance. But apart from that, the contents of the file are a black box to the repository. There might be 40,000 different data points in there that are all in 12 dimensions, with 13 different types of units being measured – but who knows that? And the danger is, with any data set that does not use a standard piece of generic software, it remains a black box to other researchers too.
So OK, you’ve drawn up a data plan and you’ve deposited your data, fair enough. You’ve ticked the box. But have you made it accessible? That is a whole other area of difficulty. When you look at it more closely, you find that making data accessible is intimately bound up with the ability to visualize data.
Sizing the orchard
But let’s not seduce ourselves with the glamour of gloom. At a subject level, these problems are solvable. We have seen examples of startups that are taming visualisation problems as they arise, in fields such as microscopy (Glencoe, https://glencoesoftware.com/index.html) and image processing (IPOL). Publishers with a big presence in a particular area of specialization are well-placed to take on such data visualization challenges, with the participation of their researcher communities, who might welcome the chance to spend their APCs with someone who can provide better tools than they can come up with themselves.
And in the wider context, there may well be a large enough commonality of need across disciplines to support the development of generic tools that will satisfy the great majority of the need. Figshare, Glencoe and others like them might be nibbling around the edges of the problem at the moment; grazing the low-hanging fruit – but once we have charted the full extent of the orchard, it might well be the case that a very sizeable proportion of the harvest hangs fairly low.
It is undoubtedly also the case, however, that when it comes to data visualisation, one size will not fit all.
There is a lot of uncertainty in the area of data and data visualization at the moment, and where there is uncertainty there is risk. But where there is risk – as I think I have probably pointed out before on this blog – there is often a big potential for reward.