Versioning the Cloud

London Legion No. 33 Higman

It always amused me that when I searched for myself in Google Images (go on, everyone does it...) I'd find this unrepresentative picture of me from the days when I played ice hockey.

But then recently it disappeared. The team website got revamped, and - since I haven't been in the team for years - my dodgy picture vanished.

A bit of experimentation with the Wayback Machine, though, turned up lots of content from the previous incarnations of the team site. The Wayback Machine has made a valiant effort to record snapshots of the entire Internet, going back to 1996. It's a bit hit-and-miss, but there's enough there that you can retrieve long-deleted content, if you know what you're looking for.

There's a wider problem, though, that's getting some attention at the moment - how do we preserve the Internet (or a snapshot at any given point in time) so that future enquiries about the "way things were" can be answered? (Here's Lynne Brindley of the British Library talking about Digital Heritage).

A more subtle problem, for semantic web enthusiasts, is this: if we're now working with a web of data, rather than a web of documents, how can we tell what the exact version was of every piece of data that contributed to the results of a query? Data may be collated from any number of sources, some of which may be more reliable than others, so that - even on a given day - you may get varying results for the same query, depending on precisely which bits of information were available at that instant.
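To make that concrete, here's a rough Python sketch of one way a query could carry its own provenance: each fact is stored with the source it came from and a version label, and the query hands back the set of (source, version) pairs that fed into the answer. Everything in it is made up for illustration - the `Statement` class, the `query` function, the source names and the version labels are my own assumptions, not part of any real semantic web toolkit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Statement:
    """One subject-predicate-object fact, plus where it came from."""
    subject: str
    predicate: str
    obj: str
    source: str    # hypothetical identifier for the publishing source
    version: str   # hypothetical version label attached by that source


def query(statements, predicate):
    """Return the matching facts together with a provenance record."""
    matches = [s for s in statements if s.predicate == predicate]
    provenance = {
        # When the question was asked (UTC, ISO 8601).
        "asked_at": datetime.now(timezone.utc).isoformat(),
        # Every distinct (source, version) pair that contributed a fact.
        "versions": sorted({(s.source, s.version) for s in matches}),
    }
    return matches, provenance


# Made-up data: two versions of the same source disagree about one fact.
store = [
    Statement("player:33", "playsFor", "team:legion", "team-site", "2006-03"),
    Statement("player:33", "playsFor", "team:none", "team-site", "2009-01"),
    Statement("player:33", "shirtNumber", "33", "fan-wiki", "r17"),
]

results, provenance = query(store, "playsFor")
print(provenance["versions"])
```

Keeping the provenance record alongside the results means that a different answer tomorrow can at least be traced back to which sources, and which versions of them, were in play. It does nothing, of course, to guarantee that those old versions are still retrievable.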

And the ontologies that describe the data may also change, so that the relationships between bits of data will have mutated. (See SemVersion for some thoughts on the way ontologies could be versioned).

If saving snapshots of the web of documents is difficult, then doing the same for the web of data will be an order of magnitude harder.

Just in case, then, I'm preserving my dodgy old picture here, for all time (until this blog gets deleted, anyway).