Without open standards in data, we can never be totally sure of our data’s integrity and long-term viability. Here, Senior Engineering Manager Roger Coram explores the fragility of data, the risks of improper storage and the practical importance of maintaining open standards.
The fragility of data
“Data” and all things adjacent might be our stock-in-trade (it’s right there in the name, after all), but we don’t talk about it so much just for fun.
Although we definitely do that too.
Data is more than important; it’s vital. It is the means by which much of the impetus of the modern world is driven. Analytics and new-fangled ideas like machine learning are no longer the exclusive domain of enterprises with huge staff rosters and equally impressive bottom lines.
It is, however, incredibly fragile.
Data points are only useful insofar as they can be reliably accessed; the “integrity” and “availability” of the information security CIA triumvirate come into play here too.
That reliability depends on two things: the context (a hugely overloaded term which I’m taking to mean a multitude of things—more on that momentarily) in which they are stored, and the means by which they can be accessed.
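To make “integrity” a little more concrete: a common safeguard is to record a checksum alongside data at write time and verify it again on retrieval, so that silent corruption is caught rather than quietly propagated. A minimal sketch in Python (the payload and helper names are purely illustrative):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a blob of data."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    """Recompute the digest and compare before trusting the bytes."""
    return sha256_of(data) == expected

# At write time: persist the payload and record its digest alongside it.
payload = b"2024-01-01,reading,42.0\n"
recorded_digest = sha256_of(payload)

# At read time: an intact payload verifies; a flipped bit does not.
assert verify(payload, recorded_digest)
assert not verify(payload + b"x", recorded_digest)
```

The same idea scales up from single files to whole datasets (per-block checksums are exactly what formats and filesystems with integrity guarantees do internally).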
I take “context” here to mean the huge swathe of things that encompass data storage:
- the storage medium (always a fun topic at parties—who can forget the data integrity issues of early consumer SSDs?)
- the storage abstractions on top of that (RAID levels are another party-conversation starter)
- the file-format of choice (the abstraction by which we professional bit-flippers may coerce information into something meaningful)
- and—arguably most important and all-too-often forgotten—the legal circumstances in which the data arrived, often affecting those earlier choices and even the geography of one’s decisions.
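As a small aside on the file-format point above: many formats announce themselves with a few “magic” bytes at the start of the file, which is one of the ways tooling coerces raw bits into something meaningful. A minimal sketch in Python (the signature table is a small, illustrative subset):

```python
# A few well-known file signatures ("magic bytes"); an illustrative subset only.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive (also .docx, .xlsx, ...)",
    b"PAR1": "Apache Parquet",
    b"%PDF": "PDF document",
}

def sniff(header: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    for signature, name in MAGIC.items():
        if header.startswith(signature):
            return name
    return "unknown"

print(sniff(b"PAR1\x15\x04..."))  # prints "Apache Parquet"
```

Lose the documentation that maps those bytes to a structure, though, and the file is just an opaque blob—which is rather the point of the rest of this piece.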
I’m going to take a slight diversion by way of illustration: it was a poor choice of mine to write a draft of this piece in the online version of Microsoft Word.
For some reason—whether it’s an oddity in the asynchronous saving or some bizarre combination of keystrokes on my part—it’s taken to deleting or rearranging sections of this writing periodically.
Data loss. Data corruption. Me having to depend on the fragility of my memory: there’s a reason I write things down…
Risks of improper storage
I’m going to focus on considerations around file-formats: specifically, understanding the term as the means by which a “content type” (e.g., audio, video, text) may be stored to serve particular needs, such as rapid access versus long-term storage.
Somewhat generalising, a file format’s provenance is usually one of the following:
- proprietary: created by an organisation to meet a need. Often pertaining to a specific service offering, with the details of the format itself—the means by which content stored in this format may be persisted and retrieved—protected by some combination of licensing, patents, etc. and generally inaccessible to the wider public.
- open: created under various circumstances (sometimes by an organisation, sometimes an individual, or some combination thereof), usually for similar purposes, but with the key differentiator that the details of the format are accessible to (and often open to input and changes from) the wider public.
Similarly, there are two main considerations regarding the long-term viability of a file-format, regardless of its provenance:
- Obsolescence: as file-formats evolve, to add new functionality or adapt to changing requirements, older versions become obsolete or inaccessible altogether if backwards compatibility is not a consideration. Arguably, in the realm of Data Engineering, where some file-formats are considerably more recent developments than, say, that of your favourite WYSIWYG text-editor, the risks here are harder to quantify.
- Proliferation: more mature domains invariably have fewer active formats, as experience and support requirements lead to more normalised behaviours. Domains in active, rapid development, by contrast, invariably evolve a greater number of bespoke formats, each tailored to a more immediate, specific need.
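To illustrate why an openly documented format helps with long-term viability, here is a minimal Python sketch round-tripping records through CSV—whose grammar is openly described in RFC 4180—standing in for any openly specified format. Only the standard library is needed; any conforming implementation, now or decades hence, can read the result back:

```python
import csv
import io

rows = [
    {"sensor": "a1", "reading": "42.0"},
    {"sensor": "b2", "reading": "17.5"},
]

# Write to an openly specified format using only the standard library;
# no vendor tooling or support contract is required to get the data out again.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sensor", "reading"])
writer.writeheader()
writer.writerows(rows)

# Reading it back recovers exactly what was written.
recovered = list(csv.DictReader(io.StringIO(buffer.getvalue())))
assert recovered == rows
```

CSV is deliberately a modest example—richer open formats (Parquet, for instance) make the same guarantee through a published specification rather than sheer simplicity.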
The importance of Open Standards
The risks to the long-term viability of formats—and therefore the data they hold—aren’t quite as clear-cut as “open, good; proprietary, bad”, despite claims to the contrary.
A superior open format may fail to gain traction or support for related tooling. A proprietary format, used under the terms of a support contract, may fall by the wayside as the company responsible goes out of business, pushes its latest alternative or does not consider support/updates financially viable.
Here, though, is the crux of the matter: what happens should any of the above arise? What matters then is the ready availability of the standard: the specification that describes the format.
“Open” is something of a loaded term when used in proximity to anything pertaining to software.
In this context, I refer to a few characteristics which I would consider paramount: that the specification must be usable without restriction, that it must be readily accessible, and that its development and maintenance must allow for broad contribution.
Arguably that last one is not necessary for the above examples (i.e. for supporting data held in a given format, it matters little whether one had a say in the development of that format) but it is of vital importance to the long-term viability of a format.
Open development is also something that could help avoid the situation altogether: openly-developed standards stay relevant to the needs of a broader set of users.
Only if a standard, the format in which data are held and accessed, is sufficiently open can we be assured of the integrity and long-term viability of said data.
Without that, our data exist at ever-increasing risk of one day becoming inaccessible and unfit for the purpose for which they were intended.