Digital Archiving Part 4: The Data About Your Archive Data
by Ken Wood on Aug 23, 2012
One of the things that is frequently overlooked in archiving is the catalog of the data in the archive. In fact, I refer to the whole suite of data about data in an archive as metadata (not to overload an already overloaded term, but it can’t be helped). Metadata is the broad term that I use to label all of the pieces of information that describe what is in the archive, with the exception of the actual data itself. I use metadata to refer to the catalogs, POSIX metadata, metadata, custom metadata and search indexes. The way that “I” describe these aspects of metadata goes like this (from a file and object store perspective):
- Catalogs
- Summary information about one or more files, including location, names, light keywords, references, application, type, etc.
- POSIX Metadata*
- Data around a file: the filename, file path or path name, creation and modification dates and times, size, permissions and ownership, type, etc.
- Metadata
- Data about the data within a file, typically from its header, which could include pixel resolution (height x width), color palette, color depth, application, a magic number, bitrates, etc.
- Custom Metadata
- User- or machine-supplemented data, or other desired information about a file or the data within a file (e.g. associated weather conditions, location data**, geo-coordinates, hashes, camera type, comments, keywords and tags, system level data, thumbnails, other references, etc.)
- Search Indexes
- Keywords or other data components organized in a scannable or searchable arrangement, used for fast lookup and as a summary of content.
* Typically, this is the metadata with which most file data is associated
** Location data in this respect is system location such as a URL or pathname
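To make the distinction between these kinds of metadata concrete, here is a minimal Python sketch that gathers three of them for a single file. It is an illustration under my own assumptions, not a description of any particular archive product: the function names (`posix_metadata`, `header_metadata`, `custom_metadata`, `describe`), the record layout, and the sample tags are all hypothetical, and the "header" step reads only the leading magic bytes rather than parsing a real file format.

```python
import hashlib
import os
import stat
import tempfile
from datetime import datetime, timezone


def posix_metadata(path):
    """Data around a file, as reported by the file system."""
    st = os.stat(path)
    return {
        "path": os.path.abspath(path),
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        "permissions": stat.filemode(st.st_mode),
        "owner_uid": st.st_uid,
    }


def header_metadata(path, length=8):
    """Data about the data within a file: here, only the leading magic bytes."""
    with open(path, "rb") as fh:
        return {"magic_hex": fh.read(length).hex()}


def custom_metadata(path, tags):
    """User- or machine-supplied extras: a content hash plus free-form tags."""
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    return {"sha256": digest, "tags": sorted(tags)}


def describe(path, tags=()):
    """Bundle the three kinds of metadata into one record (hypothetical layout)."""
    return {
        "posix": posix_metadata(path),
        "header": header_metadata(path),
        "custom": custom_metadata(path, tags),
    }


if __name__ == "__main__":
    # Demo on a throwaway file that starts with the PNG signature bytes.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
        tmp.write(b"\x89PNG\r\n\x1a\n" + b"data")
    print(describe(tmp.name, tags=["camera:dslr", "weather:rain"]))
    os.unlink(tmp.name)
```

Note that none of these records contains the file's actual content; even the hash only identifies it. That is the sense in which all of this is data about the data.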
There may be some differences in opinion on the way I’ve defined these terms, but I like them and they are sufficient for going forward in this blog. The point is, there can potentially be a lot of data about data, without including the actual data in the overall count.
We all know that as data becomes less interesting, for any number of reasons such as age, usefulness or importance, the access activity of these files diminishes as well.
There are also several reasons for archiving data, such as data preservation (want to), compliance (have to), and getting unused data off of expensive primary storage systems and locating it somewhere more appropriate (either want to or have to, and need to). There are several terms thrown around to describe data that is no longer in use, but is deemed too important to delete entirely. My favorite is “long-tail data.” See the RED activity line in the chart below.
Long-tail data usually starts out active, especially at creation time, and depending on what the data is, can be active for a long or short period of time. This data then goes through a slightly active phase, less active than at the time of creation, which includes several references and maybe some minor updates. Finally, toward the end of this data’s usefulness, it enters the long-tail phase of its lifecycle, rarely being referenced, opened or used.
My last blog is a perfect example of an accelerated lifecycle for a piece of data. When I created it, I had it open and didn’t close it until I was done, or had to reboot my system. That file was active. I knew exactly where it was. It was listed at the top of my “Recently Opened” list. I never had to search for it. When I finished writing it, I reviewed it again and made some mild updates and changes. Then I emailed it for review and for posting on the blog site. It was done.
That blog is now officially long-tail data. I would consider it a short-term file with an accelerated lifecycle that lasted a couple of days at the most. Many files in enterprise projects can be active for months or years. Versioning is used to save the state of data that can become inactive, backups of data can be inactive, and so forth. But there is an interesting phenomenon about long-tail data.
Looking back at the chart, you’ll see an inverse relationship between the long-tail line and the BLUE line, which shows the activity of the data about this data, the metadata. (Of course, this is not a true logical inverse relationship.) The less active the data becomes, the more its metadata is referenced. If you think about how you use data, this should make sense. As I described in my blog file example, while the file is active its location is known and it appears on the recently opened list. However, the more inactive the file becomes, the more I need to search for it, either by browsing through the file system or through a full system search. The data about this data is always referenced, either in a search for other data or for this data itself. The data about this data becomes a part of every query from now on.
Now, scale this scenario up and out to an enterprise-level archive. Data is constantly being ingested. This reminds me of one of my favorite one-liners: “a true data archive only gets bigger.” Part of this ingestion could include cataloging, metadata extraction or full-blown indexing in order to make data in the archive findable at some level.
Then again, this assumes that this is a “seamless” system for lifecycle management where data flows automatically through these systems and that all storage tiers are part of the same namespace.
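As a rough illustration of cataloging and indexing at ingest time, the Python sketch below builds a small in-memory catalog and keyword index, then answers a search from that metadata alone without touching the archived content. Everything in it is an assumption made for the example: the `ArchiveIndex` class, its method names, the object ids and the archive paths are hypothetical, and a real archive would persist its index and handle far more than whitespace-separated keywords.

```python
import re
from collections import defaultdict


class ArchiveIndex:
    """Toy in-memory catalog plus inverted keyword index (hypothetical)."""

    def __init__(self):
        self.catalog = {}                      # object id -> summary record
        self.keyword_index = defaultdict(set)  # keyword -> set of object ids

    def ingest(self, object_id, location, text, tags=()):
        """Catalog one object and index its keywords at ingest time."""
        keywords = set(re.findall(r"[a-z0-9]+", text.lower()))
        keywords.update(tag.lower() for tag in tags)
        self.catalog[object_id] = {
            "location": location,
            "keywords": sorted(keywords),
        }
        for word in keywords:
            self.keyword_index[word].add(object_id)

    def search(self, *terms):
        """Return locations of objects matching every term, using metadata only."""
        if not terms:
            return []
        matches = None
        for term in terms:
            ids = self.keyword_index.get(term.lower(), set())
            matches = set(ids) if matches is None else matches & ids
        return sorted(self.catalog[i]["location"] for i in matches)


if __name__ == "__main__":
    index = ArchiveIndex()
    # Hypothetical objects and paths, for illustration only.
    index.ingest("obj-1", "/archive/2012/blog-part3.txt",
                 "Digital archiving blog draft", tags=["blog"])
    index.ingest("obj-2", "/archive/2012/q2-report.txt",
                 "Quarterly report draft")
    print(index.search("draft", "blog"))
```

Every search runs against `keyword_index` and `catalog`, never against the files themselves. The archived objects stay cold while the metadata is consulted on each query.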
There are archiving systems and data repositories that tend to be more active than long-tail data archives and that are purposely built to ingest data from the get-go. These archiving systems tend to be partitioned off from the rest of the operational and active systems and may not be part of the same overall “namespace” as the active data. Actually, the flow in this case is reversed, in that data is pulled from these repositories and staged into an operational state to be processed. Then again, searching for the right data to be “pulled” is also a big part of the system and process.
I’d love to hear from you. What’s your opinion on my use of these terms? How important is data about data in your environment?