Rethinking Unstructured Information
by Michael Hay on Jun 26, 2008
Over the past several years governmental organizations, commercial ventures and educational institutions have all been “waking up” to the fact that electronically stored information is something that society in general has to govern and protect. You can see the trend all around if you look. One example is the recent tussle over the standardization of Office Open XML (OOXML) versus the OpenDocument Format (ODF). Both of these standards are essentially after the same thing: a self-describing data format that puts the power of data ownership, structure and format in the hands of the data's owners, both today and far in the future. It is almost as if governments, end users, and corporations are all realizing at the same time that the millennia spent figuring out how to preserve, protect and ensure the authenticity of paper records must now urgently be applied to the digital world. In a sense, it is as if the human consciousness is waking up to the fact that we need to apply a disciplined approach to all the Electronically Stored Information (ESI) that’s lying around.
Another point of evidence comes from the body of court cases and new regulations, all of which put pressure on companies to look at their unstructured information with fresh eyes. I’m sure that everyone reading this knows about Enron, and perhaps even remembers Sarbanes-Oxley (SOX), which is effectively the United States government’s attempt to increase both accountability and transparency in the financial markets. Of course there is also the urban legend that Kenneth Lay was so guilty that he died of a self-induced heart attack before he could be sentenced (and if it wasn’t an urban legend, it sure is now). However, while dramatic and definitely in the news, SOX is “chump change” when compared to the Federal Rules of Civil Procedure (FRCP). My reason for stating this is both the ambiguity and the broad applicability of the FRCP. The rules apply not only to an organization’s main parts but also to its agents, subsidiaries, and affiliates, even those overseas. Further, they apply to any organization (public companies, private companies, and educational institutions) that can end up in Federal court for some reason. The only thing that the FRCP really says is that organizations have to be prepared for litigation. In practice, that boils down to a few things:
- A records retention policy is required and, more importantly, so is adherence to it
- The ability to quickly put items on “litigation hold” is required
- With the records retention policy in effect, it is also a good idea to be able to show that your organization is following the policy, for example through documented audits
In short, if you can say to yourself that you’ve documented what you are supposed to do, that you are doing it, and that you can prove it, you are in good shape.
Now, a lot has been written about the FRCP in 2006, 2007 and even now, and I’m sure by this time you’re saying, “What the? You’re not saying anything new here.” If you have come to that conclusion you’d be right. However, I was using this point to illustrate that we as a culture are now considering how to protect and preserve ESI from a real archival perspective. Essentially, we need to look at the paper world as an example of what not to do, learn the lessons and apply them to ESI.
I do want to point out something fairly unique here: when companies are forced by regulation to look at things, it can often lead to efficiencies that they had not thought of in the past. To get the unstructured world in order, I firmly believe that we will need to develop unstructured reporting tools, which don’t really exist today. Instead, a mishmash of content management systems has moved to center stage as the thing to fix this critical issue. Well, I think that approach is doomed from the beginning. One of the first things to consider: if content management could solve this problem, it already would have. The issue that I see with content management systems is that organizations have to think of all possible outcomes for workflow, structure, permissions and hierarchy a priori. Pointedly, that approach is flat wrong in two ways.
- The world of unstructured information is vast and very messy, and people largely don’t know or understand what they have, so how can an upfront strategy work if scale and scope aren’t really known?
- Structure in messy information is really an emergent property of people interacting with one another and with their information, not the result of a premeditated action.

I do want to provide an example of where a point-in-time observation can lead to incorrect assumptions about premeditation: ants gathering food. If one were to look at a fully formed ant trail, where the ants were clearly taking food from a source and bringing it back to the nest, one could wrongly assume that there was an intelligent controller “directing” the ants towards getting the food and bringing it back to the nest. This assumption would be completely wrong, because it presumes a controller with a premeditated plan directing the ants to complete the task. In actuality the ants have some relatively simple programming and social interaction on their side. Essentially we can think of every ant as knowing about food, knowing how to get back home, knowing how to leave scent trails, and knowing how to follow scent trails. When they run into food they combine these simple programs to gather the food and return home while leaving scent marks. If one adds a randomly distributed set of ants across the field, then it is probable that many ants will run across the scent trail, follow it back to the food, and return to the nest while strengthening the trail. Repeat this process over many ants and an organized ant trail emerges. So in the natural world this is a case of organization being an emergent property of time and social interaction, and not something thought of a priori.
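The ant story can be tried out as a toy simulation. What follows is only a rough sketch under my own assumptions, not a model of real ants: the grid size, the nest and food positions, the evaporation rate and every function name are hypothetical. One extra assumption does real work here: a returning ant leaves a scent that gets weaker with each step away from the food, so that a wandering ant can tell which way along the trail the food lies by always stepping toward the stronger scent. No rule mentions a trail, and no controller exists anywhere in the code.

```python
import random

SIZE = 15                        # field is a SIZE x SIZE grid (hypothetical)
NEST, FOOD = (1, 1), (12, 12)    # hypothetical positions
EVAPORATE = 0.99                 # fraction of scent surviving each tick
FADE = 0.90                      # scent dropped weakens per step from the food


def step_toward(pos, target):
    """Move one cell (diagonals allowed) closer to target."""
    (x, y), (tx, ty) = pos, target
    return (x + (tx > x) - (tx < x), y + (ty > y) - (ty < y))


def neighbors(pos):
    """All in-bounds cells adjacent to pos."""
    x, y = pos
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx or dy) and 0 <= x + dx < SIZE and 0 <= y + dy < SIZE]


def simulate(num_ants=40, ticks=3000, seed=0):
    """Run the colony; return (food deliveries, scent strength per cell)."""
    rng = random.Random(seed)
    scent = {}
    ants = [{"pos": (rng.randrange(SIZE), rng.randrange(SIZE)),
             "carrying": False, "drop": 0.0} for _ in range(num_ants)]
    delivered = 0
    for _ in range(ticks):
        for ant in ants:
            if ant["carrying"]:
                # Rules: know the way home, and leave scent while walking.
                scent[ant["pos"]] = scent.get(ant["pos"], 0.0) + ant["drop"]
                ant["drop"] *= FADE
                ant["pos"] = step_toward(ant["pos"], NEST)
                if ant["pos"] == NEST:
                    ant["carrying"] = False
                    delivered += 1
            else:
                # Rule: follow stronger scent if there is any, else wander.
                options = neighbors(ant["pos"])
                best = max(options, key=lambda cell: scent.get(cell, 0.0))
                if scent.get(best, 0.0) > scent.get(ant["pos"], 0.0):
                    ant["pos"] = best
                else:
                    ant["pos"] = rng.choice(options)
                # Rule: know food when you run into it.
                if ant["pos"] == FOOD:
                    ant["carrying"] = True
                    ant["drop"] = 1.0
        for cell in scent:       # scent slowly evaporates everywhere
            scent[cell] *= EVAPORATE
    return delivered, scent
```

Calling `delivered, scent = simulate()` and looking at the keys of `scent` shows where scent was laid down: only on the line of cells between the food and the nest, even though every ant started at a random spot and followed nothing but its own local rules.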
Another reason why premeditated application of structure fails the giggle test: there is an entire generation of workers (almost 80 million) entering the workforce, and they have been trained by Google and others in the use of modern unstructured information reporting tools such as blogs, RSS, wikis, Digg, etc. In essence, it is through the social interaction between these people that their structure emerges. So my point is that we as IT professionals need to start bringing these tools into our companies and teams today, so that we can start speaking early to these newcomers to the workforce. I also firmly believe that as we add these tools to our bag of tricks, we will be solving the problem of working with unstructured information and remedying the failed experiment of content management.
Comments (4)
Asim on 16 Jul 2008 at 7:52 am
Michael – interesting hypothesis, you may be on to something. Seems chaos theory may be implied in your assessment of the workplace. Order resulting (not by design, generally by accident or inertia) from workflow randomness. Technology products may and will have a hard time being force-fit into this scenario. I agree IT pros need to be cognizant of the new workforce and their methods of communication and behavior. Kids today, whaddaya gonna do…..
mhay on 17 Jul 2008 at 9:07 am
Bingo, you got my point exactly. The next generation of workers is already rethinking what kinds of technology will be used in the workplace. Further other things like peer to peer trust relationships are an integral part of social networking technologies and people that make use of them. Emergence of order from Kaos (pun intended) is something we will all have to cope with.
Christophe Bertrand » Blog Archive » Vote for Joe the Storage Admin on 03 Nov 2008 at 1:48 pm
[...] First of all, Joe is a storage guy (or gal) who’s getting yelled at by the server people because he can’t provision LUN fast enough. He used to have a few Terabytes under management only a few years ago, and what he used to provision in a year is now what he provisions in a month… sometime a week. He’s over-burdened, over-taxed (just like the plumber guy – although that was not really true), under-appreciated. One thing he knows for sure is that there’s going to be more and more data to manage, structured and unstructured, and that his boss’s boss has limited budget to buy more of anything, let alone enough to fund additional headcount. [...]
Michael Hay » Blog Archive » How is HCAP being used? on 28 Apr 2009 at 3:07 pm
[...] activities (not surprisingly as we’ve talked about that in the past with respect to the FRCP and other fun compliancy things) moving content to this less costly tier, reducing backup windows [...]


